[jira] [Closed] (HUDI-6022) The method param `instantTime` of org.apache.hudi.table.action.commit.BaseFlinkCommitActionExecutor#handleUpsertPartition is redundant
[ https://issues.apache.org/jira/browse/HUDI-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6022.
Fix Version/s: 0.14.0
Resolution: Fixed

Fixed via master branch: 9288fdc456f9a4215d32908756a4ddaee18abfc4

> The method param `instantTime` of
> org.apache.hudi.table.action.commit.BaseFlinkCommitActionExecutor#handleUpsertPartition
> is redundant
> --
>
> Key: HUDI-6022
> URL: https://issues.apache.org/jira/browse/HUDI-6022
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jianhui Dong
> Priority: Major
> Labels: easyfix, pull-request-available
> Fix For: 0.14.0
>
> We have stored the `instantTime` in the superclass BaseActionExecutor, and
> there's no need to keep the method param `instantTime`; it's preferable to
> remove it to make the code cleaner.
> {code:java}
> protected Iterator<List<WriteStatus>> handleUpsertPartition(
>     String instantTime,
>     String partitionPath,
>     String fileIdHint,
>     BucketType bucketType,
>     Iterator<HoodieRecord<T>> recordItr) {
>   try {
>     if (this.writeHandle instanceof HoodieCreateHandle) {
>       // During one checkpoint interval, an insert record could also be updated,
>       // for example, for an operation sequence of a record:
>       //    I, U,   | U, U
>       //    - batch1 - | - batch2 -
>       // the first batch (batch1) operation triggers an INSERT bucket,
>       // the second batch (batch2) tries to reuse the same bucket
>       // and append instead of UPDATE.
>       return handleInsert(fileIdHint, recordItr);
>     } else if (this.writeHandle instanceof HoodieMergeHandle) {
>       return handleUpdate(partitionPath, fileIdHint, recordItr);
>     } else {
>       switch (bucketType) {
>         case INSERT:
>           return handleInsert(fileIdHint, recordItr);
>         case UPDATE:
>           return handleUpdate(partitionPath, fileIdHint, recordItr);
>         default:
>           throw new AssertionError();
>       }
>     }
>   } catch (Throwable t) {
>     String msg = "Error upsetting bucketType " + bucketType + " for partition :" + partitionPath;
>     LOG.error(msg, t);
>     throw new HoodieUpsertException(msg, t);
>   }
> } {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
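The refactor itself is simple to picture in isolation: the instant time is already held by the base executor, so subclass methods can read the field instead of taking a parameter. A minimal standalone sketch of that pattern (class and method names here are illustrative stand-ins, not Hudi's actual types):

```java
// Sketch of the HUDI-6022 refactor: the instant time lives in the base
// class, so the subclass method no longer needs it as a parameter.
// These classes are toy stand-ins for illustration only.
abstract class BaseActionExecutorSketch {
  protected final String instantTime; // stored once in the superclass

  BaseActionExecutorSketch(String instantTime) {
    this.instantTime = instantTime;
  }
}

class FlinkCommitExecutorSketch extends BaseActionExecutorSketch {
  FlinkCommitExecutorSketch(String instantTime) {
    super(instantTime);
  }

  // Before: handleUpsertPartition(String instantTime, String partitionPath, ...)
  // After: the redundant parameter is gone; the inherited field is used.
  String handleUpsertPartition(String partitionPath) {
    return "upsert partition " + partitionPath + " at instant " + instantTime;
  }
}

class RefactorDemo {
  public static void main(String[] args) {
    FlinkCommitExecutorSketch exec = new FlinkCommitExecutorSketch("20230403120000");
    System.out.println(exec.handleUpsertPartition("p1"));
  }
}
```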
[hudi] branch master updated (5d5658347ad -> 9288fdc456f)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 5d5658347ad [HUDI-5983] Improve loading data via cloud store incr source (#8290)
add 9288fdc456f [HUDI-6022] Remove redundant method param of BaseFlinkCommitActionExecutor (#8363)

No new revisions were added by this update.

Summary of changes:
.../apache/hudi/table/action/commit/BaseFlinkCommitActionExecutor.java | 2 --
1 file changed, 2 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #8363: [HUDI-6022] Remove redundant method param of BaseFlinkCommitActionExec…
danny0405 merged PR #8363: URL: https://github.com/apache/hudi/pull/8363 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao closed pull request #8322: [WIP]spark should pass InstantRange to incremental query for log files
xiarixiaoyao closed pull request #8322: [WIP]spark should pass InstantRange to incremental query for log files URL: https://github.com/apache/hudi/pull/8322
[GitHub] [hudi] hudi-bot commented on pull request #8375: [MINOR]Remove the redundancy config
hudi-bot commented on PR #8375:
URL: https://github.com/apache/hudi/pull/8375#issuecomment-1495442490

## CI report:

* 3a3da94e83aa8d193a7a7351e4c113999a8197b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16112)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8351: [HUDI-6013] Support database name for meta sync in bootstrap
hudi-bot commented on PR #8351:
URL: https://github.com/apache/hudi/pull/8351#issuecomment-1495442311

## CI report:

* 62cce26c004b5dabd45271bda4141a730ddad6cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16052)
* 1b46aa826f2f6733595fa26461aa5fa2ef00199d UNKNOWN
[GitHub] [hudi] danny0405 commented on issue #8366: [SUPPORT] Flink streaming write to Hudi table using data stream API java.lang.NoClassDefFoundError:
danny0405 commented on issue #8366: URL: https://github.com/apache/hudi/issues/8366#issuecomment-1495439988

Which class is missing here?
[jira] [Updated] (HUDI-5955) Incremental clean does not work with archived commits
[ https://issues.apache.org/jira/browse/HUDI-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-5955:
Summary: Incremental clean does not work with archived commits (was: fix incremental clean not work cause by archive)

> Incremental clean does not work with archived commits
> --
>
> Key: HUDI-5955
> URL: https://issues.apache.org/jira/browse/HUDI-5955
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: HBG
> Priority: Major
> Labels: pull-request-available
[GitHub] [hudi] danny0405 commented on a diff in pull request #8373: [HUDI-5955] Incremental clean does not work with archived commits
danny0405 commented on code in PR #8373:
URL: https://github.com/apache/hudi/pull/8373#discussion_r1156802426

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:

@@ -165,7 +165,8 @@
   HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils
       .deserializeHoodieCleanMetadata(hoodieTable.getActiveTimeline().getInstantDetails(lastClean.get()).get());
   if ((cleanMetadata.getEarliestCommitToRetain() != null)
-      && (cleanMetadata.getEarliestCommitToRetain().length() > 0)) {
+      && (cleanMetadata.getEarliestCommitToRetain().length() > 0)
+      && !hoodieTable.getActiveTimeline().isBeforeTimelineStarts(cleanMetadata.getEarliestCommitToRetain())) {
     return getPartitionPathsForIncrementalCleaning(cleanMetadata, instantToRetain);

Review Comment: Nice catch, can we write a UT if possible?
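The guard being added above only allows incremental cleaning while the previously recorded "earliest commit to retain" is still on the active timeline; once it has been archived, the planner must fall back to a full scan. A toy model of that decision, with illustrative names rather than Hudi's actual API:

```java
import java.util.List;

// Toy model of the HUDI-5955 guard. Instant times are compared as strings,
// mirroring Hudi's lexicographically ordered timestamps. Names are
// illustrative stand-ins, not the real CleanPlanner/HoodieTimeline API.
class CleanPlannerSketch {
  private final List<String> activeInstants; // sorted instants still on the active timeline

  CleanPlannerSketch(List<String> activeInstants) {
    this.activeInstants = activeInstants;
  }

  // Rough analogue of isBeforeTimelineStarts: true when the instant is
  // older than everything still on the active timeline (i.e. archived).
  boolean isBeforeTimelineStarts(String instant) {
    return !activeInstants.isEmpty() && instant.compareTo(activeInstants.get(0)) < 0;
  }

  String choosePlanMode(String earliestCommitToRetain) {
    if (earliestCommitToRetain != null
        && !earliestCommitToRetain.isEmpty()
        && !isBeforeTimelineStarts(earliestCommitToRetain)) {
      return "incremental";
    }
    return "full-scan"; // commit archived (or never recorded): incremental clean is unsafe
  }
}

class CleanPlanDemo {
  public static void main(String[] args) {
    CleanPlannerSketch planner = new CleanPlannerSketch(List.of("0005", "0006", "0007"));
    System.out.println(planner.choosePlanMode("0006")); // still active
    System.out.println(planner.choosePlanMode("0003")); // archived
  }
}
```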
[GitHub] [hudi] hudi-bot commented on pull request #8375: [MINOR]Remove the redundancy config
hudi-bot commented on PR #8375:
URL: https://github.com/apache/hudi/pull/8375#issuecomment-1495434897

## CI report:

* 3a3da94e83aa8d193a7a7351e4c113999a8197b0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102:
URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495434217

## CI report:

* a66c8ec83a1a8e75d1e28c3e7444b7c3306049a6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16106)
[GitHub] [hudi] danny0405 commented on issue #8371: [SUPPORT] Flink cant read metafield '_hoodie_commit_time'
danny0405 commented on issue #8371: URL: https://github.com/apache/hudi/issues/8371#issuecomment-1495432965

Seems like a bug, could you file a PR and fix it?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156783371

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:

@@ -168,4 +171,36 @@
   }
   return foundRecordKeys;
 }

+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     * - tagged existing records whose partition paths are not to be updated (Set B)
+     * - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+        .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment: synced up directly. Let's add javadocs to call this out, i.e. why we should strictly favor the update record and not the insert, so that anyone looking to make changes in this code block is aware of all the nuances.
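The set logic under discussion is easier to follow with plain collections: partition-update inserts (Set A) are deduped by key, then dropped whenever the same key already appears among the pass-through records (the left-anti join against Set B). A minimal sketch with simplified stand-in types (the real code operates on `HoodieData` with Spark-style transforms):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for a tagged record; not Hudi's actual types.
record TaggedRecord(String key, String partition, boolean isPartitionUpdateInsert) {}

class DedupSketch {
  // Toy version of the HUDI-5968 dedup: keep Sets B and C as-is, dedupe
  // the partition-update inserts (Set A) by key, and drop an insert when
  // an update for the same key already exists (A left-anti join B).
  static List<TaggedRecord> dedupForPartitionUpdates(List<TaggedRecord> tagged) {
    List<TaggedRecord> kept = new ArrayList<>();
    Set<String> keptKeys = new HashSet<>();
    for (TaggedRecord r : tagged) {          // Sets B and C pass through unchanged
      if (!r.isPartitionUpdateInsert()) {
        kept.add(r);
        keptKeys.add(r.key());
      }
    }
    Map<String, TaggedRecord> inserts = new LinkedHashMap<>(); // Set A, deduped by key
    for (TaggedRecord r : tagged) {
      if (r.isPartitionUpdateInsert() && !keptKeys.contains(r.key())) {
        inserts.putIfAbsent(r.key(), r);     // left-anti join: the update wins
      }
    }
    kept.addAll(inserts.values());
    return kept;
  }
}
```

For the p1 -> p2 -> p1 example from the comment, the insert to p1 in Set A is dropped and only the update in Set B survives.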
[GitHub] [hudi] danny0405 commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
danny0405 commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1156776979

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/ComplexAvroKeyGenerator.java:

@@ -44,6 +48,9 @@ public ComplexAvroKeyGenerator(TypedProperties props) {

 @Override
 public String getRecordKey(GenericRecord record) {
+    if (autoGenerateRecordKeys()) {
+      return StringUtils.EMPTY_STRING;
+    }

Review Comment: We already have `getRecordKey` and `getPartitionPath` as the public API; if you want to fix the `HoodieKey`, shouldn't the `HoodieKey getKey()` be fixed instead?
[GitHub] [hudi] huangxiaopingRD commented on a diff in pull request #8351: [HUDI-6013] Support database name for meta sync in bootstrap
huangxiaopingRD commented on code in PR #8351:
URL: https://github.com/apache/hudi/pull/8351#discussion_r1156774622

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala:

@@ -70,4 +70,16 @@
     throw new SparkException(s"Unsupported identifier $table")
   }
 }

+  def getHoodieDatabaseAndTable(table: String): (String, Option[String]) = {
+    val seq: Seq[String] = table.split('.')

Review Comment: done
[GitHub] [hudi] c-f-cooper opened a new pull request, #8375: [MINOR]Remove the redundancy config
c-f-cooper opened a new pull request, #8375:
URL: https://github.com/apache/hudi/pull/8375

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] danny0405 commented on a diff in pull request #8351: [HUDI-6013] Support database name for meta sync in bootstrap
danny0405 commented on code in PR #8351:
URL: https://github.com/apache/hudi/pull/8351#discussion_r1156765591

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala:

@@ -70,4 +70,16 @@
     throw new SparkException(s"Unsupported identifier $table")
   }
 }

+  def getHoodieDatabaseAndTable(table: String): (String, Option[String]) = {
+    val seq: Seq[String] = table.split('.')

Review Comment: `getHoodieDatabaseAndTable` -> `getTableIdentifier`; the returned val should be a string array.
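The reviewer's suggestion amounts to parsing a possibly database-qualified table name into its parts and returning them as an array. A minimal sketch of that idea (method name follows the suggestion in the review; the real signature and validation in Hudi may differ):

```java
// Sketch of the suggested getTableIdentifier helper:
// "db.tbl" -> ["db", "tbl"], "tbl" -> ["tbl"].
// Illustrative only; not the actual HoodieCLIUtils implementation.
class TableIdentifierSketch {
  static String[] getTableIdentifier(String table) {
    String[] parts = table.split("\\."); // '.' is a regex metachar, so escape it
    if (parts.length == 0 || parts.length > 2) {
      throw new IllegalArgumentException("Unsupported identifier " + table);
    }
    return parts;
  }
}

class TableIdDemo {
  public static void main(String[] args) {
    System.out.println(String.join(" / ", TableIdentifierSketch.getTableIdentifier("db.tbl")));
  }
}
```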
[GitHub] [hudi] hudi-bot commented on pull request #8373: [HUDI-5955] fix incremental clean not work cause by archive
hudi-bot commented on PR #8373:
URL: https://github.com/apache/hudi/pull/8373#issuecomment-1495391829

## CI report:

* 5c05dcc35fa86f5ec823efb52cb3fc48416f4846 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16105)
[GitHub] [hudi] bigdata-spec commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
bigdata-spec commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495385255

> @bigdata-spec I think at this point you have to use only a supported HMS version. @huangxiaopingRD can comment more.

@ad1happy2go what does "HMS version" mean here? Does it need to fit Hudi or Spark? Does Hudi support HMS version 2.1.1-cdh6.3.2?
[GitHub] [hudi] codope commented on a diff in pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
codope commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1156744308

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:

@@ -180,6 +180,22 @@ case class HoodieFileIndex(spark: SparkSession,
   }
 }

+  /**
+   * In the fast bootstrap read code path, it gets the file status for the bootstrap base files instead of
+   * skeleton files.
+   */
+  private def getBaseFileStatus(baseFiles: mutable.Buffer[HoodieBaseFile]): mutable.Buffer[FileStatus] = {
+    if (shouldFastBootstrap) {
+      return baseFiles.map(f =>
+        if (f.getBootstrapBaseFile.isPresent) {
+          f.getBootstrapBaseFile.get().getFileStatus

Review Comment: Why do we need to guard this by the `shouldFastBootstrap` conditional? Shouldn't we always return the source file status if it's present?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156745651

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:

@@ -168,4 +171,36 @@
   }
   return foundRecordKeys;
 }

+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     * - tagged existing records whose partition paths are not to be updated (Set B)
+     * - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+        .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment: does it matter if we favor insert or an update here? If yes, I feel it's better to favor insert and drop the update, so that we maintain the behavior across the board, i.e. whenever a record migrates from one partition to another, we will ignore whatever is in storage and do an insert to the incoming partition. To maintain similar semantics, thinking if we should favor the insert record over the update.
[GitHub] [hudi] codope commented on a diff in pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
codope commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1156740671

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:

@@ -270,6 +271,21 @@ object DefaultSource {
   }
 }

+  private def resolveHoodieBootstrapRelation(sqlContext: SQLContext,
+                                             globPaths: Seq[Path],
+                                             userSchema: Option[StructType],
+                                             metaClient: HoodieTableMetaClient,
+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {

Review Comment: I think we should do away with the config and rely on the condition here to decide whether or not to use the fast read path (which should be done by default). Wdyt?

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala:

@@ -807,7 +807,9 @@ class TestHoodieSparkSqlWriter {
     .option("hoodie.insert.shuffle.parallelism", "4")
     .mode(SaveMode.Append).save(tempBasePath)

-  val currentCommits = spark.read.format("hudi").load(tempBasePath).select("_hoodie_commit_time").take(1).map(_.getString(0))
+  val currentCommits = spark.read.format("hudi")
+    .option(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key, "false")

Review Comment: Need more tests. Setting it to `false` does not test the changed code path.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:

@@ -180,6 +180,22 @@ case class HoodieFileIndex(spark: SparkSession,
   }
 }

+  /**
+   * In the fast bootstrap read code path, it gets the file status for the bootstrap base files instead of
+   * skeleton files.
+   */
+  private def getBaseFileStatus(baseFiles: mutable.Buffer[HoodieBaseFile]): mutable.Buffer[FileStatus] = {
+    if (shouldFastBootstrap) {
+      return baseFiles.map(f =>
+        if (f.getBootstrapBaseFile.isPresent) {
+          f.getBootstrapBaseFile.get().getFileStatus

Review Comment: Why do we need to guard this by the `shouldFastBootstrap` conditional? Shouldn't we always return the source file status if it's present?

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:

@@ -83,10 +83,18 @@ class SparkHoodieTableFileIndex(spark: SparkSession,

 /**
  * Get the schema of the table.
  */
-  lazy val schema: StructType = schemaSpec.getOrElse({
-    val schemaUtil = new TableSchemaResolver(metaClient)
-    AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
-  })
+  lazy val schema: StructType = if (shouldFastBootstrap) {
+    StructType(rawSchema.fields.filterNot(f => HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name)))

Review Comment: just import the static member `HOODIE_META_COLUMNS_WITH_OPERATION` instead of importing the full `HoodieRecord`.
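The point of the `getBaseFileStatus` discussion above is a simple preference rule: in the data-only bootstrap read path, list the bootstrap source (original) file rather than the skeleton file whenever a source file exists. A toy Java sketch of that mapping, with stand-in types (the real code works on `HoodieBaseFile`/`FileStatus` in Scala):

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Stand-in for Hudi's HoodieBaseFile: a skeleton file path plus an
// optional bootstrap source file path. Illustrative only.
record BaseFileSketch(String skeletonPath, Optional<String> bootstrapBasePath) {}

class BootstrapListingSketch {
  // Toy model of the review point in HUDI-5998: under the fast bootstrap
  // read path, prefer the bootstrap source file over the skeleton file
  // whenever one is present.
  static List<String> getBaseFilePaths(List<BaseFileSketch> baseFiles, boolean fastBootstrap) {
    return baseFiles.stream()
        .map(f -> (fastBootstrap && f.bootstrapBasePath().isPresent())
            ? f.bootstrapBasePath().get()   // read the source file directly
            : f.skeletonPath())             // otherwise the skeleton/base file
        .collect(Collectors.toList());
  }
}
```

The reviewer's question is whether the `fastBootstrap` flag in this decision is needed at all, since falling back to the skeleton path already handles the absent-source case.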
[GitHub] [hudi] ad1happy2go commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
ad1happy2go commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495363471

@bigdata-spec I guess at this point you have to use the supported HMS version only. @huangxiaopingRD can comment more.
[GitHub] [hudi] bigdata-spec commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
bigdata-spec commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495349137

> `spark.sql.hive.metastore.version` is not supported in hudi. hudi not compatible with all hive metastore version like Spark.

So, what can I do to deal with this error?
[GitHub] [hudi] hudi-bot commented on pull request #8374: [HUDI-6030] Cleans the ckp meta while the JM restarts
hudi-bot commented on PR #8374:
URL: https://github.com/apache/hudi/pull/8374#issuecomment-1495349020

## CI report:

* 7c8e63752c3f709c3102a5c412c1ec9c40846b90 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16111)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495348583

## CI report:

* 572189472623065f460bd18436fb3b21602449af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
* 711df161776bfbe4f66cb04310eb15ccc0069716 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16110)
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495343022

## CI report:

* 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104)
* 7624300eb0d7205a4924783606226bbdfd49ad5a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109)
[GitHub] [hudi] hudi-bot commented on pull request #8374: [HUDI-6030] Cleans the ckp meta while the JM restarts
hudi-bot commented on PR #8374:
URL: https://github.com/apache/hudi/pull/8374#issuecomment-1495343365

## CI report:

* 7c8e63752c3f709c3102a5c412c1ec9c40846b90 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495342734

## CI report:

* 572189472623065f460bd18436fb3b21602449af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
* 711df161776bfbe4f66cb04310eb15ccc0069716 UNKNOWN
[GitHub] [hudi] codope commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
codope commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156725280

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:

@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
     .defaultValue("true")
     .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");

+  public static final ConfigProperty GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment: ok, let's keep it this way. we can revisit later if necessary.
[GitHub] [hudi] bvaradar commented on pull request #5165: [HUDI-3742] Enable parquet enableVectorizedReader for spark inc query to improve peformance
bvaradar commented on PR #5165: URL: https://github.com/apache/hudi/pull/5165#issuecomment-1495341234 @xiarixiaoyao : Can you address the comments in the PR ? @garyli1019 : Any other concern about having vectorization for incr query for MOR (with default turned off ? )
[GitHub] [hudi] hudi-bot commented on pull request #8367: [HUDI-6023] HotFix in HoodieDynamicBoundedBloomFilter with refactor a…
hudi-bot commented on PR #8367: URL: https://github.com/apache/hudi/pull/8367#issuecomment-1495339250 ## CI report: * 38951b92ba068d155efc85b1b38ce860bf3551d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16091) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16102)
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495339170 ## CI report: * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104) * 7624300eb0d7205a4924783606226bbdfd49ad5a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8335: [HUDI-6009] Let the jetty server in TimelineService create daemon threads
hudi-bot commented on PR #8335: URL: https://github.com/apache/hudi/pull/8335#issuecomment-1495339120 ## CI report: * f5ffa39e26536c54bcdd7d29b96b8ef242203b3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16096) * 9bcbb85e4b2bb803e03900b8f01c938833bb1185 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16108)
[GitHub] [hudi] danny0405 commented on issue #8060: [SUPPORT] An instant exception occurs when the flink job is restarted
danny0405 commented on issue #8060: URL: https://github.com/apache/hudi/issues/8060#issuecomment-1495338029 Filed a fix in: https://github.com/apache/hudi/pull/8374
[jira] [Updated] (HUDI-6030) Cleans the ckp meta while the JM restarts
[ https://issues.apache.org/jira/browse/HUDI-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6030: - Labels: pull-request-available (was: ) > Cleans the ckp meta while the JM restarts > - > > Key: HUDI-6030 > URL: https://issues.apache.org/jira/browse/HUDI-6030 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 opened a new pull request, #8374: [HUDI-6030] Cleans the ckp meta while the JM restarts
danny0405 opened a new pull request, #8374: URL: https://github.com/apache/hudi/pull/8374 ### Change Logs We have received several bug reports since #7620 (for example: https://github.com/apache/hudi/issues/8060). This patch reverts the `CkpMetadata` changes; the write tasks report the write metadata events as before, and the coordinator decides whether to re-commit these metadata stats. ### Impact Fixes the problem introduced by #7620. ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-6030) Cleans the ckp meta while the JM restarts
Danny Chen created HUDI-6030: Summary: Cleans the ckp meta while the JM restarts Key: HUDI-6030 URL: https://issues.apache.org/jira/browse/HUDI-6030 Project: Apache Hudi Issue Type: Improvement Components: flink Reporter: Danny Chen Fix For: 0.13.1
[GitHub] [hudi] bvaradar commented on pull request #7748: [WIP][HUDI-5560] Make Consistent hash index Bucket Resizing more available…
bvaradar commented on PR #7748: URL: https://github.com/apache/hudi/pull/7748#issuecomment-1495325226 @fengjian428 : Is this RFC ready for review ?
[GitHub] [hudi] bvaradar commented on pull request #7962: [HUDI-5801] Speed metaTable initializeFileGroups
bvaradar commented on PR #7962: URL: https://github.com/apache/hudi/pull/7962#issuecomment-1495324171 @loukey-lj : Have you seen slowness in metatable initialization in practice before? For cases like the PARTITION_NAME_FILES metadata, the number of file-groups is 1. Running under the engine context would result in more overhead for such a case. cc @nsivabalan
[GitHub] [hudi] hudi-bot commented on pull request #8335: [HUDI-6009] Let the jetty server in TimelineService create daemon threads
hudi-bot commented on PR #8335: URL: https://github.com/apache/hudi/pull/8335#issuecomment-1495315767 ## CI report: * f5ffa39e26536c54bcdd7d29b96b8ef242203b3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16096) * 9bcbb85e4b2bb803e03900b8f01c938833bb1185 UNKNOWN
[jira] [Created] (HUDI-6029) Rollback may omit invalid files when commitMetadata is not completed for MOR
lei w created HUDI-6029: --- Summary: Rollback may omit invalid files when commitMetadata is not completed for MOR Key: HUDI-6029 URL: https://issues.apache.org/jira/browse/HUDI-6029 Project: Apache Hudi Issue Type: Bug Reporter: lei w

Currently, using listingBasedRollbackStrategy may omit invalid files when the commitMetadata is not completed. The problem arises because the strategy compares the instantToRollback timestamp against the baseCommitTime of each log file to judge whether the log files are valid:
{code:java}
// commit is the instant time which should be rolled back;
// in most cases the baseCommitTime may not equal commit
(path) -> {
  if (path.toString().endsWith(basefileExtension)) {
    String fileCommitTime = FSUtils.getCommitTime(path.getName());
    return commit.equals(fileCommitTime);
  } else if (FSUtils.isLogFile(path)) {
    // Since the baseCommitTime is the only commit for new log files, it's okay here
    String fileCommitTime = FSUtils.getBaseCommitTimeFromLogPath(path);
    return commit.equals(fileCommitTime);
  }
  return false;
};
{code}
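The report above can be illustrated with a minimal, self-contained sketch. Note this uses simplified file names and a hypothetical `baseCommitTime` parser, not Hudi's actual `FSUtils` helpers: a log file appended to a file slice whose base commit predates the rolled-back instant is never matched by the filter, even though it may contain blocks written by that instant.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RollbackFilterSketch {
    // Hypothetical parser for simplified names like ".f1_20230401.log.1";
    // Hudi's real log-file naming and FSUtils helpers are more involved.
    public static String baseCommitTime(String logFileName) {
        return logFileName.substring(logFileName.indexOf('_') + 1, logFileName.indexOf(".log"));
    }

    // Mirrors the predicate quoted in the report: keep only log files whose
    // base commit time equals the instant being rolled back.
    public static List<String> matchedForRollback(String commitToRollback, List<String> logFiles) {
        return logFiles.stream()
                .filter(f -> commitToRollback.equals(baseCommitTime(f)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two log files: the second was written during commit 20230404,
        // but appended to a file slice whose base commit is 20230401.
        List<String> logFiles = Arrays.asList(".f1_20230404.log.1", ".f2_20230401.log.2");
        List<String> matched = matchedForRollback("20230404", logFiles);
        // Only the first file matches; the 20230401 log file is omitted even
        // though it may hold blocks written by the rolled-back commit.
        System.out.println(matched);
    }
}
```

This is why the ticket argues that, without completed commitMetadata, matching purely on timestamps can silently skip files that should be rolled back.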
[GitHub] [hudi] bvaradar commented on pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.
bvaradar commented on PR #6705: URL: https://github.com/apache/hudi/pull/6705#issuecomment-1495309319 @watermelon12138 : Pinging to see if you are interested in updating this PR ?
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495307023 ## CI report: * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104)
[GitHub] [hudi] bvaradar commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row
bvaradar commented on PR #7956: URL: https://github.com/apache/hudi/pull/7956#issuecomment-1495306934 @KnightChess : I am not sure I understand why this is only a problem with bulkInsert as row. Is the problem that, when doing MDT init, files which are not committed (empty/partial) are being added (see HoodieBackedTableMetadataWriter.listAllPartitions)? @prashantwason : Can you let me know if I am missing something.
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495306511 ## CI report: * 572189472623065f460bd18436fb3b21602449af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
[GitHub] [hudi] hudi-bot commented on pull request #7173: [HUDI-5189] Make HiveAvroSerializer compatible with hive3
hudi-bot commented on PR #7173: URL: https://github.com/apache/hudi/pull/7173#issuecomment-1495305780 ## CI report: * 363aad76c3a145bdd38aa83488efdaa6d5ac1d82 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16012) * 2ff867c31714270d57518a0c7ca30c7ee98ce612 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16107)
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8328: [HUDI-6002] Add JsonSchemaKafkaSource to handle json schema payload
rmahindra123 commented on code in PR #8328: URL: https://github.com/apache/hudi/pull/8328#discussion_r1156689995 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonSchemaKafkaSource.java: ## @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities.sources; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.utilities.UtilHelpers; +import org.apache.hudi.utilities.exception.HoodieSourcePostProcessException; +import org.apache.hudi.utilities.ingestion.HoodieIngestionMetrics; +import org.apache.hudi.utilities.schema.SchemaProvider; +import org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen; +import org.apache.hudi.utilities.sources.processor.JsonKafkaSourcePostProcessor; + +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.node.ObjectNode; +import org.apache.kafka.clients.consumer.ConsumerRecord; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.SparkSession; +import org.apache.spark.streaming.kafka010.KafkaUtils; +import org.apache.spark.streaming.kafka010.LocationStrategies; +import org.apache.spark.streaming.kafka010.OffsetRange; + +import java.io.IOException; +import java.util.LinkedHashMap; +import java.util.LinkedList; +import java.util.List; + +import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_OFFSET_COLUMN; +import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_PARTITION_COLUMN; +import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_TIMESTAMP_COLUMN; + +public class JsonSchemaKafkaSource extends JsonKafkaSource { Review Comment: +1 looks like a lot of repetitive code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102: URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495279686 ## CI report: * be88d99070504f75c88bfcf48b3c078ca93a35df Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16097) * a66c8ec83a1a8e75d1e28c3e7444b7c3306049a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16106)
[GitHub] [hudi] bvaradar commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert
bvaradar commented on PR #7834: URL: https://github.com/apache/hudi/pull/7834#issuecomment-1495279011 @wuwenchi : Can you look at the PR comments and address them when you get a chance.
[GitHub] [hudi] hudi-bot commented on pull request #7173: [HUDI-5189] Make HiveAvroSerializer compatible with hive3
hudi-bot commented on PR #7173: URL: https://github.com/apache/hudi/pull/7173#issuecomment-1495278342 ## CI report: * 363aad76c3a145bdd38aa83488efdaa6d5ac1d82 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16012) * 2ff867c31714270d57518a0c7ca30c7ee98ce612 UNKNOWN
[GitHub] [hudi] codope commented on pull request #7942: [HUDI-5753] Add docs for record payload
codope commented on PR #7942: URL: https://github.com/apache/hudi/pull/7942#issuecomment-1495277107 > @codope is it possible you can provide an example to extend the payload for a customized option. Also, are there configs the user should consider that's provided out-of-the-box? If possible, can you specify all of them inline with the right class? @nfarah86 I have added a link to FAQ where there are more details on how to implement a custom payload. I have also removed the record merger API. Need to follow up with a separate doc or update this doc in a separate PR.
[GitHub] [hudi] hudi-bot commented on pull request #8373: [HUDI-5955] fix incremental clean not work cause by archive
hudi-bot commented on PR #8373: URL: https://github.com/apache/hudi/pull/8373#issuecomment-1495274194 ## CI report: * 5c05dcc35fa86f5ec823efb52cb3fc48416f4846 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16105)
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102: URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495273281 ## CI report: * be88d99070504f75c88bfcf48b3c078ca93a35df Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16097) * a66c8ec83a1a8e75d1e28c3e7444b7c3306049a6 UNKNOWN
[GitHub] [hudi] codope merged pull request #7985: [DOCS] Update clustering docs
codope merged PR #7985: URL: https://github.com/apache/hudi/pull/7985
[hudi] branch asf-site updated: [DOCS] Update clustering docs (#7985)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 76b212fed0a [DOCS] Update clustering docs (#7985) 76b212fed0a is described below commit 76b212fed0a766fe0a2edd4c04215bb52e718343 Author: Sagar Sumit AuthorDate: Tue Apr 4 08:22:31 2023 +0530 [DOCS] Update clustering docs (#7985) --- website/docs/clustering.md | 231 ++--- .../assets/images/clustering_small_files.gif | Bin 0 -> 668806 bytes website/static/assets/images/clustering_sort.gif | Bin 0 -> 735437 bytes 3 files changed, 159 insertions(+), 72 deletions(-) diff --git a/website/docs/clustering.md b/website/docs/clustering.md index 9e157de785b..d2ceb196d02 100644 --- a/website/docs/clustering.md +++ b/website/docs/clustering.md @@ -10,6 +10,17 @@ last_modified_at: Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing. In a data lake/warehouse, one of the key trade-offs is between ingestion speed and query performance. Data ingestion typically prefers small files to improve parallelism and make data available to queries as soon as possible. However, query performance degrades poorly with a lot of small files. Also, during ingestion, data is typically co-l [...] +## How is compaction different from clustering? + +Hudi is modeled like a log-structured storage engine with multiple versions of the data. +Particularly, [Merge-On-Read](/docs/table_types#merge-on-read-table) +tables in Hudi store data using a combination of base file in columnar format and row-based delta logs that contain +updates. Compaction is a way to merge the delta logs with base files to produce the latest file slices with the most +recent snapshot of data. 
Compaction helps to keep the query performance in check (larger delta log files would incur +longer merge times on query side). On the other hand, clustering is a data layout optimization technique. One can stitch +together small files into larger files using clustering. Additionally, data can be clustered by sort key so that queries +can take advantage of data locality. + ## Clustering Architecture At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit) to `0` to force new [...] @@ -22,13 +33,13 @@ Clustering table service can run asynchronously or synchronously adding a new ac -### Overall, there are 2 parts to clustering +### Overall, there are 2 steps to clustering 1. Scheduling clustering: Create a clustering plan using a pluggable clustering strategy. 2. Execute clustering: Process the plan using an execution strategy to create new files and replace old files. -### Scheduling clustering +### Schedule clustering Following steps are followed to schedule clustering. @@ -37,7 +48,7 @@ Following steps are followed to schedule clustering. 3. Finally, the clustering plan is saved to the timeline in an avro [metadata format](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieClusteringPlan.avsc). -### Running clustering +### Execute clustering 1. Read the clustering plan and get the ‘clusteringGroups’ that mark the file groups that need to be clustered. 2. For each group, we instantiate appropriate strategy class with strategyParams (example: sortColumns) and apply that strategy to rewrite the data. 
@@ -51,8 +62,147 @@ NOTE: Clustering can only be scheduled for tables / partitions not receiving any ![Clustering example](/assets/images/blog/clustering/example_perf_improvement.png) _Figure: Illustrating query performance improvements by clustering_ -### Setting up clustering -Inline clustering can be setup easily using spark dataframe options. See sample below +## Clustering Usecases + +### Batching small files + +As mentioned in the intro, streaming ingestion generally results in smaller files in your data lake. But having a lot of +such small files could lead to higher query latency. From our experience supporting community users, there are quite a +few users who are using Hudi just for small file handling capabilities. So, you could employ clustering to batch a lot +of such small files into larger ones. + +![Batching small files](/assets/images/clustering_small_files.gif) + +### Cluster by sort key + +Another classic problem in data lake is the arrival time vs event time prob
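The clustering docs in the commit above mention setting up inline clustering via Spark dataframe options. A minimal hedged sketch of such an options map follows; the keys are standard Hudi clustering configs, while the values and sort columns are illustrative placeholders, not recommendations:

```java
import java.util.HashMap;
import java.util.Map;

public class InlineClusteringOptions {
    // Builds a map of typical inline-clustering write options that could be
    // passed to a Spark datasource write on a Hudi table.
    public static Map<String, String> clusteringOptions() {
        Map<String, String> opts = new HashMap<>();
        // Turn on inline clustering: schedule and execute as part of the write.
        opts.put("hoodie.clustering.inline", "true");
        // Trigger clustering every 4 commits (illustrative cadence).
        opts.put("hoodie.clustering.inline.max.commits", "4");
        // Target ~1 GB output files, treating files under ~600 MB as "small".
        opts.put("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824");
        opts.put("hoodie.clustering.plan.strategy.small.file.limit", "629145600");
        // Sort rewritten data for locality (column names are placeholders).
        opts.put("hoodie.clustering.plan.strategy.sort.columns", "column1,column2");
        return opts;
    }

    public static void main(String[] args) {
        clusteringOptions().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each entry would typically be applied with `.option(key, value)` on the dataframe writer; consult the Hudi clustering configuration reference for defaults and additional strategy options.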
[GitHub] [hudi] hudi-bot commented on pull request #8373: [HUDI-5955] fix incremental clean not work cause by archive
hudi-bot commented on PR #8373: URL: https://github.com/apache/hudi/pull/8373#issuecomment-1495266885 ## CI report: * 5c05dcc35fa86f5ec823efb52cb3fc48416f4846 UNKNOWN
[GitHub] [hudi] bvaradar commented on a diff in pull request #6799: [HUDI-4920] fix PartialUpdatePayload cannot return deleted record in …
bvaradar commented on code in PR #6799: URL: https://github.com/apache/hudi/pull/6799#discussion_r1156663626

## hudi-common/src/test/java/org/apache/hudi/common/model/TestPartialUpdateAvroPayload.java:

@@ -155,8 +155,8 @@ public void testDeletedRecord() throws IOException {
 PartialUpdateAvroPayload payload1 = new PartialUpdateAvroPayload(record1, 0L);
 PartialUpdateAvroPayload payload2 = new PartialUpdateAvroPayload(delRecord1, 1L);
-assertArrayEquals(payload1.preCombine(payload2).recordBytes, payload2.recordBytes);
-assertArrayEquals(payload2.preCombine(payload1).recordBytes, payload2.recordBytes);
+assertArrayEquals(payload1.preCombine(payload2, schema, new Properties()).recordBytes, payload2.recordBytes);
+assertArrayEquals(payload2.preCombine(payload1, schema, new Properties()).recordBytes, payload2.recordBytes);

Review Comment: Can you add an explicit test-case for the deleted record case here during precombine? The test-case needs to check for the `_hoodie_is_deleted` flag in the returned record.

## hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java:

@@ -89,6 +89,8 @@
 */
 public class PartialUpdateAvroPayload extends OverwriteNonDefaultsWithLatestAvroPayload {
+ private boolean isPreCombining = false;

Review Comment: This member variable needs to be removed as it is no longer used.
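The semantics the requested test should pin down can be sketched independently of the Avro plumbing: the payload with the higher ordering value wins precombine, and when the winner is a delete, the surviving record must still carry the delete marker. The sketch below uses plain maps in place of Avro records and is not the actual Hudi payload API:

```java
import java.util.Map;

// Minimal model of precombine-with-delete: pick the payload with the higher
// ordering value (last-writer-wins); a delete "wins" only if it is the later
// write. The test then inspects _hoodie_is_deleted on the surviving record.
public class PreCombineSketch {
    static Map<String, Object> preCombine(Map<String, Object> a, long orderA,
                                          Map<String, Object> b, long orderB) {
        // On a tie, keep the first argument, mirroring "keep current" behavior.
        return orderA >= orderB ? a : b;
    }

    static boolean isDeleted(Map<String, Object> record) {
        return Boolean.TRUE.equals(record.get("_hoodie_is_deleted"));
    }
}
```

A test in this spirit would assert that combining an upsert (ordering 0) with a delete (ordering 1) yields a record whose delete flag is set, in both argument orders.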
[GitHub] [hudi] huangxiaopingRD commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
huangxiaopingRD commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495252052 `spark.sql.hive.metastore.version` is not supported in Hudi. Hudi is not compatible with every Hive metastore version the way Spark is.
[GitHub] [hudi] hbgstc123 commented on pull request #8232: [HUDI-5955] fix incremental clean not work caused by archive
hbgstc123 commented on PR #8232: URL: https://github.com/apache/hudi/pull/8232#issuecomment-1495239401 https://github.com/apache/hudi/pull/8373 I submitted a new PR that falls back to a full clean if an instant needed for incremental clean is archived.
[GitHub] [hudi] hbgstc123 opened a new pull request, #8373: [HUDI-5955] fix incremental clean not work cause by archive
hbgstc123 opened a new pull request, #8373: URL: https://github.com/apache/hudi/pull/8373

### Change Logs
The incremental timeline may miss some partitions if the instant after the "earliest retained instant" of the last completed clean plan is archived, so fall back to a full clean if the earliest instant to retain is before the start of the active timeline.

### Impact
no

### Risk level (write none, low medium or high below)
low

### Documentation Update
no

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
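The fallback rule described in the change log can be sketched as a small predicate. Hudi instants are timestamp strings that order lexicographically, so "the needed instant was archived" reduces to a string comparison; names here are illustrative, not the actual Hudi API:

```java
// Sketch of the clean-mode decision: if the instant we would need for an
// incremental clean sorts before the first instant still on the active
// timeline, it has been archived, so plan a full clean instead.
public class CleanPlanSketch {
    // Hypothetical helper, not the actual Hudi planner API.
    static boolean useFullClean(String earliestInstantToRetain, String firstActiveInstant) {
        // No prior clean recorded: nothing to be incremental against.
        if (earliestInstantToRetain == null) {
            return true;
        }
        return earliestInstantToRetain.compareTo(firstActiveInstant) < 0;
    }
}
```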
[GitHub] [hudi] LiJie20190102 commented on issue #8331: [SUPPORT] When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle
LiJie20190102 commented on issue #8331: URL: https://github.com/apache/hudi/issues/8331#issuecomment-1495234468 @ad1happy2go Should we stop SparkContext?
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495232941

## CI report:
* fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058)
* 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104)
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495227355

## CI report:
* fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058)
* 3c004c60160b06b0f4a7a00980c2013cf21af3c3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495226882

## CI report:
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16100)
* 572189472623065f460bd18436fb3b21602449af Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
[GitHub] [hudi] hudi-bot commented on pull request #8367: [HUDI-6023] HotFix in HoodieDynamicBoundedBloomFilter with refactor a…
hudi-bot commented on PR #8367: URL: https://github.com/apache/hudi/pull/8367#issuecomment-1495222704

## CI report:
* 38951b92ba068d155efc85b1b38ce860bf3551d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16091) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16102)
[GitHub] [hudi] hudi-bot commented on pull request #7881: [HUDI-5723] Automate and standardize enum configs
hudi-bot commented on PR #7881: URL: https://github.com/apache/hudi/pull/7881#issuecomment-1495222123

## CI report:
* c378a74c177a2f1a924609a44f0978ee347d272a UNKNOWN
* 6fd0ec68de1fc063cc3e79bea173e9f073d4517e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16099)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495222373

## CI report:
* 09d9feab5048d47a149f4088c23af9b5072250fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16077)
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16100)
* 572189472623065f460bd18436fb3b21602449af UNKNOWN
[jira] [Closed] (HUDI-5983) Improve loading data via cloud store incr source
[ https://issues.apache.org/jira/browse/HUDI-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-5983. Fix Version/s: 0.14.0 Assignee: Raymond Xu Resolution: Fixed > Improve loading data via cloud store incr source > - > > Key: HUDI-5983 > URL: https://issues.apache.org/jira/browse/HUDI-5983 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1, 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated (627b608e3eb -> 5d5658347ad)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 627b608e3eb [MINOR] Optimize code style (#8357)
add 5d5658347ad [HUDI-5983] Improve loading data via cloud store incr source (#8290)

No new revisions were added by this update.

Summary of changes:
 .../sources/GcsEventsHoodieIncrSource.java         | 36
 .../sources/S3EventsHoodieIncrSource.java          | 86 +++---
 .../sources/helpers/CloudObjectMetadata.java       | 27 --
 .../helpers/CloudObjectsSelectorCommon.java        | 88 +++---
 ...eDataFetcher.java => GcsObjectDataFetcher.java} | 14 +--
 ...sFetcher.java => GcsObjectMetadataFetcher.java} | 42 -
 .../utilities/sources/helpers/gcs/QueryInfo.java   |  2 +-
 .../sources/TestGcsEventsHoodieIncrSource.java     | 101 -
 8 files changed, 189 insertions(+), 207 deletions(-)
 copy hudi-common/src/main/java/org/apache/hudi/common/function/SerializablePairFlatMapFunction.java => hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectMetadata.java (69%)
 rename hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/{FileDataFetcher.java => GcsObjectDataFetcher.java} (72%)
 rename hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/{FilePathsFetcher.java => GcsObjectMetadataFetcher.java} (67%)
[GitHub] [hudi] xushiyan merged pull request #8290: [HUDI-5983] Improve loading data via cloud store incr source
xushiyan merged PR #8290: URL: https://github.com/apache/hudi/pull/8290
[GitHub] [hudi] xushiyan commented on pull request #8290: [HUDI-5983] Improve loading data via cloud store incr source
xushiyan commented on PR #8290: URL: https://github.com/apache/hudi/pull/8290#issuecomment-1495220009 ![Screenshot 2023-04-03 at 8 41 39 PM](https://user-images.githubusercontent.com/2701446/229664578-243eaafc-a52f-4e05-b0b1-1f2f4af07e08.png) CI passed
[GitHub] [hudi] xuzifu666 commented on pull request #8367: [HUDI-6023] HotFix in HoodieDynamicBoundedBloomFilter with refactor a…
xuzifu666 commented on PR #8367: URL: https://github.com/apache/hudi/pull/8367#issuecomment-1495219415 @hudi-bot run azure
[GitHub] [hudi] LiJie20190102 commented on issue #8331: [SUPPORT] When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle
LiJie20190102 commented on issue #8331: URL: https://github.com/apache/hudi/issues/8331#issuecomment-1495213134

> @LiJie20190102 Can you let us know the complete spark-submit command you are using.

I found a configuration: `--post-write-termination-strategy-class`. I tried using `org.apache.hudi.utilities.deltastreamer.NoNewDataTerminationStrategy` to stop the task, but it didn't seem to meet my expectations. I expected that after it stops the ExecutorService, the SparkContext would also stop, but the SparkContext stays up and no subsequent logs are visible. ![image](https://user-images.githubusercontent.com/53458004/229662805-e1b4bfa2-31f6-4ad1-aede-860ecb6af143.png) ![image](https://user-images.githubusercontent.com/53458004/229662822-ce078c25-467d-44fa-b286-db5d3d2e8d07.png)
[GitHub] [hudi] bigdata-spec commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
bigdata-spec commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495208401 @huangxiaopingRD @ad1happy2go Thank you for your kindness. The HMS version is 2.1.1-cdh6.3.2. Our environment is CDH 6.3.2 and we want to replace **2.4.0-cdh6.3.2 for Spark** with **Apache Spark 3.1.1**, so I used the command: `./dev/make-distribution.sh --name 3.0.0-cdh6.3.2 --tgz -Pyarn -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2` and got **spark-3.1.1-bin-3.0.0-cdh6.3.2.tgz**. In spark-defaults.conf I set
```
spark.sql.hive.metastore.version=2.1.1
spark.sql.hive.metastore.jars=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hive/lib/*
```
It works well for common Hive tables, but for Hudi tables, create succeeds while insert fails.
[GitHub] [hudi] rahil-c commented on pull request #5391: [HUDI-3945] After the async compaction operation is complete, the task should exit
rahil-c commented on PR #5391: URL: https://github.com/apache/hudi/pull/5391#issuecomment-1495174849 @watermelon12138 Which Spark version were you using when you encountered the issue that prompted you to create this PR?
[GitHub] [hudi] bithw1 closed issue #8370: [SUPPORT]What's the difference between time-travel-query and point-in-time-query in the doc.
bithw1 closed issue #8370: [SUPPORT]What's the difference between time-travel-query and point-in-time-query in the doc. URL: https://github.com/apache/hudi/issues/8370
[GitHub] [hudi] bithw1 commented on issue #8370: [SUPPORT]What's the difference between time-travel-query and point-in-time-query in the doc.
bithw1 commented on issue #8370: URL: https://github.com/apache/hudi/issues/8370#issuecomment-1495169282 Thanks @ad1happy2go, we have the same understanding. Thanks.
[GitHub] [hudi] rahil-c commented on pull request #5391: [HUDI-3945] After the async compaction operation is complete, the task should exit
rahil-c commented on PR #5391: URL: https://github.com/apache/hudi/pull/5391#issuecomment-1495127618 @yihua @xiarixiaoyao Wanted to get community thoughts on whether this is safe to revert. I also tried the steps mentioned in the JIRA (https://issues.apache.org/jira/browse/HUDI-3945) to see if this `sys.exit` is required, but in my own repro, things work fine without the sys exit call, similar to what @TengHuo mentioned. The concern with this `sys.exit` call can be seen in the Spark code here: https://github.com/apache/spark/blob/v3.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L258
```
// If user application is exited ahead of time by calling System.exit(N), here mark
// this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call
// System.exit(0) to terminate the application.
```
This is where the `ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)` message comes from.
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102: URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495101369

## CI report:
* be88d99070504f75c88bfcf48b3c078ca93a35df Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16097)
[GitHub] [hudi] hudi-bot commented on pull request #8335: [HUDI-6009] Let the jetty server in TimelineService create daemon threads
hudi-bot commented on PR #8335: URL: https://github.com/apache/hudi/pull/8335#issuecomment-1495086954

## CI report:
* f5ffa39e26536c54bcdd7d29b96b8ef242203b3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16096)
[GitHub] [hudi] hudi-bot commented on pull request #8231: [HUDI-5963] Release 0.13.1 prep
hudi-bot commented on PR #8231: URL: https://github.com/apache/hudi/pull/8231#issuecomment-1495035017

## CI report:
* 1041e445959cf9148ab904b3d456884e0ead7f9e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16095)
[GitHub] [hudi] hudi-bot commented on pull request #8326: [HUDI-6006] Deprecate hoodie.payload.ordering.field
hudi-bot commented on PR #8326: URL: https://github.com/apache/hudi/pull/8326#issuecomment-1495026283

## CI report:
* 4b0c681e00e9ac437a7ff039a0cb827fd5420470 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16094)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1494975080

## CI report:
* 09d9feab5048d47a149f4088c23af9b5072250fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16077)
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16100)
[GitHub] [hudi] peter-mccabe commented on issue #8144: [SUPPORT]Unable to connect to an s3 hudi table
peter-mccabe commented on issue #8144: URL: https://github.com/apache/hudi/issues/8144#issuecomment-1494974166 Any update on this? I really need a way to manage this.
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1494967724

## CI report:
* 09d9feab5048d47a149f4088c23af9b5072250fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16077)
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c UNKNOWN
[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
xushiyan commented on code in PR #8344: URL: https://github.com/apache/hudi/pull/8344#discussion_r1156432021 ## hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieSimpleDataGenerator.java: ## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.testutils; + +import org.apache.hudi.common.model.DefaultHoodieRecordPayload; +import org.apache.hudi.common.model.HoodieAvroRecord; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; + +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +public class HoodieSimpleDataGenerator { Review Comment: `HoodieTestDataGenerator` actually needs an overhaul as the APIs became unorganized over the years and hard to use. More importantly, randomness is a big cause to flakiness and we need a deterministic data gen more than a random data gen for UT/FT scenarios. I can revert this back to using existing data gen class and let the future overhaul work cover the new class adoption. 
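The determinism point raised in the review above is easy to illustrate: if each record is a pure function of its index (no `Random` involved), two runs of a test generate identical data, removing one common source of flaky UTs. This is only a sketch of the idea, not the `HoodieTestDataGenerator` API:

```java
import java.util.ArrayList;
import java.util.List;

// Deterministic generation: record key i depends only on i, so every run of
// a test sees exactly the same data, byte for byte.
public class DeterministicDataGen {
    static List<String> genRecordKeys(String prefix, int count) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            keys.add(prefix + "-" + i);
        }
        return keys;
    }
}
```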
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1156418648

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/ComplexAvroKeyGenerator.java:

@@ -44,6 +48,9 @@ public ComplexAvroKeyGenerator(TypedProperties props) {
 @Override
 public String getRecordKey(GenericRecord record) {
+if (autoGenerateRecordKeys()) {
+ return StringUtils.EMPTY_STRING;
+}

Review Comment: This is kind of unavoidable in the current structure. For e.g., even to fetch the partition path, our KeyGenerator interface only exposes
```
HoodieKey getKey(GenericRecord record)
```
So, to fetch the partition path, we have to call getKey(genRec).getPartitionPath, and hence I had to return an empty string here. We don't want to add a new API to the interface just for this purpose. In case of auto key gen flows, we generate the record keys explicitly (not via the key gen class) and add them to the HoodieKey that we materialize in memory for all records. I can sync up w/ you f2f to clarify this. Ideally, we need two different interfaces: one to generate the partition path and one to generate the record key, and then some of these workarounds may not be required. But with the current structure, we use a single key gen class to generate both record keys and partition paths.
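The two-interface split described at the end of the comment could look roughly like the following; the names and shapes here are hypothetical, not the current Hudi KeyGenerator API:

```java
import java.util.Map;

// Hypothetical split of key generation into one contract per responsibility,
// so an auto-key-gen flow can resolve partition paths without faking a
// record key. Plain maps stand in for Avro records.
interface RecordKeyGenerator {
    String getRecordKey(Map<String, String> record);
}

interface PartitionPathGenerator {
    String getPartitionPath(Map<String, String> record);
}

// A field-based generator implements both; an auto key gen flow would
// register only the PartitionPathGenerator side.
class FieldBasedGenerator implements RecordKeyGenerator, PartitionPathGenerator {
    @Override
    public String getRecordKey(Map<String, String> record) {
        return record.get("uuid");
    }

    @Override
    public String getPartitionPath(Map<String, String> record) {
        return record.get("partition");
    }
}
```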
[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
xushiyan commented on code in PR #8344: URL: https://github.com/apache/hudi/pull/8344#discussion_r1156417851

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:

@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
 .defaultValue("true")
 .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
+
+ public static final ConfigProperty GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment: Not very clear at the moment, given this is still tunable depending on the data's update ratio. It may stay as an infrequently used one like `hoodie.markers.delete.parallelism`.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1156413723

## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:

@@ -82,9 +86,19 @@
 val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
 .asInstanceOf[SparkKeyGeneratorInterface]
+ val partitionId = TaskContext.getPartitionId()
+ var rowId = 0
 iter.map { row =>
-val recordKey = keyGenerator.getRecordKey(row, schema)
+// auto generate record keys if needed
+val recordKey = if (autoGenerateRecordKeys) {
+ val recKey = HoodieRecord.generateSequenceId(instantTime, partitionId, rowId)
+ rowId += 1
+ UTF8String.fromString(recKey)
+}
+else { // else use key generator to fetch record key
+ keyGenerator.getRecordKey(row, schema)

Review Comment: For normal ingestion, we don't use an empty string. I will respond to your question elsewhere (where we return the empty string); it's not very apparent.
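The auto-generated keys shown in the diff above (`generateSequenceId(instantTime, partitionId, rowId)`) amount to a counter scoped to a (commit, Spark partition) pair, which guarantees uniqueness without involving the key generator. A self-contained sketch of that scheme follows; the exact delimiter and format Hudi uses are an assumption here:

```java
// Uniqueness argument: instantTime is unique per commit, partitionId is
// unique per task within a commit, and rowId increments within a task,
// so the triple never repeats for a given table.
public class AutoKeySketch {
    // Illustrative stand-in for HoodieRecord.generateSequenceId; the real
    // format may differ.
    static String generateKey(String instantTime, int partitionId, long rowId) {
        return instantTime + "_" + partitionId + "_" + rowId;
    }
}
```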
[GitHub] [hudi] yihua commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs
yihua commented on code in PR #7881: URL: https://github.com/apache/hudi/pull/7881#discussion_r1156364101

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:

@@ -147,17 +129,16 @@ public class HoodieCleanConfig extends HoodieConfig {
   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = ConfigProperty
       .key("hoodie.cleaner.policy.failed.writes")
       .defaultValue(HoodieFailedWritesCleaningPolicy.EAGER.name())
+      .withEnumDocumentation(HoodieFailedWritesCleaningPolicy.class,
+          "note that LAZY policy is required when multi-writers are enabled.")

Review Comment: nit: capitalize the first letter.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:

@@ -83,8 +79,8 @@ public class HoodieBootstrapConfig extends HoodieConfig {
   public static final ConfigProperty KEYGEN_TYPE = ConfigProperty
       .key("hoodie.bootstrap.keygen.type")
       .defaultValue(KeyGeneratorType.SIMPLE.name())
-      .sinceVersion("0.9.0")
-      .withDocumentation("Type of build-in key generator, currently support SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
+      .withEnumDocumentation(KeyGeneratorType.class, "Key generator class for bootstrap")

Review Comment: For the second argument, is the convention to add a period (`.`) at the end or not? I see both in different enum configs.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

@@ -3028,11 +2993,11 @@ private void validate() {
     Objects.requireNonNull(writeConfig.getString(BASE_PATH));
     if (writeConfig.isEarlyConflictDetectionEnable()) {
       checkArgument(writeConfig.getString(WRITE_CONCURRENCY_MODE)
-          .equalsIgnoreCase(WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.value()),
+          .equals(WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name()),

Review Comment: Same here, could we ignore case as before?

## hudi-common/src/main/java/org/apache/hudi/common/config/ConfigProperty.java:

@@ -139,6 +144,52 @@ public ConfigProperty withDocumentation(String doc) {
     return new ConfigProperty<>(key, defaultValue, docOnDefaultValue, doc, sinceVersion, deprecatedVersion, inferFunction, validValues, advanced, alternatives);
   }

+  public <T extends Enum<T>> ConfigProperty withEnumDocumentation(Class<T> e) {
+    return withEnumDocumentation(e, "");
+  }
+
+  private <T extends Enum<T>> boolean isDefaultField(Class<T> e, Field f) {
+    if (!hasDefaultValue()) {
+      return false;
+    }
+    if (defaultValue() instanceof String) {
+      return f.getName().equals(defaultValue());
+    }
+    return Enum.valueOf(e, f.getName()).equals(defaultValue());
+  }
+
+  public <T extends Enum<T>> ConfigProperty withEnumDocumentation(Class<T> e, String doc, String... internalOption) {

Review Comment: Could we rename this as `withDocumentation` and remove `doc` and `internalOption` for simplicity? `doc` content can be merged to `@EnumDescription`. We can mark internal options in the docs.

## hudi-common/src/main/java/org/apache/hudi/common/util/queue/DisruptorWaitStrategyType.java:

@@ -27,35 +30,50 @@
 /**
  * Enum for the type of waiting strategy in Disruptor Queue.
  */
+@EnumDescription("Type of waiting strategy in the Disruptor Queue")

Review Comment: We can keep the docs the same as before for now. Any docs improvement can be in a separate PR.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:

@@ -55,12 +52,10 @@ public class HoodieBootstrapConfig extends HoodieConfig {
   public static final ConfigProperty PARTITION_SELECTOR_REGEX_MODE = ConfigProperty
       .key("hoodie.bootstrap.mode.selector.regex.mode")
-      .defaultValue(METADATA_ONLY.name())
-      .sinceVersion("0.6.0")
-      .withValidValues(METADATA_ONLY.name(), FULL_RECORD.name())
-      .withDocumentation("Bootstrap mode to apply for partition paths, that match regex above. "
-          + "METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset. "
-          + "FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.");

Review Comment: @jonvex I think @lokeshj1703 means that `avoiding full cost of rewriting the dataset` is missing in the new docs to indicate the benefit.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:

@@ -83,8 +79,8 @@ public class HoodieBootstrapConfig extends HoodieConfig {
   public static final ConfigProperty KEYGEN_TYPE = ConfigProperty
       .key("hoodie.bootstrap.keygen.type")
       .defaultValue(KeyGeneratorType.SIMPLE.name())
-      .sinceVersion("0.9.0")
-      .withDocumentation("Type of build-in key generator, currently support SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
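The `withEnumDocumentation` helper being reviewed above generates a config description by reflecting over an enum's constants instead of hand-maintaining a value list in the doc string. A self-contained sketch of that idea (the annotation name mirrors Hudi's `@EnumDescription`, but this version and the `docFor` helper are illustrative, not Hudi's actual API):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.StringJoiner;

public class EnumDocSketch {
    // Illustrative stand-in for Hudi's @EnumDescription annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Description { String value(); }

    @Description("Type of waiting strategy in the queue")
    enum WaitStrategy { BLOCKING, SLEEPING, YIELDING }

    // Builds a doc string from the enum's class-level description plus its
    // constants, marking the configured default, roughly as an enum-aware
    // withDocumentation might.
    static <T extends Enum<T>> String docFor(Class<T> e, T defaultValue) {
        Description d = e.getAnnotation(Description.class);
        StringJoiner names = new StringJoiner(", ");
        for (T constant : e.getEnumConstants()) {
            names.add(constant == defaultValue
                ? constant.name() + " (default)"
                : constant.name());
        }
        return (d == null ? "" : d.value() + ". ") + "Allowed values: " + names;
    }

    public static void main(String[] args) {
        System.out.println(docFor(WaitStrategy.class, WaitStrategy.BLOCKING));
    }
}
```

The benefit, which motivates the PR, is that adding an enum constant automatically updates every config doc that references the enum, so the description can never drift out of sync with the allowed values.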
[GitHub] [hudi] hudi-bot commented on pull request #8369: [HUDI-6024] Hotfix in MergeIntoHoodieTableCommand::validate with remo…
hudi-bot commented on PR #8369: URL: https://github.com/apache/hudi/pull/8369#issuecomment-1494885734 ## CI report: * 544fc9fba0dbf84c03353dcdaf52b7409d31af40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16092) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8128: [HUDI-5782] Tweak defaults and remove unnecessary configs after config review
hudi-bot commented on PR #8128: URL: https://github.com/apache/hudi/pull/8128#issuecomment-1494885024 ## CI report: * fca6d63c9ef24cdd0cfe30060a58430d035e0664 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16093) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nfarah86 commented on issue #8365: [SUPPORT] inconsistent Readoptimized view in merge on read table
nfarah86 commented on issue #8365: URL: https://github.com/apache/hudi/issues/8365#issuecomment-1494876289 It's not documented. I'm working on updating documentation.
[jira] [Updated] (HUDI-6028) GCS incr source does not handle pubsub message properly
[ https://issues.apache.org/jira/browse/HUDI-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-6028: Sprint: Sprint 2023-04-10

> GCS incr source does not handle pubsub message properly
> ---
>
> Key: HUDI-6028
> URL: https://issues.apache.org/jira/browse/HUDI-6028
> Project: Apache Hudi
> Issue Type: Bug
> Components: deltastreamer
> Reporter: Raymond Xu
> Priority: Major
>
> The GCS event source uses the schema converter from Spark and won't handle a field name with a hyphen in a nested column. A sample message:
> {code:java}
> 23/04/03 19:23:45 DEBUG GcsEventsSource: msg: {
>   "kind": "storage#object",
>   "id": "",
>   "selfLink": "",
>   "name": "",
>   "bucket": "",
>   "generation": "1680505551370137",
>   "metageneration": "1",
>   "contentType": "application/octet-stream",
>   "timeCreated": "2023-04-03T07:05:51.373Z",
>   "updated": "2023-04-03T07:05:51.373Z",
>   "storageClass": "STANDARD",
>   "timeStorageClassUpdated": "2023-04-03T07:05:51.373Z",
>   "size": "6707",
>   "md5Hash": "",
>   "mediaLink": "",
>   "metadata": {
>     "goog-reserved-file-mtime": "1680503048"
>   },
>   "crc32c": "",
>   "etag": ""
> }
> {code}
> and it throws
> {code}
> Exception in thread "main" org.apache.avro.SchemaParseException: Illegal character in: goog-reserved-file-mtime
>   at org.apache.avro.Schema.validateName(Schema.java:1571)
>   at org.apache.avro.Schema.access$400(Schema.java:92)
>   at org.apache.avro.Schema$Field.<init>(Schema.java:549)
>   at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2258)
>   at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2254)
>   at org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150)
>   at org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2557)
>   at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:205)
> {code}
> This is a problem with org.apache.spark.sql.avro.SchemaConverters#toAvroType

-- This message was sent by Atlassian Jira (v8.20.10#820010)
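Avro requires names to match `[A-Za-z_][A-Za-z0-9_]*`, which is exactly why the hyphenated `goog-reserved-file-mtime` key fails `Schema.validateName`. One common workaround (an illustrative sketch, not what Hudi or Spark currently does) is to sanitize field names before handing the schema to the converter:

```java
import java.util.regex.Pattern;

public class AvroNameSanitizer {
    // Avro names must start with [A-Za-z_] and contain only [A-Za-z0-9_].
    private static final Pattern INVALID_CHARS = Pattern.compile("[^A-Za-z0-9_]");

    // Replaces every illegal character with an underscore, and prefixes an
    // underscore when the result would otherwise start with a digit.
    static String sanitize(String fieldName) {
        String cleaned = INVALID_CHARS.matcher(fieldName).replaceAll("_");
        if (!cleaned.isEmpty() && Character.isDigit(cleaned.charAt(0))) {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("goog-reserved-file-mtime")); // goog_reserved_file_mtime
    }
}
```

Note that sanitization is lossy: a real fix would also need to map sanitized names back to the original JSON keys when extracting values, and distinct inputs like `a-b` and `a_b` would collide.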
[GitHub] [hudi] xccui closed issue #8305: [SUPPORT] Potential FileSystem http connection leaking
xccui closed issue #8305: [SUPPORT] Potential FileSystem http connection leaking URL: https://github.com/apache/hudi/issues/8305
[GitHub] [hudi] xccui commented on issue #8305: [SUPPORT] Potential FileSystem http connection leaking
xccui commented on issue #8305: URL: https://github.com/apache/hudi/issues/8305#issuecomment-1494861495

Hi @danny0405, I looked into this again. You are right, `returnContent()` will release the connection. Actually, I was misled by the code. There will be two `PoolingHttpClientConnectionManager`s at runtime.

```
leaseConnection:306, PoolingHttpClientConnectionManager (com.amazonaws.thirdparty.apache.http.impl.conn)
get:282, PoolingHttpClientConnectionManager$1 (com.amazonaws.thirdparty.apache.http.impl.conn)
invoke:-1, GeneratedMethodAccessor24 (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
invoke:70, ClientConnectionRequestFactory$Handler (com.amazonaws.http.conn)
get:-1, $Proxy51 (com.amazonaws.http.conn)
execute:190, MainClientExec (com.amazonaws.thirdparty.apache.http.impl.execchain)
execute:186, ProtocolExec (com.amazonaws.thirdparty.apache.http.impl.execchain)
doExecute:185, InternalHttpClient (com.amazonaws.thirdparty.apache.http.impl.client)
execute:83, CloseableHttpClient (com.amazonaws.thirdparty.apache.http.impl.client)
execute:56, CloseableHttpClient (com.amazonaws.thirdparty.apache.http.impl.client)
execute:72, SdkHttpClient (com.amazonaws.http.apache.client.impl)
executeOneRequest:1346, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
executeHelper:1157, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
doExecute:814, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
executeWithTimer:781, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
execute:755, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
access$500:715, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
execute:697, AmazonHttpClient$RequestExecutionBuilderImpl (com.amazonaws.http)
execute:561, AmazonHttpClient (com.amazonaws.http)
execute:541, AmazonHttpClient (com.amazonaws.http)
invoke:5456, AmazonS3Client (com.amazonaws.services.s3)
invoke:5403, AmazonS3Client (com.amazonaws.services.s3)
getObjectMetadata:1372, AmazonS3Client (com.amazonaws.services.s3)
lambda$getObjectMetadata$10:2545, S3AFileSystem (org.apache.hadoop.fs.s3a)
apply:-1, 497983073 (org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$1189)
retryUntranslated:414, Invoker (org.apache.hadoop.fs.s3a)
retryUntranslated:377, Invoker (org.apache.hadoop.fs.s3a)
getObjectMetadata:2533, S3AFileSystem (org.apache.hadoop.fs.s3a)
getObjectMetadata:2513, S3AFileSystem (org.apache.hadoop.fs.s3a)
s3GetFileStatus:3776, S3AFileSystem (org.apache.hadoop.fs.s3a)
innerGetFileStatus:3688, S3AFileSystem (org.apache.hadoop.fs.s3a)
lambda$getFileStatus$24:3556, S3AFileSystem (org.apache.hadoop.fs.s3a)
apply:-1, 718057245 (org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$2610)
lambda$trackDurationOfOperation$5:499, IOStatisticsBinding (org.apache.hadoop.fs.statistics.impl)
apply:-1, 2039613101 (org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding$$Lambda$1168)
trackDuration:444, IOStatisticsBinding (org.apache.hadoop.fs.statistics.impl)
trackDurationAndSpan:2337, S3AFileSystem (org.apache.hadoop.fs.s3a)
trackDurationAndSpan:2356, S3AFileSystem (org.apache.hadoop.fs.s3a)
getFileStatus:3554, S3AFileSystem (org.apache.hadoop.fs.s3a)
lambda$getFileStatus$17:410, HoodieWrapperFileSystem (org.apache.hudi.common.fs)
get:-1, 589863653 (org.apache.hudi.common.fs.HoodieWrapperFileSystem$$Lambda$2609)
executeFuncWithTimeMetrics:114, HoodieWrapperFileSystem (org.apache.hudi.common.fs)
getFileStatus:404, HoodieWrapperFileSystem (org.apache.hudi.common.fs)
checkTableValidity:51, TableNotFoundException (org.apache.hudi.exception)
<init>:137, HoodieTableMetaClient (org.apache.hudi.common.table)
newMetaClient:689, HoodieTableMetaClient (org.apache.hudi.common.table)
access$000:81, HoodieTableMetaClient (org.apache.hudi.common.table)
build:770, HoodieTableMetaClient$Builder (org.apache.hudi.common.table)
createMetaClient:277, StreamerUtil (org.apache.hudi.util)
<init>:118, WriteProfile (org.apache.hudi.sink.partitioner.profile)
<init>:44, DeltaWriteProfile (org.apache.hudi.sink.partitioner.profile)
getWriteProfile:75, WriteProfiles (org.apache.hudi.sink.partitioner.profile)
lambda$singleton$0:64, WriteProfiles (org.apache.hudi.sink.partitioner.profile)
apply:-1, 401283836 (org.apache.hudi.sink.partitioner.profile.WriteProfiles$$Lambda$3189)
computeIfAbsent:1134, HashMap (java.util)
singleton:63, WriteProfiles (org.apache.hudi.sink.partitioner)
create:56, BucketAssigners (org.apache.hudi.sink.partitioner)
open:122, BucketAssignFunction (org.apache.hudi.sink.partitioner)
openFunction:34, FunctionUtils (org.apache.flink.api.common.functions.util)
open:100, AbstractUdfStreamOperator (org.apache.flink.streaming.api.operators)
open:55, KeyedProcessOperator (org.apache.flink.streaming.api.operators)
initia
```
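The pool mechanics behind this discussion can be shown with a toy model (purely illustrative, not Apache HttpClient's API): a pooled client leases a connection per request and returns it only when the response is fully consumed, so responses that are never consumed eventually exhaust the pool. That is why `returnContent()`, which consumes the entity and releases the connection, matters.

```java
import java.util.concurrent.Semaphore;

public class PooledClientSketch {
    // A response that returns its connection to the pool only when consumed,
    // mirroring how returnContent() releases a pooled HTTP connection.
    static class Response {
        private final Semaphore pool;
        private boolean consumed = false;
        Response(Semaphore pool) { this.pool = pool; }
        void consume() { if (!consumed) { consumed = true; pool.release(); } }
    }

    // A client backed by a fixed-size connection pool; the semaphore's permits
    // stand in for the connection manager's available leases.
    static class PooledClient {
        final Semaphore pool;
        PooledClient(int poolSize) { this.pool = new Semaphore(poolSize); }
        Response execute() {
            if (!pool.tryAcquire()) {
                throw new IllegalStateException("connection pool exhausted");
            }
            return new Response(pool);
        }
        int availableConnections() { return pool.availablePermits(); }
    }

    public static void main(String[] args) {
        PooledClient client = new PooledClient(2);
        Response leaked = client.execute(); // never consumed -> this lease is lost
        Response ok = client.execute();
        System.out.println(client.availableConnections()); // 0: both leased
        ok.consume();                                      // release on consume
        System.out.println(client.availableConnections()); // 1: the leak still holds one
    }
}
```

With a real pool, each leaked lease permanently removes capacity until the route's connection limit is hit, at which point further requests block or time out, which matches the leak symptom reported in the issue.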
[jira] [Created] (HUDI-6028) GCS incr source does not handle pubsub message properly
Raymond Xu created HUDI-6028:
Summary: GCS incr source does not handle pubsub message properly
Key: HUDI-6028
URL: https://issues.apache.org/jira/browse/HUDI-6028
Project: Apache Hudi
Issue Type: Bug
Components: deltastreamer
Reporter: Raymond Xu

The GCS event source uses the schema converter from Spark and won't handle a field name with a hyphen in a nested column. A sample message:
{code:java}
23/04/03 19:23:45 DEBUG GcsEventsSource: msg: {
  "kind": "storage#object",
  "id": "",
  "selfLink": "",
  "name": "",
  "bucket": "",
  "generation": "1680505551370137",
  "metageneration": "1",
  "contentType": "application/octet-stream",
  "timeCreated": "2023-04-03T07:05:51.373Z",
  "updated": "2023-04-03T07:05:51.373Z",
  "storageClass": "STANDARD",
  "timeStorageClassUpdated": "2023-04-03T07:05:51.373Z",
  "size": "6707",
  "md5Hash": "",
  "mediaLink": "",
  "metadata": {
    "goog-reserved-file-mtime": "1680503048"
  },
  "crc32c": "",
  "etag": ""
}
{code}
and it throws
{code}
Exception in thread "main" org.apache.avro.SchemaParseException: Illegal character in: goog-reserved-file-mtime
  at org.apache.avro.Schema.validateName(Schema.java:1571)
  at org.apache.avro.Schema.access$400(Schema.java:92)
  at org.apache.avro.Schema$Field.<init>(Schema.java:549)
  at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2258)
  at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2254)
  at org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150)
  at org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2557)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:205)
{code}
This is a problem with org.apache.spark.sql.avro.SchemaConverters#toAvroType
[GitHub] [hudi] xccui commented on issue #8060: [SUPPORT] An instant exception occurs when the flink job is restarted
xccui commented on issue #8060: URL: https://github.com/apache/hudi/issues/8060#issuecomment-1494850427 I hit the same issue. Just feel that the current asynchronous operations are a bit fragile. I believe sometimes tasks in a Flink job will be in a zombie state before they get killed. In that case, Hudi will see multiple writers. If we know that could happen, is it possible to avoid it?