Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044184717

## CI report:

* c382de2b71540404831449de82e40d9488a38575 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Duplicate Row in Same Partition using Global Bloom Index [hudi]
Raghvendradubey commented on issue #9536: URL: https://github.com/apache/hudi/issues/9536#issuecomment-2044164961 Hi @ad1happy2go @nsivabalan, after migrating to the new Hudi version 0.14.0 I didn't face this issue again. Thanks for your support. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] spark stuctrued streaming failed to update MDT metadata [hudi]
Qiuzhuang commented on issue #10891: URL: https://github.com/apache/hudi/issues/10891#issuecomment-2044133901

> but wouldn't the in-process lock provider kick in? and should avoid multiple writers to MDT. I am assuming the setup is spark streaming w/ async compaction or clustering. A single process, but multiple threads trying to ingest to MDT. if the in-process lock provider is not kicking in, then it's a bug.

If async clustering runs in the same process, we don't run into the issue for now. But for multiple writers, like offline clustering in another process, as indicated by @danny0405, we should use a ZK lock provider to serialize MDT writes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
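For context, a minimal sketch of the kind of ZooKeeper lock configuration being suggested. The host, port, lock key, and paths below are placeholders, not values from this thread:

```java
// Illustrative only: serialize MDT writes across processes (e.g. a streaming
// ingestion job plus an offline clustering job) with the ZooKeeper lock provider.
// Assumes df is an existing Dataset<Row>, basePath is the table path, and
// org.apache.spark.sql.SaveMode is imported. Both writers need the same settings.
df.write().format("hudi")
    .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
    .option("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
    .option("hoodie.write.lock.zookeeper.url", "zk-host")             // placeholder
    .option("hoodie.write.lock.zookeeper.port", "2181")
    .option("hoodie.write.lock.zookeeper.lock_key", "my_table")       // placeholder
    .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")   // placeholder
    .mode(SaveMode.Append)
    .save(basePath);
```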
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044130880

## CI report:

* 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
* c382de2b71540404831449de82e40d9488a38575 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044125667

## CI report:

* 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
* c382de2b71540404831449de82e40d9488a38575 UNKNOWN

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry [hudi]
nsivabalan commented on code in PR #10635: URL: https://github.com/apache/hudi/pull/10635#discussion_r1556835447

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:

## @@ -200,6 +200,11 @@ public static HoodieWriteConfig createMetadataWriteConfig(
         builder.withProperties(datadogConfig.build().getProps());
         break;
       case PROMETHEUS:
+        HoodieMetricsPrometheusConfig prometheusConfig = HoodieMetricsPrometheusConfig.newBuilder()
+            .withPushgatewayLabels(writeConfig.getPushGatewayLabels())
+            .withPrometheusPortNum(writeConfig.getPrometheusPort()).build();

Review Comment: I checked the Prometheus reporter, and we need only the Prometheus port and the push gateway labels.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry [hudi]
nsivabalan commented on code in PR #10635: URL: https://github.com/apache/hudi/pull/10635#discussion_r1556836048

## hudi-common/src/main/java/org/apache/hudi/metrics/Metrics.java:

## @@ -176,4 +190,16 @@ public static boolean isInitialized(String basePath) {
     }
     return false;
   }
+
+  /**
+   * Use the same base path as the hudi table so that Metrics instance is shared.
+   */
+  private static String getBasePath(HoodieMetricsConfig metricsConfig) {
+    String basePath = metricsConfig.getBasePath();
+    if (basePath.endsWith(HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH)) {

Review Comment: my bad.
```
public static final String METADATA_TABLE_FOLDER_PATH = METAFOLDER_NAME + Path.SEPARATOR + "metadata";
```
looks like we already account for what I asked for

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7395] Fix computation for metrics in HoodieMetadataMetrics [hudi]
nsivabalan commented on PR #10641: URL: https://github.com/apache/hudi/pull/10641#issuecomment-2044100016

hey @prashantwason: let's decouple the fixes.
a. Fixing MDT to emit writer-side metrics (commit duration, compaction duration, etc.)
b. Fixing MDT to emit reader-side metrics (col stats lookup duration, etc.) via the distributed registry.

I feel we should focus on (a) in this patch and get it landed, and you can put out a patch (I assume you folks already have a fix) for distributed-registry-based metrics from the executors. If you are aligned on that, let us know if you have any feedback on this patch, or if we are good to go ahead.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980: URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

## @@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // The compactor avoids heavy rewriting when copy the old record from old base file into new base file
+    if (config.populateMetaFields()) {
+      LOG.info("Using update instead rewriting during compaction");

Review Comment:
> Set the log as debug level

Using info level here does not cost much, right? It only prints the log in the class constructor, not for each input record.

> "instead" -> "instead of".

Done

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980: URL: https://github.com/apache/hudi/pull/10980#discussion_r1556817370

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

## @@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment: Good question. This method is responsible only for merging the base record with the incremental record; it does not handle schema evolution. Schema evolution is handled before `HoodieMergeHandle#write` is called.

Screenshots:
https://github.com/apache/hudi/assets/1525333/3a03e08b-fe2e-4da6-a788-07cbb6feeadd
https://github.com/apache/hudi/assets/1525333/def1f2ee-ed97-47f8-92b6-76d45500bea7

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] spark stuctrued streaming failed to update MDT metadata [hudi]
xicm commented on issue #10891: URL: https://github.com/apache/hudi/issues/10891#issuecomment-2044061311

The root cause is that a delta commit in the MDT rolls back the compaction instant in the MDT (compaction in the MDT is a delta commit). When a compaction starts, it creates an **inflight DeltaCommit** in the MDT. Because the compaction is asynchronous, data ingestion goes on, and the writer starts a new delta commit in both the data table and the MDT. In the MDT, the new delta commit then rolls back the uncompleted delta commit (the one created by the async compaction).

Is it possible to filter out the delta commits created by compaction in the MDT when we do a rollback?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
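A rough sketch of the filtering idea (illustrative only; this is not actual Hudi rollback code, just the shape of the check on the MDT timeline):

```java
// Illustrative sketch: skip MDT delta commits that back a pending compaction
// when choosing rollback targets. The wiring around these calls is hypothetical.
Set<String> pendingCompactionTimes = mdtTimeline
    .filterPendingCompactionTimeline()
    .getInstants().stream()
    .map(HoodieInstant::getTimestamp)
    .collect(Collectors.toSet());

// Roll back pending delta commits, except those created by a pending compaction.
List<HoodieInstant> rollbackCandidates = mdtTimeline
    .getDeltaCommitTimeline()
    .filterInflightsAndRequested()
    .getInstants().stream()
    .filter(i -> !pendingCompactionTimes.contains(i.getTimestamp()))
    .collect(Collectors.toList());
```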
[I] [SUPPORT]Exception when executing log compaction : Unsupported Operation Exception [hudi]
MrAladdin opened a new issue, #10982: URL: https://github.com/apache/hudi/issues/10982

**Describe the problem you faced**

1. Spark upserts to Hudi (MOR).
2. Exception when executing log compaction: Unsupported Operation Exception.
3. org.apache.hudi.exception.HoodieRollbackException: Unknown listing type, during rollback of [==>20240409000634923005__logcompaction__INFLIGHT]

I also want to know why, after a log compaction exception, the instant remains in an inflight state and the program does not exit abnormally.

**Environment Description**

* Hudi version : 0.14.1
* Spark version : 3.4.1
* Hive version : 3.1.2
* Hadoop version : 3.1.3
* Storage (HDFS/S3/GCS..) : hdfs
* Running on Docker? (yes/no) : no

**Additional context**

.option("hoodie.metadata.enable", "true")
.option("hoodie.metadata.index.async", "false")
.option("hoodie.metadata.index.check.timeout.seconds", "900")
.option("hoodie.auto.adjust.lock.configs", "true")
.option("hoodie.metadata.optimized.log.blocks.scan.enable", "true")
.option("hoodie.metadata.metrics.enable", "false")
.option("hoodie.metadata.index.column.stats.enable", "false")
.option("hoodie.metadata.compact.max.delta.commits", "10")
.option("hoodie.metadata.record.index.enable", "true")
.option("hoodie.index.type", "RECORD_INDEX")
.option("hoodie.metadata.max.init.parallelism", "10")
.option("hoodie.metadata.record.index.min.filegroup.count", "10")
.option("hoodie.metadata.record.index.max.filegroup.count", "1")
.option("hoodie.metadata.record.index.max.filegroup.size", "1073741824")
.option("hoodie.metadata.auto.initialize", "true")
.option("hoodie.metadata.record.index.growth.factor", "2.0")
.option("hoodie.metadata.max.logfile.size", "2147483648")
.option("hoodie.metadata.log.compaction.enable", "true")
.option("hoodie.metadata.log.compaction.blocks.threshold", "5")
.option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
.option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider")
.option("hoodie.write.lock.filesystem.expire", "10")

**Stacktrace**

One exception:

Job aborted due to stage failure: Task 6 in stage 203.0 failed 4 times, most recent failure: Lost task 6.3 in stage 203.0 (TID 4263) (11.slave.hdp executor 13): org.apache.hudi.exception.HoodieException: Unsupported Operation Exception
at org.apache.hudi.common.util.collection.BitCaskDiskMap.values(BitCaskDiskMap.java:302)
at org.apache.hudi.common.util.collection.ExternalSpillableMap.values(ExternalSpillableMap.java:275)
at org.apache.hudi.table.HoodieSparkMergeOnReadTable.handleInsertsForLogCompaction(HoodieSparkMergeOnReadTable.java:206)
at org.apache.hudi.table.action.compact.LogCompactionExecutionHelper.writeFileAndGetWriteStats(LogCompactionExecutionHelper.java:79)
at org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:237)
at org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$988df80a$1(HoodieCompactor.java:132)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:326)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at
Re: [I] [Inquiry] Does HoodieIndexer can Do Indexing for RLI Async Fashion [hudi]
nsivabalan commented on issue #10815: URL: https://github.com/apache/hudi/issues/10815#issuecomment-2044048808 hey @ad1happy2go @codope: looks like there is some misunderstanding on how to use the async indexer. When enabling the async indexer to build, say, RLI, the ingestion job also needs to have async indexing enabled for RLI; we can't completely disable it from the regular ingestion job. Can you folks follow up on any doc enhancements? CC @soumilshah1995 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
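For context, a sketch of the expected ingestion-side settings (illustrative; verify flags against the metadata-indexing docs for your Hudi version):

```java
// Illustrative ingestion-side config: async indexing stays enabled for the
// index being built (RLI here), while the separate HoodieIndexer job
// (org.apache.hudi.utilities.HoodieIndexer, e.g. --mode scheduleAndExecute
// --index-types RECORD_INDEX) actually performs the build.
// Assumes df is an existing Dataset<Row> and basePath is the table path.
df.write().format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.record.index.enable", "true")
    .option("hoodie.metadata.index.async", "true")  // must stay on in ingestion
    .mode(SaveMode.Append)
    .save(basePath);
```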
Re: [I] Duplicate Row in Same Partition using Global Bloom Index [hudi]
nsivabalan commented on issue #9536: URL: https://github.com/apache/hudi/issues/9536#issuecomment-2044042739 hey @Raghvendradubey: any follow-ups on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]
nsivabalan commented on issue #9748: URL: https://github.com/apache/hudi/issues/9748#issuecomment-2044042481 hey @ad1happy2go: any follow-up on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated [hudi]
nsivabalan commented on issue #9615: URL: https://github.com/apache/hudi/issues/9615#issuecomment-2044040888 We are going to attempt to fix this using completion markers and will post an update shortly on how we plan to tackle it. But in the meantime, I'm curious how you folks are detecting these additional parquet files. There is a chance they could lead to duplicates, right? How are you folks managing to avoid data consistency issues? Until we have a proper fix, I'm trying to gauge whether we can suggest some workarounds for other Hudi OSS users. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
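For context, one possible stop-gap (assuming the stray files come from speculative task attempts, per the issue title) is to turn speculation off for the compaction job:

```java
// Workaround sketch: run the compaction job with speculation disabled so a
// duplicate task attempt cannot leave behind a second, partially-written
// parquet file. Standard Spark config; wiring it into your job is up to you.
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().set("spark.speculation", "false");
```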
Re: [I] Enable Hudi Metadata Table and Multi-Modal Index bug [hudi]
nsivabalan commented on issue #9672: URL: https://github.com/apache/hudi/issues/9672#issuecomment-2044037688 hey @MorningGlow: any follow-ups on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]
nsivabalan commented on issue #9751: URL: https://github.com/apache/hudi/issues/9751#issuecomment-2044036786 hey @njalan @BruceKellan: any follow-ups on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]
nsivabalan commented on issue #9861: URL: https://github.com/apache/hudi/issues/9861#issuecomment-2044035691 hey @ad1happy2go: any follow-ups on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Compaction error [hudi]
nsivabalan commented on issue #9885: URL: https://github.com/apache/hudi/issues/9885#issuecomment-2044033752 hey @ad1happy2go: a reminder to follow up on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] AWS Athena query fail when compaction is scheduled for MOR table [hudi]
nsivabalan commented on issue #9907: URL: https://github.com/apache/hudi/issues/9907#issuecomment-2044029051 hey @codope @rahil-c: are the Athena-on-Hudi query issues all fixed as of now, or do we still have pending gaps? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Data loss in MOR table after clustering partition [hudi]
nsivabalan commented on issue #9977: URL: https://github.com/apache/hudi/issues/9977#issuecomment-2044027211 hey @ad1happy2go: what's the follow-up on this? Do we need to make any fixes to Hudi, or doc enhancements, etc.? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]
nsivabalan commented on issue #10107: URL: https://github.com/apache/hudi/issues/10107#issuecomment-2044026284 hey @haoxie-aws: the linked PRs should fix the issue reported. Are you facing the issue after 0.14.1 as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
nsivabalan commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-2044025853 hey @noahtaite: any follow-ups on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Compaction & Clustering are not working [hudi]
nsivabalan commented on issue #10183: URL: https://github.com/apache/hudi/issues/10183#issuecomment-2044025493 hey @ad1happy2go: can you follow up on this? @Cpandey43: yes, you are right; enabling async table services w/ batch writers like spark-ds does not mean much. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] INSERT_OVERWRITE_TABLE on subsequent runs fails with a metadata file not found error (v0.14.0) [hudi]
nsivabalan commented on issue #10445: URL: https://github.com/apache/hudi/issues/10445#issuecomment-2044023506 Just to get past the issue, you can completely delete the table and rewrite it, or use overwrite mode w/ Spark, until we have a proper fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
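For context, a minimal sketch of the overwrite-mode workaround (`df`, `hudiWriteOptions`, and `basePath` stand for the existing writer inputs and are assumed):

```java
// Illustrative only: recreate the table from scratch with Spark's Overwrite
// save mode; the table path is wiped and rewritten, which sidesteps the
// missing-metadata-file error until a proper fix lands.
df.write().format("hudi")
    .options(hudiWriteOptions)     // existing Hudi writer options (assumed)
    .mode(SaveMode.Overwrite)
    .save(basePath);
```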
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
danny0405 commented on code in PR #10976: URL: https://github.com/apache/hudi/pull/10976#discussion_r1556736368

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

## @@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi
    */
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
+    this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List<String> getPendingReplaceInstants() {
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
> Can't multiple threads access the same timeline?

It could, and we should introduce some synchronized code guard for the access of the cache; we already did that for some caches in the timeline.

> What do you mean by "map cache"?

My typo, it's the "Pair" cache here.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]
nsivabalan commented on issue #10587: URL: https://github.com/apache/hudi/issues/10587#issuecomment-2043999240 hey @ad1happy2go: do let me know if we find any data consistency issues w/ MDT or RLI. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]
nsivabalan commented on issue #10609: URL: https://github.com/apache/hudi/issues/10609#issuecomment-2043998416 And @ad1happy2go: if you encounter any bugs w.r.t. MDT or RLI, do keep me posted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]
nsivabalan commented on issue #10609: URL: https://github.com/apache/hudi/issues/10609#issuecomment-2043998156 hey @bksrepo: can you file a new issue? hey @ad1happy2go: if the original issue is resolved, can we close it out? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] File not found while using metadata table for insert_overwrite table [hudi]
nsivabalan commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-2043996684 hey @ad1happy2go: if this turns out to be an MDT data consistency issue, do keep me posted. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-7574) Auto-pilot for Flink Hudi sink tasks
[ https://issues.apache.org/jira/browse/HUDI-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835092#comment-17835092 ]

Vinoth Chandar commented on HUDI-7574:
--

We need to rethink these singleton tasks like cleaning etc.

> Auto-pilot for Flink Hudi sink tasks
>
> Key: HUDI-7574
> URL: https://issues.apache.org/jira/browse/HUDI-7574
> Project: Apache Hudi
> Issue Type: Improvement
> Components: flink
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Fix For: 1.0.0
>
> Currently the flink write task parallelism is set up through
> {code:java}
> write.tasks{code}
> it is kind of a fixed number during the lifecycle of the ingestion pipeline,
> while for streaming, there are always fluctuations of the workload; it would be
> great if we can tune the parallelism of write tasks based on the job load
> profile dynamically.
> On K8s, Flink provides an
> [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/]
> which is suitable for the purpose and deserves further investigation.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7574) Auto-pilot for Flink Hudi sink tasks
[ https://issues.apache.org/jira/browse/HUDI-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-7574:
-
Status: In Progress (was: Open)

> Auto-pilot for Flink Hudi sink tasks
>
> Key: HUDI-7574
> URL: https://issues.apache.org/jira/browse/HUDI-7574
> Project: Apache Hudi
> Issue Type: Improvement
> Components: flink
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Fix For: 1.0.0
>
> Currently the flink write task parallelism is set up through
> {code:java}
> write.tasks{code}
> it is kind of a fixed number during the lifecycle of the ingestion pipeline,
> while for streaming, there are always fluctuations of the workload; it would be
> great if we can tune the parallelism of write tasks based on the job load
> profile dynamically.
> On K8s, Flink provides an
> [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/]
> which is suitable for the purpose and deserves further investigation.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT] Duplicate data in base file of MOR table [hudi]
nsivabalan commented on issue #10882: URL: https://github.com/apache/hudi/issues/10882#issuecomment-2043992885 hey @ad1happy2go: if this is related to MDT, can you let me know? I am trying to take stock of all MDT data-consistency-related issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7577) Avoid MDT compaction instant time conflicts
[ https://issues.apache.org/jira/browse/HUDI-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-7577:
-
Status: In Progress (was: Open)

> Avoid MDT compaction instant time conflicts
>
> Key: HUDI-7577
> URL: https://issues.apache.org/jira/browse/HUDI-7577
> Project: Apache Hudi
> Issue Type: Improvement
> Components: core
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Fix For: 1.0.0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files
[ https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-7572:
-
Reviewers: Ethan Guo, Sagar Sumit

> Avoid to schedule empty compaction plan without log files
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
> Issue Type: Improvement
> Components: table-service
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> After the change to [loosen the compaction for MDT|https://issues.apache.org/jira/browse/HUDI-7572], there is a rare
> case where the same compaction instant time gets used for scheduling multiple times; we'd better optimize the
> compactor to avoid empty compaction plan generation.
> Note: although we have an active timeline check to avoid the repetitive scheduling, there is still a small chance
> the compaction has already been archived.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT] IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:33) [hudi]
nsivabalan commented on issue #10906: URL: https://github.com/apache/hudi/issues/10906#issuecomment-2043989098 CC @linliu-code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] No way to clean `archived/` folder [hudi]
nsivabalan commented on issue #10930: URL: https://github.com/apache/hudi/issues/10930#issuecomment-2043988319 Maybe we should introduce an ArchivalClean table service to auto-clean files older than, say, 2 months. Not many users are going to inspect the archived timeline after 2+ months, and it would avoid accumulating the entire history. Interested users could still choose not to clean it up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
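For context, a sketch of what such cleanup could look like as an external job today (illustrative only; no ArchivalClean service exists yet, and the 60-day retention is an arbitrary example):

```java
// Illustrative sketch: prune archived timeline files past a retention window.
// The .hoodie/archived layout matches current Hudi; the policy is hypothetical.
// Uses org.apache.hadoop.fs.{FileSystem, FileStatus, Path} and
// java.util.concurrent.TimeUnit.
FileSystem fs = FileSystem.get(hadoopConf);
Path archivedDir = new Path(basePath, ".hoodie/archived");
long cutoffMs = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(60);
for (FileStatus f : fs.listStatus(archivedDir)) {
  if (f.getModificationTime() < cutoffMs) {
    fs.delete(f.getPath(), false); // drop an old archive slice
  }
}
```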
Re: [I] [Feature Inquiry] index for randomized upserts [hudi]
nsivabalan commented on issue #10961: URL: https://github.com/apache/hudi/issues/10961#issuecomment-2043987312 Just a note: in 0.14.1, RLI is a substitute for a global index, not for any index. For example, if you were using (non-global) BLOOM, you can't simply replace it w/ RLI: the current RLI cannot support the same record key existing in two partitions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
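For context, an illustrative writer configuration for RLI (note the uniqueness caveat above; `df` and `basePath` are assumed):

```java
// Illustrative: RECORD_INDEX as a replacement for a *global* index such as
// GLOBAL_BLOOM. Record keys must be unique across partitions for this to work.
df.write().format("hudi")
    .option("hoodie.index.type", "RECORD_INDEX")   // was e.g. GLOBAL_BLOOM
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.record.index.enable", "true")
    .mode(SaveMode.Append)
    .save(basePath);
```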
Re: [I] [SUPPORT] Rollback failed clustering 0.12.2 [hudi]
nsivabalan commented on issue #10964: URL: https://github.com/apache/hudi/issues/10964#issuecomment-2043986341 hey @suryaprasanna: can you take this up and offer some suggestions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
the-other-tim-brown commented on code in PR #10976: URL: https://github.com/apache/hudi/pull/10976#discussion_r1556695938

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

## @@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi
    */
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
+    this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List<String> getPendingReplaceInstants() {
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
> > It seems like it may make sense long term to return the same instance whenever possible to benefit from this cache.
>
> There should not be much difference, because the map cache you use also has per-timeline granularity. The benefit of moving to the timeline itself is better maintenance.

What do you mean by "map cache"?

> And if we move the cache inside the timeline, there should not be thread access conflicts.

Why is that? Can't multiple threads access the same timeline?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980: URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687381

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

## @@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment: But it still uses the latest schema as the write schema; what about the case where the schema has already evolved?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980: URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687736

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

## @@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // The compactor avoids heavy rewriting when copy the old record from old base file into new base file
+    if (config.populateMetaFields()) {
+      LOG.info("Using update instead rewriting during compaction");

Review Comment: Set the log as debug level, "instead" -> "instead of".

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
danny0405 commented on code in PR #10965: URL: https://github.com/apache/hudi/pull/10965#discussion_r1556682595

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:

## @@ -1135,8 +1137,34 @@ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
    */
   protected HoodieWriteMetadata<O> compact(String compactionInstantTime, boolean shouldComplete) {
     HoodieTable table = createTable(config, context.getHadoopConf().get());
+    Option<HoodieInstant> instantToCompactOption = Option.fromJavaOptional(table.getActiveTimeline()
+        .filterCompletedAndCompactionInstants()
+        .getInstants()
+        .stream()
+        .filter(instant -> HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))

Review Comment: We should only care about pending instants, right? If the compaction is already completed, just skip this run.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
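For context, a sketch of the pending-only variant being suggested (illustrative; it reuses the calls from the hunk above):

```java
// Illustrative variant: look up only pending compaction instants so that a
// completed compaction with this timestamp is simply skipped by this run.
Option<HoodieInstant> pendingInstant = Option.fromJavaOptional(
    table.getActiveTimeline()
        .filterPendingCompactionTimeline()   // excludes completed compactions
        .getInstants()
        .stream()
        .filter(instant -> HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
        .findFirst());
```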
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
danny0405 commented on code in PR #10965: URL: https://github.com/apache/hudi/pull/10965#discussion_r1556682151

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:

## @@ -1135,8 +1138,36 @@ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
    */
   protected HoodieWriteMetadata<O> compact(String compactionInstantTime, boolean shouldComplete) {
     HoodieTable table = createTable(config, context.getHadoopConf().get());
+    Option<HoodieInstant> instantToCompactOption = Option.fromJavaOptional(table.getActiveTimeline()
+        .filterCompletedAndCompactionInstants()
+        .getInstants()
+        .stream()
+        .filter(instant -> HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+        .findFirst());
+    try {
+      // Transaction serves to ensure only one compact job for this instant will start heartbeat, and any other concurrent
+      // compact job will abort if they attempt to execute compact before heartbeat expires
+      // Note that as long as all jobs for this table use this API for compact, then this alone should prevent
+      // compact rollbacks from running concurrently to compact commits.
+      txnManager.beginTransaction(instantToCompactOption, txnManager.getLastCompletedTransactionOwner());

Review Comment: Yeah, even if the state is requested, we should check the heartbeat liveness.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
danny0405 commented on code in PR #10965: URL: https://github.com/apache/hudi/pull/10965#discussion_r1554475930

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:

## @@ -1135,8 +1138,36 @@ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
    */
   protected HoodieWriteMetadata<O> compact(String compactionInstantTime, boolean shouldComplete) {
     HoodieTable table = createTable(config, context.getHadoopConf().get());
+    Option<HoodieInstant> instantToCompactOption = Option.fromJavaOptional(table.getActiveTimeline()
+        .filterCompletedAndCompactionInstants()
+        .getInstants()
+        .stream()
+        .filter(instant -> HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+        .findFirst());
+    try {
+      // Transaction serves to ensure only one compact job for this instant will start heartbeat, and any other concurrent
+      // compact job will abort if they attempt to execute compact before heartbeat expires
+      // Note that as long as all jobs for this table use this API for compact, then this alone should prevent
+      // compact rollbacks from running concurrently to compact commits.
+      txnManager.beginTransaction(instantToCompactOption, txnManager.getLastCompletedTransactionOwner());

Review Comment: When a conflict for the same compaction instant execution is detected, we can:

1. check the state of the instant: if it is in `INFLIGHT` state, then 1.1) if the heartbeat has expired, we can just roll back the last execution and reattempt in this run; 1.2) if the heartbeat has not expired, just skip the execution in this run and log a warning there.
2. if the state is still `REQUESTED`, we can execute it directly?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
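For context, a sketch of this decision tree (illustrative; `heartbeatExpired`, `rollbackInflightCompaction`, and `runCompaction` are hypothetical stand-ins, not actual Hudi signatures):

```java
// Illustrative decision tree only; the helper methods here are hypothetical.
if (instant.getState() == HoodieInstant.State.REQUESTED) {
  // 2. Nothing has started yet; execute directly.
  runCompaction(instant);
} else if (instant.getState() == HoodieInstant.State.INFLIGHT) {
  if (heartbeatExpired(instant)) {
    // 1.1 Stale attempt: roll back the last execution and reattempt here.
    rollbackInflightCompaction(instant);
    runCompaction(instant);
  } else {
    // 1.2 Another attempt is still alive: skip this run and warn.
    LOG.warn("Compaction {} is being executed by another job; skipping this run.", instant);
  }
}
```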
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
danny0405 commented on code in PR #10976: URL: https://github.com/apache/hudi/pull/10976#discussion_r1556677633

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

## @@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi
    */
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
+    this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List<String> getPendingReplaceInstants() {
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
> It seems like it may make sense long term to return the same instance whenever possible to benefit from this cache.

There should not be much difference, because the map cache you use also has per-timeline granularity. The benefit of moving to the timeline itself is better maintenance. And if we move the cache inside the timeline, there should not be thread access conflicts.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
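For context, a sketch of a timeline-scoped cache with a synchronized guard, as discussed (illustrative; the field and method below are hypothetical, not the actual patch):

```java
// Illustrative only: a lazily computed, per-timeline cache of pending replace
// instant timestamps, guarded for concurrent readers. Assumed to live inside
// the timeline class so filterPendingReplaceTimeline() is a member call.
private List<String> pendingReplaceInstants; // computed once per timeline

private List<String> getPendingReplaceInstantsCached() {
  synchronized (this) { // guard concurrent access to the cache
    if (pendingReplaceInstants == null) {
      pendingReplaceInstants = filterPendingReplaceTimeline()
          .getInstants().stream()
          .map(HoodieInstant::getTimestamp)
          .collect(Collectors.toList());
    }
    return pendingReplaceInstants;
  }
}
```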
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
hudi-bot commented on PR #10965: URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043939357

## CI report:

* e1a6e4a24083dd8871a2fc3fbb289e1a6192593a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23154)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7395] Fix computation for metrics in HoodieMetadataMetrics [hudi]
prashantwason commented on code in PR #10641: URL: https://github.com/apache/hudi/pull/10641#discussion_r1556617105

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataMetrics.java:

## @@ -136,7 +144,7 @@ public void updateMetrics(String action, long durationInMs) {
     String countKey = action + ".count";
     String durationKey = action + ".totalDuration";
     incrementMetric(countKey, 1);
-    incrementMetric(durationKey, durationInMs);
+    setMetric(durationKey, durationInMs);

Review Comment: You are assuming that code calling these functions would only call once. That may not be a correct assumption for all cases: opening the MDT is costly, so multiple lookups, etc., can be issued on open MDT readers.

## hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:

## @@ -302,8 +303,8 @@ public Map readRecordIndex(List reco
     });
     metrics.ifPresent(m -> m.updateMetrics(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_TIME_STR, timer.endTimer()));
-    metrics.ifPresent(m -> m.updateMetrics(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_KEYS_COUNT_STR, recordKeys.size()));
+    metrics.ifPresent(m -> m.setMetric(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_KEYS_COUNT_STR, recordKeys.size()));

Review Comment: The same HoodieTableMetadata object can be used to look up keys from the MDT multiple times. In that case, update is more accurate.

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataMetrics.java:

## @@ -73,10 +79,12 @@ public class HoodieMetadataMetrics implements Serializable {
   private static final Logger LOG = LoggerFactory.getLogger(HoodieMetadataMetrics.class);

-  private final Registry metricsRegistry;
+  private final transient MetricRegistry metricsRegistry;
+  private final transient Metrics metrics;

-  public HoodieMetadataMetrics(Registry metricsRegistry) {
-    this.metricsRegistry = metricsRegistry;
+  public HoodieMetadataMetrics(HoodieMetricsConfig metricsConfig) {

Review Comment: If you do not use Registry, then no metrics can be collected from the executors, where most of the operations on the MDT readers take place (for indexes other than the files index). E.g., an RI lookup: since there are multiple file groups in record_index, when looking up keys from the record index, each executor opens one file group of the record index and reads the keys that belong to that file group. When HoodieTableMetadata is serialized by Spark and sent to the executors, the executors end up updating a local copy of the metadata metrics. Since the publishing of the metrics is only done on the driver side, the metrics updated on the executor side never make it to the driver and hence are never published.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
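For context, the two semantics under discussion in miniature (illustrative, reusing this patch's method names):

```java
// Counter semantics: repeated lookups on one open MDT reader accumulate,
// so the emitted value is a running total across calls.
incrementMetric("lookup.totalDuration", durationInMs); // emits 120, then 250, ...

// Gauge semantics: each lookup overwrites the previous value, so the emitted
// value reflects only the most recent call.
setMetric("lookup.totalDuration", durationInMs);       // emits 120, then 130, ...
```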
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
hudi-bot commented on PR #10965: URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043855651

## CI report:

* c41af6435281865147967768419da5e4fb688f8b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153)
* e1a6e4a24083dd8871a2fc3fbb289e1a6192593a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23154)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
hudi-bot commented on PR #10965: URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043839847

## CI report:

* c41af6435281865147967768419da5e4fb688f8b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153)
* e1a6e4a24083dd8871a2fc3fbb289e1a6192593a UNKNOWN

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi CLI bundle not working [hudi]
mansipp commented on issue #10566: URL: https://github.com/apache/hudi/issues/10566#issuecomment-2043833097 Getting a similar error while running `commit rollback`: ``` commit rollback --commit 20240408231846380 24/04/08 23:22:02 INFO InputStreamConsumer: Apr 08, 2024 11:22:02 PM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption 24/04/08 23:22:02 INFO InputStreamConsumer: WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location 24/04/08 23:22:02 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO InputStreamConsumer: Error: Failed to load org.apache.hudi.cli.commands.SparkMain: org/apache/hudi/common/engine/HoodieEngineContext 24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO ShutdownHookManager: Shutdown hook called 24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-272bb6ef-f858-42a6-b9d0-9614f1f36371 24/04/08 23:22:03 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/ 24/04/08 23:22:03 INFO HoodieTableConfig: Loading table properties from s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties 24/04/08 23:22:03 INFO S3NativeFileSystem: Opening 's3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties' for reading 24/04/08 23:22:03 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/ Commit 20240408231846380 failed to roll back``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch asf-site updated: [DOCS] Updates slack link across site (#10981)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new a4ec3fc9016 [DOCS] Updates slack link across site (#10981) a4ec3fc9016 is described below commit a4ec3fc90168229b8d76dfd95b453d9da66cca36 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Mon Apr 8 15:35:46 2024 -0700 [DOCS] Updates slack link across site (#10981) --- ...021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md | 2 +- website/blog/2022-01-06-apache-hudi-2021-a-year-in-review.md| 4 ++-- .../2022-01-14-change-data-capture-with-debezium-and-apache-hudi.md | 2 +- website/blog/2022-12-29-Apache-Hudi-2022-A-Year-In-Review.md| 6 +++--- website/blog/2023-12-28-apache-hudi-2023-a-year-in-review.md| 2 +- website/community/get-involved.md | 2 +- website/docs/overview.md| 4 ++-- website/docusaurus.config.js| 4 ++-- website/i18n/cn/docusaurus-plugin-content-pages/get-involved.md | 2 +- website/i18n/cn/docusaurus-theme-classic/footer.json| 2 +- website/sidebars.js | 2 +- website/sidebarsCommunity.js| 2 +- website/sidebarsContribute.js | 2 +- website/src/components/JoinCommunity/index.js | 2 +- website/src/pages/powered-by.md | 2 +- website/versioned_docs/version-0.10.0/overview.md | 4 ++-- website/versioned_docs/version-0.10.1/overview.md | 4 ++-- website/versioned_docs/version-0.11.0/overview.md | 4 ++-- website/versioned_docs/version-0.11.1/overview.md | 4 ++-- website/versioned_docs/version-0.12.0/overview.md | 4 ++-- website/versioned_docs/version-0.12.1/overview.md | 4 ++-- website/versioned_docs/version-0.12.2/overview.md | 4 ++-- website/versioned_docs/version-0.12.3/overview.md | 4 ++-- website/versioned_docs/version-0.13.0/overview.md | 4 ++-- website/versioned_docs/version-0.13.1/overview.md | 4 ++-- website/versioned_docs/version-0.14.0/overview.md | 4 ++-- website/versioned_docs/version-0.14.1/overview.md | 4 ++-- website/versioned_sidebars/version-0.10.0-sidebars.json | 2 +- website/versioned_sidebars/version-0.10.1-sidebars.json | 2 +- website/versioned_sidebars/version-0.11.0-sidebars.json | 2 +- website/versioned_sidebars/version-0.11.1-sidebars.json | 2 +- website/versioned_sidebars/version-0.12.0-sidebars.json | 2 +- website/versioned_sidebars/version-0.12.1-sidebars.json | 2 +- website/versioned_sidebars/version-0.12.2-sidebars.json | 2 +- website/versioned_sidebars/version-0.12.3-sidebars.json | 2 +- website/versioned_sidebars/version-0.13.0-sidebars.json | 2 +- website/versioned_sidebars/version-0.13.1-sidebars.json | 2 +- website/versioned_sidebars/version-0.14.0-sidebars.json | 2 +- website/versioned_sidebars/version-0.14.1-sidebars.json | 2 +- website/versioned_sidebars/version-0.9.0-sidebars.json | 2 +- 40 files changed, 57 insertions(+), 57 deletions(-) diff --git a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md index a06b1065601..2d90dea745b 100644 --- a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md +++ b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md @@ -54,4 +54,4 @@ All this said, there are still many ways we can improve upon this foundation. 
* While optimistic concurrency control is attractive when serializable snapshot isolation is desired, it's neither optimal nor the only method for dealing with concurrency between writers. We plan to implement a fully lock-free concurrency control using CRDTs and widely adopted stream processing concepts, over our log [merge API](https://github.com/apache/hudi/blob/bc8bf043d5512f7afbb9d94882c4e43ee61d6f06/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L [...] * Touching upon key constraints, Hudi is the only lake transactional layer that ensures unique [key](https://hudi.apache.org/docs/key_generation) constraints today, but limited to the record key of the table. We will be looking to expand this capability in a more general form to non-primary key
Re: [PR] [DOCS] Updates slack link across site [hudi]
bhasudha merged PR #10981: URL: https://github.com/apache/hudi/pull/10981 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
hudi-bot commented on PR #10965: URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043740065 ## CI report: * c41af6435281865147967768419da5e4fb688f8b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DOCS] Updates slack link across site [hudi]
bhasudha commented on PR #10981: URL: https://github.com/apache/hudi/pull/10981#issuecomment-2043726886 Tested locally ![Screenshot 2024-04-08 at 3 06 27 PM](https://github.com/apache/hudi/assets/2179254/9070ea06-7658-4f85-a627-10339de6051c) ![Screenshot 2024-04-08 at 3 05 08 PM](https://github.com/apache/hudi/assets/2179254/88681875-0166-43d1-a69f-e27508506c7c) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [DOCS] Updates slack link across site [hudi]
bhasudha opened a new pull request, #10981: URL: https://github.com/apache/hudi/pull/10981 ### Change Logs Update slack link due to expiry of old one. ### Impact Slack link update across website. ### Risk level (write none, low medium or high below) low. site update. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835052#comment-17835052 ] Jonathan Vexler commented on HUDI-6787: --- {code:java} root@adhoc-2:/opt# spark-submit \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer > $HUDI_UTILITIES_BUNDLE \ > --table-type COPY_ON_WRITE \ > --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \ > --source-ordering-field ts \ > --target-base-path /user/hive/warehouse/stock_ticks_cow \ > --target-table stock_ticks_cow --props > /var/demo/config/kafka-source.properties \ > --schemaprovider-class > org.apache.hudi.utilities.schema.FilebasedSchemaProvider 2024-04-08 21:13:35,067 WARN streamer.SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs 2024-04-08 21:13:35,211 INFO spark.SparkContext: Running Spark version 3.2.1 2024-04-08 21:13:35,247 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2024-04-08 21:13:35,346 INFO resource.ResourceUtils: == 2024-04-08 21:13:35,347 INFO resource.ResourceUtils: No custom resources configured for spark.driver. 2024-04-08 21:13:35,347 INFO resource.ResourceUtils: == 2024-04-08 21:13:35,348 INFO spark.SparkContext: Submitted application: streamer-stock_ticks_cow 2024-04-08 21:13:35,383 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 2024-04-08 21:13:35,396 INFO resource.ResourceProfile: Limiting resource is cpu 2024-04-08 21:13:35,396 INFO resource.ResourceProfileManager: Added ResourceProfile id: 0 2024-04-08 21:13:35,461 INFO spark.SecurityManager: Changing view acls to: root 2024-04-08 21:13:35,461 INFO spark.SecurityManager: Changing modify acls to: root 2024-04-08 21:13:35,462 INFO spark.SecurityManager: Changing view acls groups to: 2024-04-08 21:13:35,462 INFO spark.SecurityManager: Changing modify acls groups to: 2024-04-08 21:13:35,463 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() 2024-04-08 21:13:35,512 INFO Configuration.deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec 2024-04-08 21:13:35,513 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress 2024-04-08 21:13:35,513 INFO Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type 2024-04-08 21:13:35,750 INFO util.Utils: Successfully started service 'sparkDriver' on port 42169. 
2024-04-08 21:13:35,789 INFO spark.SparkEnv: Registering MapOutputTracker 2024-04-08 21:13:35,826 INFO spark.SparkEnv: Registering BlockManagerMaster 2024-04-08 21:13:35,848 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2024-04-08 21:13:35,850 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 2024-04-08 21:13:35,856 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat 2024-04-08 21:13:35,879 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-2e2fda2c-c1b4-4198-b790-58c00db5af27 2024-04-08 21:13:35,900 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MiB 2024-04-08 21:13:35,915 INFO spark.SparkEnv: Registering OutputCommitCoordinator 2024-04-08 21:13:36,009 INFO util.log: Logging initialized @2972ms to org.sparkproject.jetty.util.log.Slf4jLog 2024-04-08 21:13:36,135 INFO server.Server: jetty-9.4.43.v20210629; built: 2021-06-30T11:07:22.254Z; git: 526006ecfa3af7f1a27ef3a288e2bef7ea9dd7e8; jvm 1.8.0_212-b04 2024-04-08 21:13:36,162 INFO server.Server: Started @3125ms 2024-04-08 21:13:36,198 INFO server.AbstractConnector: Started ServerConnector@3e681bc{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} 2024-04-08 21:13:36,199 INFO util.Utils: Successfully started service 'SparkUI' on port 8090. 2024-04-08 21:13:36,241 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@55b62629{/jobs,null,AVAILABLE,@Spark} 2024-04-08 21:13:36,244 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15f193b8{/jobs/json,null,AVAILABLE,@Spark} 2024-04-08 21:13:36,245 INFO
Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]
wkhappy1 commented on issue #10979: URL: https://github.com/apache/hudi/issues/10979#issuecomment-2043650074 @ad1happy2go Yes, the table size is 27.1 G; that is the Hudi table in HDFS. And I see an RDD cached on disk with size 503.8 in the Spark UI. Can the cached RDD size be made smaller? It seems too big. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
hudi-bot commented on PR #10965: URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043606333 ## CI report: * c8e268903a19c7ecc5cd927fd8afa3332a1c3aea Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23133) * c41af6435281865147967768419da5e4fb688f8b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
kbuci commented on code in PR #10965: URL: https://github.com/apache/hudi/pull/10965#discussion_r1556395911 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java: ## @@ -1135,8 +1138,34 @@ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable */ protected HoodieWriteMetadata compact(String compactionInstantTime, boolean shouldComplete) { HoodieTable table = createTable(config, context.getHadoopConf().get()); +Option instantToCompactOption = Option.fromJavaOptional(table.getActiveTimeline() +.filterCompletedAndCompactionInstants() +.getInstants() +.stream() +.filter(instant -> HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime)) +.findFirst()); +try { + // Transaction serves to ensure only one compact job for this instant will start heartbeat, and any other concurrent + // compact job will abort if they attempt to execute compact before heartbeat expires + // Note that as long as all jobs for this table use this API for compact, then this alone should prevent + // compact rollbacks from running concurrently to compact commits. + txnManager.beginTransaction(instantToCompactOption, txnManager.getLastCompletedTransactionOwner()); + try { +if (!this.heartbeatClient.isHeartbeatExpired(compactionInstantTime)) { + throw new HoodieLockException("Cannot compact instant " + compactionInstantTime + " due to heartbeat by existing job"); +} + } catch (IOException e) {
throw new HoodieHeartbeatException("Error accessing heartbeat of instant to compact " + compactionInstantTime, e); + } + this.heartbeatClient.start(compactionInstantTime);
} finally { + txnManager.endTransaction(txnManager.getCurrentTransactionOwner()); +} preWrite(compactionInstantTime, WriteOperationType.COMPACT, table.getMetaClient()); -return tableServiceClient.compact(compactionInstantTime, shouldComplete); +HoodieWriteMetadata compactMetadata = tableServiceClient.compact(compactionInstantTime, shouldComplete); +this.heartbeatClient.stop(compactionInstantTime, true); Review Comment: I was looking into a UT failure in `org.apache.hudi.table.functional.TestHoodieSparkMergeOnReadTableInsertUpdateDelete#testRepeatedRollbackOfCompaction` where two compact executions of the same instant time are called back to back (my understanding is that this is supposed to verify that the second compact is a no-op and succeeds upon seeing that the plan is already committed). I realized that with this change, the second compact call was failing because it called `isHeartbeatExpired` and saw an active heartbeat (from the first attempt) still running, despite the fact that here we stop the heartbeat after successfully completing the compact. The reason that `isHeartbeatExpired` was unexpectedly `false` here is that 1. `isHeartbeatExpired` will return false if the instant time is too recent, even if the heartbeat has been stopped (in the in-memory mapping) 2. When `org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#stop(java.lang.String)` is called (by the first compact call in the UT) the heartbeat file is deleted and the heartbeat in the in-memory mapping is stopped (as expected). But this means that the heartbeat cannot be started again (even if (1) is resolved), since the heartbeat API doesn't allow the caller to start a heartbeat that is present in the in-memory mapping and has the heartbeatStopped flag set to true. 
In order to get around this issue, I added another API to the heartbeat client, similar to stop, except that it also removes the desired heartbeat from the in-memory mapping (forcing any future compact call in the same job to re-read the heartbeat files from DFS and create a new heartbeat in the in-memory mapping). Though I'm not sure whether there might be a better approach here. I assume this existing behavior isn't a bug, as it makes sense for commits that cannot be repeatedly re-executed (like ingestion COMMITs). I wonder if the reason this is causing an issue here stems from the fact that for compact we need to potentially restart stopped heartbeats repeatedly, and the heartbeat API might not have been intended for this use case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
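The "stop and evict" workaround described in that comment can be sketched as follows; `Heartbeat`, `start`, `stop`, and `stopAndRemove` are simplified stand-ins, not the actual HoodieHeartbeatClient methods:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatClientSketch {
    static final class Heartbeat {
        volatile boolean stopped;
    }

    private final Map<String, Heartbeat> inMemory = new ConcurrentHashMap<>();

    void start(String instantTime) {
        Heartbeat existing = inMemory.get(instantTime);
        if (existing != null && existing.stopped) {
            // Mirrors the limitation above: a stopped entry blocks any restart.
            throw new IllegalStateException("Cannot restart stopped heartbeat for " + instantTime);
        }
        inMemory.computeIfAbsent(instantTime, t -> new Heartbeat());
    }

    void stop(String instantTime) {
        Heartbeat hb = inMemory.get(instantTime);
        if (hb != null) {
            hb.stopped = true; // entry stays in the map, so start() above would now fail
        }
    }

    // The hypothetical addition: forget the instant entirely so a later compact
    // attempt in the same process re-reads heartbeat files from DFS and restarts.
    void stopAndRemove(String instantTime) {
        stop(instantTime);
        inMemory.remove(instantTime);
    }
}
```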
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
hudi-bot commented on PR #10965: URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043593538 ## CI report: * c8e268903a19c7ecc5cd927fd8afa3332a1c3aea Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23133) * c41af6435281865147967768419da5e4fb688f8b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
bvaradar commented on code in PR #10479: URL: https://github.com/apache/hudi/pull/10479#discussion_r1556399110 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java: ## @@ -86,7 +86,7 @@ public Option create(String partitionPath, String fileName, IOType type, H HoodieTimeline pendingReplaceTimeline = activeTimeline.filterPendingReplaceTimeline(); // TODO If current is compact or clustering then create marker directly without early conflict detection. // Need to support early conflict detection between table service and common writers. - if (pendingCompactionTimeline.containsInstant(instantTime) || pendingReplaceTimeline.containsInstant(instantTime)) { Review Comment: @jonvex: Wouldn't this cause an extra compaction plan read at each writing task level? Instead, can you see if we can pass this information from the driver? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
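A sketch of the driver-side alternative suggested here: answer "is this instant a pending compaction/clustering instant?" once on the driver and ship the flag to the writing tasks, instead of letting every task scan the timeline. All types and names below are illustrative, not the Hudi marker API:

```java
import java.io.Serializable;

// Shipped from the driver to each writing task.
final class MarkerContext implements Serializable {
    final String instantTime;
    final boolean isTableServiceInstant; // computed once from the timeline on the driver

    MarkerContext(String instantTime, boolean isTableServiceInstant) {
        this.instantTime = instantTime;
        this.isTableServiceInstant = isTableServiceInstant;
    }
}

final class WriteMarkersTaskSide {
    // Executor side: no timeline access needed, just consult the precomputed flag.
    void create(MarkerContext ctx, String partitionPath, String fileName) {
        if (ctx.isTableServiceInstant) {
            System.out.println("create marker directly, skip early conflict detection");
        } else {
            System.out.println("run early conflict detection before creating the marker");
        }
    }
}
```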
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
bvaradar commented on code in PR #10479: URL: https://github.com/apache/hudi/pull/10479#discussion_r1556399888 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java: ## @@ -516,13 +516,40 @@ public Option getLastClusteringInstant() { .findFirst()); } + @Override + public Option getFirstPendingClusterInstant() { +return getLastOrFirstPendingClusterInstant(false); + } + @Override public Option getLastPendingClusterInstant() { -return Option.fromJavaOptional(filterPendingReplaceTimeline() -.getReverseOrderedInstants() +return getLastOrFirstPendingClusterInstant(true); + } + + protected Option getLastOrFirstPendingClusterInstant(boolean getLast) { Review Comment: Make this private. Rename getLast to isLast -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043517085 ## CI report: * b9b3ae4c3025515e61eca8a7df887eb9fe764b0f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23151) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043429116 ## CI report: * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23150) * b9b3ae4c3025515e61eca8a7df887eb9fe764b0f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23151) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043418339 ## CI report: * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23150) * b9b3ae4c3025515e61eca8a7df887eb9fe764b0f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]
ad1happy2go commented on issue #10979: URL: https://github.com/apache/hudi/issues/10979#issuecomment-2043307873 @wkhappy1 As you said, the table size is 27.1 G. Is it a parquet table? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
jonvex commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043297073 @bvaradar org.apache.hudi.common.table.view.TestHoodieTableFileSystemView#testHoodieTableFileSystemViewWithPendingClustering is failing because that test relies on this behavior being broken -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]
kbuci commented on code in PR #10965: URL: https://github.com/apache/hudi/pull/10965#discussion_r1556157237 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java: ## @@ -1135,8 +1138,36 @@ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable */ protected HoodieWriteMetadata compact(String compactionInstantTime, boolean shouldComplete) { HoodieTable table = createTable(config, context.getHadoopConf().get()); +Option instantToCompactOption = Option.fromJavaOptional(table.getActiveTimeline() +.filterCompletedAndCompactionInstants() +.getInstants() +.stream() +.filter(instant -> HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime)) +.findFirst()); +try { + // Transaction serves to ensure only one compact job for this instant will start heartbeat, and any other concurrent + // compact job will abort if they attempt to execute compact before heartbeat expires + // Note that as long as all jobs for this table use this API for compact, then this alone should prevent + // compact rollbacks from running concurrently to compact commits. + txnManager.beginTransaction(instantToCompactOption, txnManager.getLastCompletedTransactionOwner()); Review Comment: > 1.2) if the heartbeat does not expire, just cancel the execution of this run and log a warning there. Just to clarify, do you mean throwing an exception in this run? I'm not sure we can make the current run a no-op if a concurrent heartbeat is detected, since a) I'm not sure what HoodieWriteMetadata value to return if we make this a no-op, and b) if we don't explicitly throw an exception and fail, then the caller will assume compaction happened successfully or already happened. This wouldn't be a correct assumption, since the other concurrent writer (that is currently executing this compact plan) may either fail or take a long time to finish. > if the state is still REQUESTED, we can execute it directly? Ah, that's a good point; it might actually be safe for two jobs to execute a compact plan at the same time as long as neither of them is doing a rollback. Despite that though, I don't think it's safe to skip acquiring and starting the heartbeat even if the compact plan only has .requested, since the following (unlikely) scenario can still happen: 1. Table has a compact plan C.requested created in timeline 2. Job (A) calls compact on C. It starts a heartbeat of C and then starts executing C 3. Job (B) calls compact on C. Although it sees a heartbeat for C, since C has no C.inflight it starts executing C 4. Job (A) and/or Job (B) create a C.inflight 5. Job (A) fails. 6. Heartbeat that Job (A) created expires 7. Job (C) calls compact on C. It sees that there is a C.inflight and no heartbeat (because Job (B) did not start any heartbeat). 
Therefore, it starts executing C, and rolls back the existing C.inflight -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
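The scenario above boils down to a guard: take the table lock, claim the instant with a heartbeat (aborting if one is already live), execute, then release the claim. A condensed, self-contained Java sketch with illustrative names; the lock stands in for txnManager, the set for heartbeat files on DFS, and none of these are the actual Hudi APIs:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class GuardedCompactSketch {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Set<String> liveHeartbeats = new HashSet<>();

    void compact(String instantTime) {
        tableLock.lock();
        try {
            // This check is what stops Job (B) in the scenario above: Job (A)'s
            // heartbeat is still live, so B aborts instead of double-executing C.
            if (!liveHeartbeats.add(instantTime)) {
                throw new IllegalStateException("Instant " + instantTime + " is already being compacted");
            }
        } finally {
            tableLock.unlock();
        }
        try {
            executePlanAndCommit(instantTime); // may create and complete C.inflight
        } finally {
            tableLock.lock();
            try {
                liveHeartbeats.remove(instantTime); // analogous to stop + evict from the in-memory map
            } finally {
                tableLock.unlock();
            }
        }
    }

    private void executePlanAndCommit(String instantTime) {
        // placeholder for tableServiceClient.compact(instantTime, true)
    }
}
```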
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043184042 ## CI report: * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23150) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6330) Update user document to introduce this feature
[ https://issues.apache.org/jira/browse/HUDI-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-6330: Assignee: Jing Zhang > Update user document to introduce this feature > -- > > Key: HUDI-6330 > URL: https://issues.apache.org/jira/browse/HUDI-6330 > Project: Apache Hudi > Issue Type: Sub-task > Components: docs, flink >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6330) Update user document to introduce this feature
[ https://issues.apache.org/jira/browse/HUDI-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834990#comment-17834990 ] Raymond Xu commented on HUDI-6330: -- [~jingzhang] thanks and merged! > Update user document to introduce this feature > -- > > Key: HUDI-6330 > URL: https://issues.apache.org/jira/browse/HUDI-6330 > Project: Apache Hudi > Issue Type: Sub-task > Components: docs, flink >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6330) Update user document to introduce this feature
[ https://issues.apache.org/jira/browse/HUDI-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-6330. Resolution: Fixed > Update user document to introduce this feature > -- > > Key: HUDI-6330 > URL: https://issues.apache.org/jira/browse/HUDI-6330 > Project: Apache Hudi > Issue Type: Sub-task > Components: docs, flink >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch asf-site updated: [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine (#10977)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 72b01a53d3d [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine (#10977) 72b01a53d3d is described below commit 72b01a53d3d22a51e9210b4b69f368e7388821e4 Author: Jing Zhang AuthorDate: Tue Apr 9 00:18:41 2024 +0800 [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine (#10977) --- website/docs/sql_dml.md| 80 -- website/releases/release-0.14.0.md | 4 +- 2 files changed, 78 insertions(+), 6 deletions(-) diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md index 90576dcb0e0..edb63730b13 100644 --- a/website/docs/sql_dml.md +++ b/website/docs/sql_dml.md @@ -323,12 +323,15 @@ In the below example, we have two streaming ingestion pipelines that concurrentl pipeline is responsible for the compaction and cleaning table services, while the other pipeline is just for data ingestion. -```sql +In order to commit the dataset, the checkpoint needs to be enabled, here is an example configuration for a flink-conf.yaml: +```yaml -- set the interval as 30 seconds execution.checkpointing.interval: 3 state.backend: rocksdb +``` --- This is a datagen source that can generates records continuously +```sql +-- This is a datagen source that can generate records continuously CREATE TABLE sourceT ( uuid varchar(20), name varchar(10), @@ -349,7 +352,7 @@ CREATE TABLE t1( `partition` varchar(20) ) WITH ( 'connector' = 'hudi', -'path' = '/Users/chenyuzhao/workspace/hudi-demo/t1', +'path' = '${work_path}/hudi-demo/t1', 'table.type' = 'MERGE_ON_READ', 'index.type' = 'BUCKET', 'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL', @@ -365,7 +368,7 @@ CREATE TABLE t1_2( `partition` varchar(20) ) WITH ( 'connector' = 'hudi', -'path' = '/Users/chenyuzhao/workspace/hudi-demo/t1', +'path' = '${work_path}/hudi-demo/t1', 'table.type' = 'MERGE_ON_READ', 'index.type' = 'BUCKET', 'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL', @@ -390,3 +393,72 @@ and `clean.async.enabled` options are used to disable the compaction and cleanin This is done to ensure that the compaction and cleaning services are not executed twice for the same table. +### Consistent hashing index (Experimental) + +We have introduced the Consistent Hashing Index since [0.13.0 release](/releases/release-0.13.0#consistent-hashing-index). In comparison to the static hashing index ([Bucket Index](/releases/release-0.11.0#bucket-index)), the consistent hashing index offers dynamic scalability of data buckets for the writer. +You can find the [RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) for the design of this feature. +In the 0.13.X release, the Consistent Hashing Index is supported only for Spark engine. And since [release 0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), the index is supported for Flink engine. + +To utilize this feature, configure the option `index.type` as `BUCKET` and set `hoodie.index.bucket.engine` to `CONSISTENT_HASHING`. +When enabling the consistent hashing index, it's important to enable clustering scheduling within the writer. During this process, the writer will perform dual writes for both the old and new data buckets while the clustering is pending. 
Although the dual write does not impact correctness, it is strongly recommended to execute clustering as quickly as possible. + +In the below example, we will create a datagen source and do streaming ingestion into Hudi table with consistent bucket index. In order to commit the dataset, the checkpoint needs to be enabled, here is an example configuration for a flink-conf.yaml: +```yaml +-- set the interval as 30 seconds +execution.checkpointing.interval: 3 +state.backend: rocksdb +``` + +```sql +-- This is a datagen source that can generate records continuously +CREATE TABLE sourceT ( +uuid varchar(20), +name varchar(10), +age int, +ts timestamp(3), +`partition` as 'par1' +) WITH ( +'connector' = 'datagen', +'rows-per-second' = '200' +); + +-- Create the hudi table with consistent bucket index +CREATE TABLE t1( +uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED, +name VARCHAR(10), +age INT, +ts TIMESTAMP(3), +`partition` VARCHAR(20) +) +PARTITIONED BY (`partition`) +WITH ( +'connector'='hudi', +'path' = '${work_path}/hudi-demo/hudiT', +'table.type' = 'MERGE_ON_READ', +'index.type' = 'BUCKET', +'clustering.schedule.enabled'='true', +'hoodie.index.bucket.engine'='CONSISTENT_HASHING', +
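For Java users, the same consistent-hashing setup shown in the doc's SQL-client example can be issued through the Flink Table API. The path below is a placeholder, checkpointing must still be enabled as in the flink-conf.yaml snippet above, and the hudi-flink bundle is assumed to be on the classpath:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ConsistentHashingBucketExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());
        tEnv.executeSql(
            "CREATE TABLE t1 ("
                + "  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,"
                + "  name VARCHAR(10), age INT, ts TIMESTAMP(3),"
                + "  `partition` VARCHAR(20)"
                + ") PARTITIONED BY (`partition`) WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = '/tmp/hudi-demo/hudiT',"                  // placeholder path
                + "  'table.type' = 'MERGE_ON_READ',"
                + "  'index.type' = 'BUCKET',"
                + "  'hoodie.index.bucket.engine' = 'CONSISTENT_HASHING',"
                + "  'clustering.schedule.enabled' = 'true'"            // required so buckets can split/merge
                + ")");
    }
}
```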
Re: [PR] [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine [hudi]
xushiyan merged PR #10977: URL: https://github.com/apache/hudi/pull/10977 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043093848 ## CI report: * 52afba2aa7c6ec4e0f8ca0f50eaf4a0639c53432 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21909) * 85e5016a10f9908c8116cd950dc46bbf74a8a558 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23149) * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043078181 ## CI report: * 52afba2aa7c6ec4e0f8ca0f50eaf4a0639c53432 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21909) * 85e5016a10f9908c8116cd950dc46bbf74a8a558 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23149) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch asf-site updated: [DOCS] Update blogs (#10971)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 06eb97ca409 [DOCS] Update blogs (#10971) 06eb97ca409 is described below commit 06eb97ca4093dc2069eadd008392c71e3a15ef90 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Mon Apr 8 08:35:40 2024 -0700 [DOCS] Update blogs (#10971) --- ...olutionary-journey-of-upstoxs-data-platform.mdx | 17 ...-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx | 20 +++ ...3-22-data-lake-cost-optimisation-strategies.mdx | 22 + ...able-formats-apache-iceberg-and-apache-hudi.mdx | 19 ++ ...dexing-apache-hudi-delivers-70-faster-point.mdx | 19 ++ ...reading-data-from-hudi-tables-joining-delta.mdx | 21 ...olutionary-journey-of-upstoxs-data-platform.png | Bin 0 -> 454009 bytes ...-Modern-Datalakes-with-Hudi--MinIO--and-HMS.jpg | Bin 0 -> 55802 bytes ...3-22-data-lake-cost-optimisation-strategies.png | Bin 0 -> 202437 bytes ...able-formats-apache-iceberg-and-apache-hudi.png | Bin 0 -> 488494 bytes ...dexing-apache-hudi-delivers-70-faster-point.png | Bin 0 -> 139221 bytes ...reading-data-from-hudi-tables-joining-delta.png | Bin 0 -> 92968 bytes 12 files changed, 118 insertions(+) diff --git a/website/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.mdx b/website/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.mdx new file mode 100644 index 000..ac2a5a2ad3f --- /dev/null +++ b/website/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.mdx @@ -0,0 +1,17 @@ +--- +title: "Navigating the Future: The Evolutionary Journey of Upstox’s Data Platform" +author: Manish Gaurav +category: blog +image: /assets/images/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.png +tags: +- use-case +- apache hudi +- upstox-engineering +--- + + + +import Redirect from '@site/src/components/Redirect'; + +<Redirect url="https://medium.com/upstox-engineering/navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform-92dc10ff22ae">Redirecting... please wait!!</Redirect> + diff --git a/website/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx b/website/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx new file mode 100644 index 000..915b2426f0d --- /dev/null +++ b/website/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx @@ -0,0 +1,20 @@ +--- +title: "Modern Datalakes with Hudi, MinIO, and HMS" +author: Brenna Buuck +category: blog +image: /assets/images/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.jpg +tags: +- blog +- apache hudi +- minio +- hms +- hive metastore +- min +--- + + + +import Redirect from '@site/src/components/Redirect'; + +<Redirect url="https://blog.min.io/datalakes-with-hudi-and-hms/">Redirecting... please wait!!</Redirect> 
+ diff --git a/website/blog/2024-03-22-data-lake-cost-optimisation-strategies.mdx b/website/blog/2024-03-22-data-lake-cost-optimisation-strategies.mdx new file mode 100644 index 000..351bf85b25c --- /dev/null +++ b/website/blog/2024-03-22-data-lake-cost-optimisation-strategies.mdx @@ -0,0 +1,22 @@ +--- +title: "Cost Optimization Strategies for scalable Data Lakehouse" +author: Suresh Hasundi +category: blog +image: /assets/images/blog/2024-03-22-data-lake-cost-optimisation-strategies.png +tags: +- blog +- apache hudi +- amazon s3 +- amazon emr +- apcache spark +- lakehouse +- cost optimization +- halodoc +--- + + + +import Redirect from '@site/src/components/Redirect'; + +<Redirect url="https://blogs.halodoc.io/data-lake-cost-optimisation-strategies/">Redirecting... please wait!!</Redirect> + diff --git a/website/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.mdx b/website/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.mdx new file mode 100644 index 000..0a3e0050139 --- /dev/null +++ b/website/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.mdx @@ -0,0 +1,19 @@ +--- +title: "Options on Kafka sink to open table Formats: Apache Iceberg and Apache Hudi" +author: Albert Wong +category: blog +image: /assets/images/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.png +tags: +- blog +- apache hudi +- apache iceberg +- apache Kafka +- kafka connect +- starrocks +- devgenius +--- + +import Redirect from '@site/src/components/Redirect'; + +<Redirect url="https://blog.devgenius.io/options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi-f6839ddad978">Redirecting... please wait!!</Redirect> + diff --git
Re: [PR] [DOCS] Update blogs [hudi]
xushiyan merged PR #10971: URL: https://github.com/apache/hudi/pull/10971 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]
hudi-bot commented on PR #10479: URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043060978 ## CI report: * 52afba2aa7c6ec4e0f8ca0f50eaf4a0639c53432 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21909) * 85e5016a10f9908c8116cd950dc46bbf74a8a558 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7576] add partitionPath as an instance variable to HoodieBaseFile and HoodieLogFile [hudi]
the-other-tim-brown commented on PR #10975: URL: https://github.com/apache/hudi/pull/10975#issuecomment-2042932190 > > > Can you explain why? > > > > > > Because it represents a "File"; the partition notion kind of belongs to the table, which was first introduced by Hive to resolve scalability issues. > > Ok, why does it contain the commit and file group? The logic presented here does not seem to apply to the existing class. This class contains metadata relevant to grouping the file with other related files. > > If the issue is that this is too big a change to introduce, I can look for other options, but I think there needs to be some consistency in what is added to these classes. @nsivabalan and @yihua let me know what you would prefer as well. I can also put up a draft where I limit the changes to the `AbstractTableFileSystemView` and some of the supporting utils, which will decrease the size of the PR if we want to punt on this discussion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
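For concreteness, the shape HUDI-7576 proposes (per the PR title) can be sketched as below. This is not the actual HoodieBaseFile class; field and constructor names are illustrative:

```java
import java.util.Objects;

// Carry the partition path on the file handle itself so file-system-view code
// does not have to thread a separate (partition, file) pair around.
final class BaseFileSketch {
    private final String partitionPath; // the proposed addition
    private final String fileId;        // file group id the file belongs to
    private final String commitTime;    // instant that produced the file

    BaseFileSketch(String partitionPath, String fileId, String commitTime) {
        this.partitionPath = Objects.requireNonNull(partitionPath);
        this.fileId = Objects.requireNonNull(fileId);
        this.commitTime = Objects.requireNonNull(commitTime);
    }

    String getPartitionPath() {
        return partitionPath;
    }
}
```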
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
the-other-tim-brown commented on code in PR #10976: URL: https://github.com/apache/hudi/pull/10976#discussion_r1555936753 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java: ## @@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi */ protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) { this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline(); +this.timelineHashAndPendingReplaceInstants = null; + } + + /** + * Get a list of pending replace instants. Caches the result for the active timeline. + * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called. + * + * @return list of pending replace instant timestamps + */ + private List getPendingReplaceInstants() { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); Review Comment: Regarding threading, should we just make this whole method synchronized? Regarding caching, I'm open to whatever seems best. I noticed that the cache in HoodieDefaultTimeline is limited to a single instance and not a global cache. It seems like it may make sense long term to return the same instance whenever possible to benefit from this cache. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
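A sketch of the lazy, invalidate-on-refresh cache under discussion, with the whole getter synchronized as suggested in the comment. Names are illustrative, not the exact AbstractTableFileSystemView fields:

```java
import java.util.Collections;
import java.util.List;

class PendingReplaceCacheSketch {
    private List<String> cachedPendingReplaceInstants; // null means "not computed yet"

    // Called from refreshTimeline(...): drop the cache so the next read recomputes.
    synchronized void invalidate() {
        cachedPendingReplaceInstants = null;
    }

    synchronized List<String> getPendingReplaceInstants() {
        if (cachedPendingReplaceInstants == null) {
            cachedPendingReplaceInstants =
                Collections.unmodifiableList(scanTimelineForPendingReplace());
        }
        return cachedPendingReplaceInstants;
    }

    private List<String> scanTimelineForPendingReplace() {
        // placeholder for the metaClient.getActiveTimeline().filterPendingReplaceTimeline() scan
        return List.of();
    }
}
```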
Re: [I] [SUPPORT]The number of tasks in each distinct stage of building workload profile is always 60 [hudi]
MrAladdin closed issue #10972: [SUPPORT]The number of tasks in each distinct stage of building workload profile is always 60 URL: https://github.com/apache/hudi/issues/10972 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]The number of tasks in each distinct stage of building workload profile is always 60 [hudi]
MrAladdin commented on issue #10972: URL: https://github.com/apache/hudi/issues/10972#issuecomment-2042898188

> @MrAladdin Can you provide the writer configurations you are using?

Sorry, I forgot to disable `hoodie.metadata.index.async`.
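For anyone hitting the same symptom, a minimal sketch of explicitly turning that option off on a Spark writer follows; `df` is assumed to be the incoming `Dataset<Row>`, and the table name, record key field, and path are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Sketch only: `df` is an existing Dataset<Row> built elsewhere.
df.write().format("hudi")
    .option("hoodie.table.name", "my_table")                    // placeholder
    .option("hoodie.datasource.write.recordkey.field", "uuid")  // placeholder
    .option("hoodie.metadata.index.async", "false")             // the option left enabled above
    .mode(SaveMode.Append)
    .save("/tmp/hudi/my_table");                                // placeholder path
```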
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042891233

## CI report:

* 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
Re: [I] Nested object support in Hudi Table using Flink [hudi]
waytoharish closed issue #10895: Nested object support in Hudi Table using Flink URL: https://github.com/apache/hudi/issues/10895
Re: [I] Nested object support in Hudi Table using Flink [hudi]
waytoharish commented on issue #10895: URL: https://github.com/apache/hudi/issues/10895#issuecomment-2042890195

Thanks @ad1happy2go @danny0405, it worked for me after using GenericRowData. I am closing the issue.
Re: [PR] [HUDI-7576] add partitionPath as an instance variable to HoodieBaseFile and HoodieLogFile [hudi]
the-other-tim-brown commented on PR #10975: URL: https://github.com/apache/hudi/pull/10975#issuecomment-2042690168

> > Can you explain why?
>
> Because it represents a "File"; the partition notion really belongs to the table, and was first introduced by Hive to resolve scalability issues.

OK, then why does it contain the commit and the file group? The logic presented here does not seem to apply to the existing class. This class contains metadata for grouping the file with other related files.

If the issue is that this is too big a change to introduce, I can look for other options, but I think there needs to be some consistency in what is added to these classes. @nsivabalan and @yihua, let me know what you would prefer as well.
Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]
wkhappy1 commented on issue #10979: URL: https://github.com/apache/hudi/issues/10979#issuecomment-2042672365

@ad1happy2go the input data is a DataFrame computed from other tables.
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042643487

## CI report:

* 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
* 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042627787

## CI report:

* 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
* 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c UNKNOWN
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980: URL: https://github.com/apache/hudi/pull/10980#discussion_r1555662848

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
    this.preserveMetadata = true;
    init(fileId, this.partitionPath, dataFileToBeMerged);
    validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+   // if the old schema equals the new schema, avoid heavy rewriting
+   if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+     LOG.info("Using update instead of rewriting during compaction");
+     copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment: Not exactly. The behavior is consistent with the old behavior.

(screenshots: https://github.com/apache/hudi/assets/1525333/e254eab0-9c22-4658-a4a5-cc8faae9d2af and https://github.com/apache/hudi/assets/1525333/438c9ee9-1189-4928-9c48-e102625c5967)

In the screenshots above, if `config.populateMetaFields()` is true for a compaction job, the `oldSchema` equals `writeSchemaWithMetaFields`.
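To make the dispatch in the diff concrete, here is a compressed sketch of the idea: when the compaction reader schema already matches the writer schema with meta fields, the old record can be reused as-is, with only the Hudi metadata columns refreshed. `copyOldFunc` and `updateMetadataToOldRecord` come from the PR; the fallback rewrite method named here is hypothetical.

```java
// Sketch of the copy-function dispatch from the diff above.
if (config.populateMetaFields() && useWriterSchemaForCompaction) {
  // old records already carry the full schema, so only meta fields need updating
  LOG.info("Using update instead of rewriting during compaction");
  copyOldFunc = (key, record, schema, props) -> updateMetadataToOldRecord(key, record, schema, props);
} else {
  // schemas may differ, so every copied record must be rewritten into the new schema
  copyOldFunc = (key, record, schema, props) -> rewriteRecordIntoNewSchema(key, record, schema, props); // hypothetical fallback
}
```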
Re: [PR] [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine [hudi]
beyond1920 commented on code in PR #10977: URL: https://github.com/apache/hudi/pull/10977#discussion_r1555701329

## website/docs/sql_dml.md:

@@ -390,3 +390,70 @@ and `clean.async.enabled` options are used to disable the compaction and cleanin
 This is done to ensure that the compaction and cleaning services are not executed twice for the same table.
+
+### Consistent hashing index (Experimental)
+
+The Consistent Hashing Index has been available since the [0.13.0 release](/releases/release-0.13.0#consistent-hashing-index). Compared to the static hashing index ([Bucket Index](/releases/release-0.11.0#bucket-index)), the consistent hashing index offers dynamic scalability of data buckets for the writer.
+You can find the [RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) for the design of this feature.
+In the 0.13.x releases, the Consistent Hashing Index is supported only for the Spark engine; since [release 0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), it is also supported for the Flink engine.
+
+In the example below, a streaming ingestion pipeline writes to a table with the consistent bucket index.
+To use this feature, set the option `index.type` to `BUCKET` and `hoodie.index.bucket.engine` to `CONSISTENT_HASHING`.
+When enabling the consistent hashing index, it is important to enable clustering scheduling within the writer. While the clustering is pending, the writer performs dual writes to both the old and the new data buckets. Although the dual write does not impact correctness, it is strongly recommended to execute the clustering as quickly as possible.
+
+```sql
+-- set the checkpoint interval to 30 seconds
+execution.checkpointing.interval: 30s
+state.backend: rocksdb
+
+-- This is a datagen source that generates records continuously
+CREATE TABLE sourceT (
+  uuid varchar(20),
+  name varchar(10),
+  age int,
+  ts timestamp(3),
+  `partition` as 'par1'
+) WITH (
+  'connector' = 'datagen',
+  'rows-per-second' = '200'
+);
+
+-- Create the hudi table with consistent bucket index
+CREATE TABLE t1 (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${work_path}/hudi-demo/hudiT',
+  'table.type' = 'MERGE_ON_READ',
+  'index.type' = 'BUCKET',
+  'clustering.schedule.enabled' = 'true',
+  'hoodie.index.bucket.engine' = 'CONSISTENT_HASHING',
+  'hoodie.clustering.plan.strategy.class' = 'org.apache.hudi.client.clustering.plan.strategy.FlinkConsistentBucketClusteringPlanStrategy',
+  'hoodie.clustering.execution.strategy.class' = 'org.apache.hudi.client.clustering.run.strategy.SparkConsistentBucketClusteringExecutionStrategy',
+  'hoodie.bucket.index.num.buckets' = '8',
+  'hoodie.bucket.index.max.num.buckets' = '128',
+  'hoodie.bucket.index.min.num.buckets' = '8',
+  'hoodie.bucket.index.split.threshold' = '1.5',
+  'write.tasks' = '2'
+);
+
+-- submit the pipeline
+insert into t1 select * from sourceT;
+
+select * from t1 limit 20;
+```
+
+:::caution
+The Consistent Hashing Index is supported for the Flink engine since [release 0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), and as of 0.14.0 it has some limitations:
+
+- The index is supported only for MOR tables. This limitation also exists when using the Spark engine.
+- It does not work with the metadata table enabled. This limitation also exists when using the Spark engine.
+- The consistent hashing index does not work with bulk-insert on the Flink engine yet; please use the simple bucket index or the Spark engine for bulk-insert pipelines.
+- The resize plan generated by the Flink engine does not support merging small file groups yet; it only supports splitting large file groups.
+- The resize plan should be executed through an offline Spark job.

Review Comment: The Flink engine does not support executing the resize plan yet; it has to be run through an offline Spark job, as in the sketch below.
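A minimal sketch of what that offline Spark job could look like with the Java write client; `engineContext` (a `HoodieSparkEngineContext`) and `writeConfig` (a `HoodieWriteConfig` pointing at the table, configured with the consistent-bucket clustering strategies shown above) are assumed to be constructed elsewhere, and the API shape follows `SparkRDDWriteClient` in recent releases, so verify it against your Hudi version.

```java
import org.apache.hudi.client.SparkRDDWriteClient;
import org.apache.hudi.common.util.Option;

// Sketch: schedule and execute the clustering (resize) plan offline.
// `engineContext` and `writeConfig` are assumptions set up by the caller.
try (SparkRDDWriteClient<?> client = new SparkRDDWriteClient<>(engineContext, writeConfig)) {
  Option<String> instant = client.scheduleClustering(Option.empty()); // plan the resize
  if (instant.isPresent()) {
    client.cluster(instant.get(), true); // execute and commit the plan
  }
}
```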
Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]
ad1happy2go commented on issue #10979: URL: https://github.com/apache/hudi/issues/10979#issuecomment-2042560266

@wkhappy1 What is the format of your input data?
Re: [PR] [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine [hudi]
beyond1920 commented on code in PR #10977: URL: https://github.com/apache/hudi/pull/10977#discussion_r1555696442

## website/docs/sql_dml.md:

+-- This is a datagen source that generates records continuously

Review Comment: I prefer to add the source table schema here in order to keep the demo complete. Then users can conveniently copy the complete demo and run it in the SQL Client.