Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044184717

   
   ## CI report:
   
   * c382de2b71540404831449de82e40d9488a38575 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Duplicate Row in Same Partition using Global Bloom Index [hudi]

2024-04-08 Thread via GitHub


Raghvendradubey commented on issue #9536:
URL: https://github.com/apache/hudi/issues/9536#issuecomment-2044164961

   Hi @ad1happy2go @nsivabalan, after migrating to the new Hudi version 0.14.0 
I didn't face this issue again. Thanks for your support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] spark structured streaming failed to update MDT metadata [hudi]

2024-04-08 Thread via GitHub


Qiuzhuang commented on issue #10891:
URL: https://github.com/apache/hudi/issues/10891#issuecomment-2044133901

   > but wouldn't the in-process lock provider kick in and avoid multiple 
writers to the MDT? I am assuming the setup is Spark streaming w/ async 
compaction or clustering: a single process, but multiple threads trying to 
ingest to the MDT. If the in-process lock provider is not kicking in, then 
it's a bug.
   
   If async clustering is in the same process, we don't run into the issue for 
now. But for multiple writers, like offline clustering in another process, as 
indicated by @danny0405, we should have a ZK lock provider to serialize MDT 
writes.
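
   For reference, a minimal sketch of wiring the ZK lock provider being 
suggested (the option keys are Hudi's documented lock configs as I recall 
them; the host, lock key, and paths are placeholders, and `df` is an assumed 
`Dataset<Row>`):
   
   ```java
   // Sketch: serialize MDT writes across processes with a ZooKeeper-based
   // lock provider. ZK coordinates and paths below are placeholders.
   df.write().format("hudi")
       .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
       .option("hoodie.write.lock.provider",
           "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
       .option("hoodie.write.lock.zookeeper.url", "zk-host")           // placeholder
       .option("hoodie.write.lock.zookeeper.port", "2181")             // placeholder
       .option("hoodie.write.lock.zookeeper.lock_key", "my_table")     // placeholder
       .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks") // placeholder
       .mode("append")
       .save("/path/to/table");                                        // placeholder
   ```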


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044130880

   
   ## CI report:
   
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   * c382de2b71540404831449de82e40d9488a38575 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044125667

   
   ## CI report:
   
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   * c382de2b71540404831449de82e40d9488a38575 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on code in PR #10635:
URL: https://github.com/apache/hudi/pull/10635#discussion_r1556835447


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:
##
@@ -200,6 +200,11 @@ public static HoodieWriteConfig createMetadataWriteConfig(
   builder.withProperties(datadogConfig.build().getProps());
   break;
 case PROMETHEUS:
+  HoodieMetricsPrometheusConfig prometheusConfig = 
HoodieMetricsPrometheusConfig.newBuilder()
+  .withPushgatewayLabels(writeConfig.getPushGatewayLabels())
+  .withPrometheusPortNum(writeConfig.getPrometheusPort()).build();

Review Comment:
   I checked the Prometheus reporter, and we only need the Prometheus port and 
push gateway labels.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on code in PR #10635:
URL: https://github.com/apache/hudi/pull/10635#discussion_r1556836048


##
hudi-common/src/main/java/org/apache/hudi/metrics/Metrics.java:
##
@@ -176,4 +190,16 @@ public static boolean isInitialized(String basePath) {
 }
 return false;
   }
+
+  /**
+   * Use the same base path as the hudi table so that Metrics instance is 
shared.
+   */
+  private static String getBasePath(HoodieMetricsConfig metricsConfig) {
+String basePath = metricsConfig.getBasePath();
+if (basePath.endsWith(HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH)) {

Review Comment:
   my bad. 
   ```
 public static final String METADATA_TABLE_FOLDER_PATH = METAFOLDER_NAME + 
Path.SEPARATOR + "metadata"; 
   ```
   
   Looks like we already account for what I asked for.
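
   For context, a plausible completion of the `getBasePath` helper under 
review (the substring logic is a reconstruction based on the check shown in 
the diff, not a verbatim quote of the patch):
   
   ```java
   // Reconstruction: strip the MDT suffix (and its leading separator) so the
   // metadata table reuses the data table's Metrics instance.
   private static String getBasePath(HoodieMetricsConfig metricsConfig) {
     String basePath = metricsConfig.getBasePath();
     if (basePath.endsWith(HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH)) {
       basePath = basePath.substring(
           0, basePath.length() - HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH.length() - 1);
     }
     return basePath;
   }
   ```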



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7395] Fix computation for metrics in HoodieMetadataMetrics [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on PR #10641:
URL: https://github.com/apache/hudi/pull/10641#issuecomment-2044100016

   hey @prashantwason: let's decouple the fixes.
   a. Fixing MDT to emit writer-side metrics (commit duration, compaction 
duration, etc.)
   b. Fixing MDT to emit reader-side metrics (column stats lookup duration, 
etc.) via the distributed registry.
   
   I feel we should focus on (a) in this patch and get it landed, and you can 
put out a patch (I assume you folks already have a fix) for distributed 
registry based metrics from the executors.
   
   If you are aligned on that, let us know if you have any feedback on this 
patch, or if we are good to go ahead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   > Set the log as debug level
   
   Using info level here does not cost much, right? It only prints the log in 
the class constructor, not for each input record.
   
   >  "instead" -> "instead of".
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556817370


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> 
this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   Good question.
   The responsible of this method is only merging base record and incremental 
record, not including handle schema evolution. 
   Handling schema evolution happens before call the `HoodieMergeHandle#write` 
method.
   https://github.com/apache/hudi/assets/1525333/3a03e08b-fe2e-4da6-a788-07cbb6feeadd;>
   https://github.com/apache/hudi/assets/1525333/def1f2ee-ed97-47f8-92b6-76d45500bea7;>
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] spark structured streaming failed to update MDT metadata [hudi]

2024-04-08 Thread via GitHub


xicm commented on issue #10891:
URL: https://github.com/apache/hudi/issues/10891#issuecomment-2044061311

   The root cause is that a new deltacommit in the MDT rolls back the 
compaction instant (compaction in the MDT is a deltacommit).
   
   When a compaction starts, it creates an **inflight deltacommit** in the 
MDT. Because the compaction is asynchronous, data ingestion goes on, and the 
writer starts a new deltacommit in both the data table and the MDT. In the 
MDT, the new deltacommit rolls back the uncompleted deltacommit (the one 
created by the async compaction).
   
   Is it possible to filter out the deltacommits created by compaction in the 
MDT when we do rollback?
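
   For illustration, a minimal sketch of such filtering (the 
`isCreatedByCompaction` predicate is hypothetical; Hudi would need some 
marker, e.g. a reserved instant-time suffix, to recognize deltacommits that 
back an MDT compaction):
   
   ```java
   // Hypothetical sketch: when picking uncompleted MDT deltacommits to roll
   // back, skip the inflight deltacommit backing a still-running async
   // compaction. isCreatedByCompaction is an assumed helper, not a Hudi API.
   List<HoodieInstant> toRollback = metadataMetaClient.getActiveTimeline()
       .getDeltaCommitTimeline()
       .filterInflightsAndRequested()
       .getInstants()
       .stream()
       .filter(instant -> !isCreatedByCompaction(instant))
       .collect(Collectors.toList());
   ```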


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Exception when executing log compaction: Unsupported Operation Exception [hudi]

2024-04-08 Thread via GitHub


MrAladdin opened a new issue, #10982:
URL: https://github.com/apache/hudi/issues/10982

   
   **Describe the problem you faced**
   
   1. Spark upsert to a Hudi (MOR) table.
   2. An exception occurs when executing log compaction: Unsupported Operation 
Exception.
   3. org.apache.hudi.exception.HoodieRollbackException: Unknown listing type, 
during rollback of [==>20240409000634923005__logcompaction__INFLIGHT]
   I also want to know why, after a log compaction exception, the instant 
remains in an inflight state and the program does not exit abnormally.
   
   
   **Environment Description**
   
   * Hudi version: 0.14.1
   
   * Spark version: 3.4.1
   
   * Hive version: 3.1.2
   
   * Hadoop version: 3.1.3
   
   * Storage (HDFS/S3/GCS..): HDFS
   
   * Running on Docker? (yes/no): no
   
   
   **Additional context**
   
 .option("hoodie.metadata.enable", "true")
 .option("hoodie.metadata.index.async", "false")
 .option("hoodie.metadata.index.check.timeout.seconds", "900")
 .option("hoodie.auto.adjust.lock.configs", "true")
 .option("hoodie.metadata.optimized.log.blocks.scan.enable", "true")
 .option("hoodie.metadata.metrics.enable", "false")
 .option("hoodie.metadata.index.column.stats.enable", "false")
 .option("hoodie.metadata.compact.max.delta.commits", "10") 
   
 .option("hoodie.metadata.record.index.enable", "true")
 .option("hoodie.index.type", "RECORD_INDEX")
 .option("hoodie.metadata.max.init.parallelism", "10")
 .option("hoodie.metadata.record.index.min.filegroup.count", "10")
 .option("hoodie.metadata.record.index.max.filegroup.count", 
"1")
 .option("hoodie.metadata.record.index.max.filegroup.size", 
"1073741824")
 .option("hoodie.metadata.auto.initialize", "true")
 .option("hoodie.metadata.record.index.growth.factor", "2.0")
 .option("hoodie.metadata.max.logfile.size", "2147483648")
 .option("hoodie.metadata.log.compaction.enable", "true")
 .option("hoodie.metadata.log.compaction.blocks.threshold", "5")
   
 .option("hoodie.write.concurrency.mode", 
"optimistic_concurrency_control")
 .option("hoodie.write.lock.provider", 
"org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider")
 .option("hoodie.write.lock.filesystem.expire", "10")
   
   
   **Stacktrace**
   
   one exception:
   Job aborted due to stage failure: Task 6 in stage 203.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 203.0 (TID 4263) (11.slave.hdp executor 
13): org.apache.hudi.exception.HoodieException: Unsupported Operation Exception
at 
org.apache.hudi.common.util.collection.BitCaskDiskMap.values(BitCaskDiskMap.java:302)
at 
org.apache.hudi.common.util.collection.ExternalSpillableMap.values(ExternalSpillableMap.java:275)
at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.handleInsertsForLogCompaction(HoodieSparkMergeOnReadTable.java:206)
at 
org.apache.hudi.table.action.compact.LogCompactionExecutionHelper.writeFileAndGetWriteStats(LogCompactionExecutionHelper.java:79)
at 
org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:237)
at 
org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$988df80a$1(HoodieCompactor.java:132)
at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:326)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 

Re: [I] [Inquiry] Does HoodieIndexer can Do Indexing for RLI Async Fashion [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10815:
URL: https://github.com/apache/hudi/issues/10815#issuecomment-2044048808

   hey @ad1happy2go @codope: looks like there is some misunderstanding on how 
to use the async indexer.
   When enabling the async indexer to build, say, RLI, the ingestion job 
should also have async indexing enabled for RLI; we can't completely disable 
it from the regular ingestion job. Can you folks follow up on any doc 
enhancements?
   CC @soumilshah1995


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Duplicate Row in Same Partition using Global Bloom Index [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9536:
URL: https://github.com/apache/hudi/issues/9536#issuecomment-2044042739

   hey @Raghvendradubey: any follow-ups on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Data loss occurs when using bulkinsert [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-2044042481

   hey @ad1happy2go: any follow-up on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9615:
URL: https://github.com/apache/hudi/issues/9615#issuecomment-2044040888

   We are going to attempt to fix this using completion markers. Will post an 
update shortly on how we plan to tackle this.
   But in the meantime, curious to know how you folks are detecting these 
additional parquet files. There are chances it could lead to duplicates, 
right? How are you folks managing to avoid data consistency issues?
   Until we have a proper fix, trying to gauge if we can suggest some 
workarounds for other Hudi OSS users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Enable Hudi Metadata Table and Multi-Modal Index bug [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9672:
URL: https://github.com/apache/hudi/issues/9672#issuecomment-2044037688

   hey @MorningGlow: any follow-ups on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9751:
URL: https://github.com/apache/hudi/issues/9751#issuecomment-2044036786

   hey @njalan @BruceKellan: any follow-ups on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9861:
URL: https://github.com/apache/hudi/issues/9861#issuecomment-2044035691

   hey @ad1happy2go: any follow-ups on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Compaction error [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9885:
URL: https://github.com/apache/hudi/issues/9885#issuecomment-2044033752

   hey @ad1happy2go: a reminder to follow up on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] AWS Athena query fails when compaction is scheduled for MOR table [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9907:
URL: https://github.com/apache/hudi/issues/9907#issuecomment-2044029051

   hey @codope @rahil-c: are the Athena-querying-Hudi issues all fixed as of 
now, or do we still have any pending gaps?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Data loss in MOR table after clustering partition [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9977:
URL: https://github.com/apache/hudi/issues/9977#issuecomment-2044027211

   hey @ad1happy2go: what's the follow-up on this? Do we need to make any 
fixes to Hudi, or doc enhancements, etc.?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10107:
URL: https://github.com/apache/hudi/issues/10107#issuecomment-2044026284

   hey @haoxie-aws: the linked PRs should fix the issue reported.
   Are you facing the issue after 0.14.1 as well?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-2044025853

   hey @noahtaite: any follow-ups on this?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Compaction & Clustering are not working [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10183:
URL: https://github.com/apache/hudi/issues/10183#issuecomment-2044025493

   hey @ad1happy2go: can you follow up on this?
   @Cpandey43: yes, you are right. Enabling async w/ batch writers like 
spark-ds does not mean much.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] INSERT_OVERWRITE_TABLE on subsequent runs fails with a metadata file not found error (v0.14.0) [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10445:
URL: https://github.com/apache/hudi/issues/10445#issuecomment-2044023506

   Just to get past the issue, you can completely delete the table and rewrite 
it, or use overwrite mode w/ Spark, until we have a proper fix.
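
   For reference, a minimal sketch of the overwrite workaround (`df`, 
`hudiOptions`, and `basePath` are assumed to exist in the job):
   
   ```java
   // Stop-gap: rewrite the table wholesale so its metadata is recreated.
   df.write().format("hudi")
       .options(hudiOptions)     // the table's existing write configs
       .mode(SaveMode.Overwrite) // recreates the table path
       .save(basePath);
   ```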


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10976:
URL: https://github.com/apache/hudi/pull/10976#discussion_r1556736368


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, 
HoodieTimeline visibleActi
*/
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
 this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getWriteTimeline();
+this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active 
timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is 
called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List getPendingReplaceInstants() {
+HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
   > Can't multiple threads access the same timeline?
   
   They could, and we should introduce a synchronized guard for access to the 
cache; we already did that for some caches in the timeline.
   
   > What do you mean by "map cache"?
   
   My typo, it's the "Pair" cache here.
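
   For illustration, a minimal sketch of such a guard (the field and method 
shape here are illustrative, not the PR's actual code):
   
   ```java
   // Illustrative double-checked guard: lazily compute and cache the pending
   // replace instants so concurrent readers trigger at most one computation.
   private volatile List<String> pendingReplaceInstants;
   
   private List<String> getPendingReplaceInstants(HoodieActiveTimeline activeTimeline) {
     List<String> cached = pendingReplaceInstants;
     if (cached == null) {
       synchronized (this) {
         if (pendingReplaceInstants == null) {
           pendingReplaceInstants = activeTimeline
               .filterPendingReplaceTimeline()
               .getInstants().stream()
               .map(HoodieInstant::getTimestamp)
               .collect(Collectors.toList());
         }
         cached = pendingReplaceInstants;
       }
     }
     return cached;
   }
   ```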



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10587:
URL: https://github.com/apache/hudi/issues/10587#issuecomment-2043999240

   hey @ad1happy2go: do let me know if we find any data consistency issues w/ 
MDT or RLI. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-2043998416

   and @ad1happy2go: if you encounter any bugs w.r.t. MDT or RLI, do keep me 
posted.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-2043998156

   hey @bksrepo: can you file a new issue?
   hey @ad1happy2go: if the original issue is resolved, can we close it out?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-2043996684

   hey @ad1happy2go: if this turns out to be an MDT data consistency issue, do 
keep me posted. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-7574) Auto-pilot for Flink Hudi sink tasks

2024-04-08 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835092#comment-17835092
 ] 

Vinoth Chandar commented on HUDI-7574:
--

We need to rethink these singleton tasks like cleaning etc. 

> Auto-pilot for Flink Hudi sink tasks
> 
>
> Key: HUDI-7574
> URL: https://issues.apache.org/jira/browse/HUDI-7574
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently the flink write task parallelism is set up through 
> {code:java}
> write.tasks{code}
> it is a fixed number during the lifecycle of the ingestion pipeline, 
> while for streaming there are always fluctuations in the workload; it would 
> be great if we could tune the parallelism of the write tasks dynamically 
> based on the job load profile.
> On K8s, Flink provides an autoscaler 
> ([https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/])
> which is suitable for this purpose and deserves further investigation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7574) Auto-pilot for Flink Hudi sink tasks

2024-04-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7574:
-
Status: In Progress  (was: Open)

> Auto-pilot for Flink Hudi sink tasks
> 
>
> Key: HUDI-7574
> URL: https://issues.apache.org/jira/browse/HUDI-7574
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently the flink write task parallelism is set up through 
> {code:java}
> write.tasks{code}
> it is a fixed number during the lifecycle of the ingestion pipeline, 
> while for streaming there are always fluctuations in the workload; it would 
> be great if we could tune the parallelism of the write tasks dynamically 
> based on the job load profile.
> On K8s, Flink provides an autoscaler 
> ([https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/])
> which is suitable for this purpose and deserves further investigation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] Duplicate data in base file of MOR table [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10882:
URL: https://github.com/apache/hudi/issues/10882#issuecomment-2043992885

   hey @ad1happy2go: if this is related to MDT, can you let me know? I am 
trying to take stock of all MDT data consistency related issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7577) Avoid MDT compaction instant time conflicts

2024-04-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7577:
-
Status: In Progress  (was: Open)

> Avoid MDT compaction instant time conflicts
> ---
>
> Key: HUDI-7577
> URL: https://issues.apache.org/jira/browse/HUDI-7577
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7572:
-
Reviewers: Ethan Guo, Sagar Sumit

> Avoid to schedule empty compaction plan without log files
> -
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> After the change to [loosen the compaction for 
> MDT|https://issues.apache.org/jira/browse/HUDI-7572], there is a rare case 
> where the same compaction instant time gets scheduled multiple times; we had 
> better optimize the compactor to avoid generating empty compaction plans.
> Note: although we have an active timeline check to avoid the repetitive 
> scheduling, there is still a small chance the compaction has already been 
> archived.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:33) [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10906:
URL: https://github.com/apache/hudi/issues/10906#issuecomment-2043989098

   CC @linliu-code 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] No way to clean `archived/` folder [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10930:
URL: https://github.com/apache/hudi/issues/10930#issuecomment-2043988319

   Maybe we should introduce an ArchivalClean table service to auto-clean 
files older than, say, 2 months. Not many users are going to inspect the 
archived timeline after 2+ months, and it will avoid accumulating the entire 
history. Interested users can still choose not to clean it up.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Feature Inquiry] index for randomized upserts [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10961:
URL: https://github.com/apache/hudi/issues/10961#issuecomment-2043987312

   Just a note: in 0.14.1, RLI is a substitute for a global index, not for any 
index.
   For example, if you were using BLOOM, you can't replace it w/ RLI.
   The current RLI cannot support the same record key across two partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Rollback failed clustering 0.12.2 [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10964:
URL: https://github.com/apache/hudi/issues/10964#issuecomment-2043986341

   hey @suryaprasanna: can you take this up and offer some suggestions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-08 Thread via GitHub


the-other-tim-brown commented on code in PR #10976:
URL: https://github.com/apache/hudi/pull/10976#discussion_r1556695938


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, 
HoodieTimeline visibleActi
*/
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
 this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getWriteTimeline();
+this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active 
timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is 
called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List getPendingReplaceInstants() {
+HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
   > > It seems like it may make sense long term to return the same instance 
whenever possible to benefit from this cache.
   > 
   > There should not be much difference because the map cache you use also 
has per-timeline granularity. The benefit of moving to the timeline itself is 
better maintainability.
   > 
   What do you mean by "map cache"?
   
   > And if we move the cache inside the timeline, there should not be thread 
access conflicts.
   
   Why is that? Can't multiple threads access the same timeline?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687381


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> 
this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   but it still uses the latest schema as the write schema; what about the 
case where the schema has already evolved?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687736


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   Set the log as debug level, "instead" -> "instead of".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1556682595


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1137,34 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))

Review Comment:
   we should only care about the pending instant, right? If the compaction has 
already completed, just skip this run.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1556682151


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,36 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());

Review Comment:
   yeah, even if the state is REQUESTED, we should check the heartbeat 
liveness.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1554475930


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,36 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());

Review Comment:
   When a conflict for the same compaction instant execution is detected, we 
can:
   
   1. check the state of the instant: if it is in `INFLIGHT` state and 1.1) the 
heartbeat has expired, we can just roll back the last execution and reattempt 
in this run; 1.2) if the heartbeat has not expired, just abort the execution 
in this run and log a warning there.
   2. if the state is still `REQUESTED`, we can execute it directly?
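
   Roughly, in code (helper names such as `isHeartbeatExpired` and 
`rollbackInflightCompaction` are illustrative placeholders, not exact Hudi 
APIs):
   
   ```java
   // Illustrative decision flow for a detected conflict on one instant.
   if (instant.getState() == HoodieInstant.State.REQUESTED) {
     executeCompaction(instant);               // 2. nobody started it: run it
   } else if (instant.getState() == HoodieInstant.State.INFLIGHT) {
     if (isHeartbeatExpired(instant.getTimestamp())) {
       rollbackInflightCompaction(instant);    // 1.1 stale attempt: roll back
       executeCompaction(instant);             //     and reattempt in this run
     } else {
       LOG.warn("Compaction {} is live in another writer; aborting this run.",
           instant.getTimestamp());            // 1.2 live attempt elsewhere
     }
   }
   ```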



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10976:
URL: https://github.com/apache/hudi/pull/10976#discussion_r1556677633


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, 
HoodieTimeline visibleActi
*/
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
 this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getWriteTimeline();
+this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active 
timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is 
called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List getPendingReplaceInstants() {
+HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
   > It seems like it may make sense long term to return the same instance 
whenever possible to benefit from this cache.
   
   There should be no much difference because the map cache you use also has 
per-timeline granularity. The benefit to move to the timeline itself is for 
better maintainance.
   
   And if we move the cache inside the timeline, there should not be thread 
access conflicts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10965:
URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043939357

   
   ## CI report:
   
   * e1a6e4a24083dd8871a2fc3fbb289e1a6192593a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23154)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7395] Fix computation for metrics in HoodieMetadataMetrics [hudi]

2024-04-08 Thread via GitHub


prashantwason commented on code in PR #10641:
URL: https://github.com/apache/hudi/pull/10641#discussion_r1556617105


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataMetrics.java:
##
@@ -136,7 +144,7 @@ public void updateMetrics(String action, long durationInMs) 
{
 String countKey = action + ".count";
 String durationKey = action + ".totalDuration";
 incrementMetric(countKey, 1);
-incrementMetric(durationKey, durationInMs);
+setMetric(durationKey, durationInMs);

Review Comment:
   You are assuming that code calling these functions would only call them 
once. That may not be a correct assumption for all cases - opening the MDT is 
costly, so multiple lookups etc. can be issued on open MDT readers.



##
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:
##
@@ -302,8 +303,8 @@ public Map 
readRecordIndex(List reco
 });
 
 metrics.ifPresent(m -> 
m.updateMetrics(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_TIME_STR, 
timer.endTimer()));
-metrics.ifPresent(m -> 
m.updateMetrics(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_KEYS_COUNT_STR, 
recordKeys.size()));
-metrics.ifPresent(m -> 
m.updateMetrics(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_KEYS_HITS_COUNT_STR, 
recordKeyToLocation.size()));
+metrics.ifPresent(m -> 
m.setMetric(HoodieMetadataMetrics.LOOKUP_RECORD_INDEX_KEYS_COUNT_STR, 
recordKeys.size()));

Review Comment:
   The same HoodieTableMetadata object can be used to look up keys from the 
MDT multiple times. In that case, update is more accurate.
   



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataMetrics.java:
##
@@ -73,10 +79,12 @@ public class HoodieMetadataMetrics implements Serializable {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieMetadataMetrics.class);
 
-  private final Registry metricsRegistry;
+  private final transient MetricRegistry metricsRegistry;
+  private final transient Metrics metrics;
 
-  public HoodieMetadataMetrics(Registry metricsRegistry) {
-this.metricsRegistry = metricsRegistry;
+  public HoodieMetadataMetrics(HoodieMetricsConfig metricsConfig) {

Review Comment:
   If you do not use Registry, then no metrics can be collected from the 
executors, where most of the operations on the MDT readers take place (for 
indexes other than the files index).
   
   E.g., RI lookup -> since there are multiple file groups in record_index, when 
looking up keys from the record index, each executor opens one file group of 
the record index and reads the keys that belong to that file group. When 
HoodieTableMetadata is serialized by Spark and sent to the executors, the 
executors end up updating a local copy of the metadata metrics. Since the 
publishing of the metrics is only done on the driver side, the metrics updated 
on the executor side never make it to the driver and hence are never published.
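   
   As a sketch of the general pattern (illustrative only, not the Hudi Registry 
implementation), Spark accumulators are one way executor-side counters can be 
merged back to the driver, where a publisher can then read the aggregated 
value:
   
   ```java
   import java.util.Arrays;
   
   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.util.LongAccumulator;
   
   class ExecutorMetricsSketch {
     static void run(JavaSparkContext jsc) {
       // Registered on the driver; Spark merges task-side updates back here.
       LongAccumulator lookupCount = jsc.sc().longAccumulator("mdt.lookup.count");
   
       jsc.parallelize(Arrays.asList(1, 2, 3, 4), 2)
          .foreach(key -> lookupCount.add(1));  // runs on the executors
   
       // Back on the driver, the merged value can be published to a registry.
       System.out.println("lookups = " + lookupCount.value());
     }
   }
   ```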



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10965:
URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043855651

   
   ## CI report:
   
   * c41af6435281865147967768419da5e4fb688f8b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153)
 
   * e1a6e4a24083dd8871a2fc3fbb289e1a6192593a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23154)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10965:
URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043839847

   
   ## CI report:
   
   * c41af6435281865147967768419da5e4fb688f8b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153)
 
   * e1a6e4a24083dd8871a2fc3fbb289e1a6192593a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi CLI bundle not working [hudi]

2024-04-08 Thread via GitHub


mansipp commented on issue #10566:
URL: https://github.com/apache/hudi/issues/10566#issuecomment-2043833097

   Getting a similar error while running `commit rollback`:
   ```
   commit rollback --commit 20240408231846380
   24/04/08 23:22:02 INFO InputStreamConsumer: Apr 08, 2024 11:22:02 PM 
org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
   24/04/08 23:22:02 INFO InputStreamConsumer: WARNING: 
spark.log4jHotPatch.enabled is set to true, but 
/usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not 
exist at the configured location
   24/04/08 23:22:02 INFO InputStreamConsumer:
   24/04/08 23:22:03 INFO InputStreamConsumer: Error: Failed to load 
org.apache.hudi.cli.commands.SparkMain: 
org/apache/hudi/common/engine/HoodieEngineContext
   24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO 
ShutdownHookManager: Shutdown hook called
   24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO 
ShutdownHookManager: Deleting directory 
/mnt/tmp/spark-272bb6ef-f858-42a6-b9d0-9614f1f36371
   24/04/08 23:22:03 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from 
s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/
   24/04/08 23:22:03 INFO HoodieTableConfig: Loading table properties from 
s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties
   24/04/08 23:22:03 INFO S3NativeFileSystem: Opening 
's3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties'
 for reading
   24/04/08 23:22:03 INFO HoodieTableMetaClient: Finished Loading Table of type 
MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from 
s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/
   Commit 20240408231846380 failed to roll back
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch asf-site updated: [DOCS] Updates slack link across site (#10981)

2024-04-08 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new a4ec3fc9016 [DOCS] Updates slack link across site (#10981)
a4ec3fc9016 is described below

commit a4ec3fc90168229b8d76dfd95b453d9da66cca36
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Mon Apr 8 15:35:46 2024 -0700

[DOCS] Updates slack link across site (#10981)
---
 ...021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md | 2 +-
 website/blog/2022-01-06-apache-hudi-2021-a-year-in-review.md| 4 ++--
 .../2022-01-14-change-data-capture-with-debezium-and-apache-hudi.md | 2 +-
 website/blog/2022-12-29-Apache-Hudi-2022-A-Year-In-Review.md| 6 +++---
 website/blog/2023-12-28-apache-hudi-2023-a-year-in-review.md| 2 +-
 website/community/get-involved.md   | 2 +-
 website/docs/overview.md| 4 ++--
 website/docusaurus.config.js| 4 ++--
 website/i18n/cn/docusaurus-plugin-content-pages/get-involved.md | 2 +-
 website/i18n/cn/docusaurus-theme-classic/footer.json| 2 +-
 website/sidebars.js | 2 +-
 website/sidebarsCommunity.js| 2 +-
 website/sidebarsContribute.js   | 2 +-
 website/src/components/JoinCommunity/index.js   | 2 +-
 website/src/pages/powered-by.md | 2 +-
 website/versioned_docs/version-0.10.0/overview.md   | 4 ++--
 website/versioned_docs/version-0.10.1/overview.md   | 4 ++--
 website/versioned_docs/version-0.11.0/overview.md   | 4 ++--
 website/versioned_docs/version-0.11.1/overview.md   | 4 ++--
 website/versioned_docs/version-0.12.0/overview.md   | 4 ++--
 website/versioned_docs/version-0.12.1/overview.md   | 4 ++--
 website/versioned_docs/version-0.12.2/overview.md   | 4 ++--
 website/versioned_docs/version-0.12.3/overview.md   | 4 ++--
 website/versioned_docs/version-0.13.0/overview.md   | 4 ++--
 website/versioned_docs/version-0.13.1/overview.md   | 4 ++--
 website/versioned_docs/version-0.14.0/overview.md   | 4 ++--
 website/versioned_docs/version-0.14.1/overview.md   | 4 ++--
 website/versioned_sidebars/version-0.10.0-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.10.1-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.11.0-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.11.1-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.12.0-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.12.1-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.12.2-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.12.3-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.13.0-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.13.1-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.14.0-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.14.1-sidebars.json | 2 +-
 website/versioned_sidebars/version-0.9.0-sidebars.json  | 2 +-
 40 files changed, 57 insertions(+), 57 deletions(-)

diff --git 
a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
 
b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
index a06b1065601..2d90dea745b 100644
--- 
a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
+++ 
b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
@@ -54,4 +54,4 @@ All this said, there are still many ways we can improve upon 
this foundation.
 *   While optimistic concurrency control is attractive when serializable 
snapshot isolation is desired, it's neither optimal nor the only method for 
dealing with concurrency between writers. We plan to implement a fully 
lock-free concurrency control using CRDTs and widely adopted stream processing 
concepts, over our log [merge 
API](https://github.com/apache/hudi/blob/bc8bf043d5512f7afbb9d94882c4e43ee61d6f06/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L
 [...]
 *   Touching upon key constraints, Hudi is the only lake transactional layer 
that ensures unique [key](https://hudi.apache.org/docs/key_generation) 
constraints today, but limited to the record key of the table. We will be 
looking to expand this capability in a more general form to non-primary key 

Re: [PR] [DOCS] Updates slack link across site [hudi]

2024-04-08 Thread via GitHub


bhasudha merged PR #10981:
URL: https://github.com/apache/hudi/pull/10981


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10965:
URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043740065

   
   ## CI report:
   
   * c41af6435281865147967768419da5e4fb688f8b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Updates slack link across site [hudi]

2024-04-08 Thread via GitHub


bhasudha commented on PR #10981:
URL: https://github.com/apache/hudi/pull/10981#issuecomment-2043726886

   Tested locally
   ![Screenshot 2024-04-08 at 3 06 27 
PM](https://github.com/apache/hudi/assets/2179254/9070ea06-7658-4f85-a627-10339de6051c)
   ![Screenshot 2024-04-08 at 3 05 08 
PM](https://github.com/apache/hudi/assets/2179254/88681875-0166-43d1-a69f-e27508506c7c)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [DOCS] Updates slack link across site [hudi]

2024-04-08 Thread via GitHub


bhasudha opened a new pull request, #10981:
URL: https://github.com/apache/hudi/pull/10981

   ### Change Logs
   
   Update the slack link due to expiry of the old one.
   
   ### Impact
   
   Slack link update across website.
   
   ### Risk level (write none, low medium or high below)
   
   low. site update.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-04-08 Thread Jonathan Vexler (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835052#comment-17835052
 ] 

Jonathan Vexler commented on HUDI-6787:
---

{code:java}
root@adhoc-2:/opt# spark-submit \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> $HUDI_UTILITIES_BUNDLE \
>   --table-type COPY_ON_WRITE \
>   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
>   --source-ordering-field ts  \
>   --target-base-path /user/hive/warehouse/stock_ticks_cow \
>   --target-table stock_ticks_cow --props 
> /var/demo/config/kafka-source.properties \
>   --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider
2024-04-08 21:13:35,067 WARN streamer.SchedulerConfGenerator: Job Scheduling 
Configs will not be in effect as spark.scheduler.mode is not set to FAIR at 
instantiation time. Continuing without scheduling configs
2024-04-08 21:13:35,211 INFO spark.SparkContext: Running Spark version 3.2.1
2024-04-08 21:13:35,247 WARN util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2024-04-08 21:13:35,346 INFO resource.ResourceUtils: 
==
2024-04-08 21:13:35,347 INFO resource.ResourceUtils: No custom resources 
configured for spark.driver.
2024-04-08 21:13:35,347 INFO resource.ResourceUtils: 
==
2024-04-08 21:13:35,348 INFO spark.SparkContext: Submitted application: 
streamer-stock_ticks_cow
2024-04-08 21:13:35,383 INFO resource.ResourceProfile: Default ResourceProfile 
created, executor resources: Map(cores -> name: cores, amount: 1, script: , 
vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> 
name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> 
name: cpus, amount: 1.0)
2024-04-08 21:13:35,396 INFO resource.ResourceProfile: Limiting resource is cpu
2024-04-08 21:13:35,396 INFO resource.ResourceProfileManager: Added 
ResourceProfile id: 0
2024-04-08 21:13:35,461 INFO spark.SecurityManager: Changing view acls to: root
2024-04-08 21:13:35,461 INFO spark.SecurityManager: Changing modify acls to: 
root
2024-04-08 21:13:35,462 INFO spark.SecurityManager: Changing view acls groups 
to: 
2024-04-08 21:13:35,462 INFO spark.SecurityManager: Changing modify acls groups 
to: 
2024-04-08 21:13:35,463 INFO spark.SecurityManager: SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(root); groups with view permissions: Set(); users  with modify permissions: 
Set(root); groups with modify permissions: Set()
2024-04-08 21:13:35,512 INFO Configuration.deprecation: 
mapred.output.compression.codec is deprecated. Instead, use 
mapreduce.output.fileoutputformat.compress.codec
2024-04-08 21:13:35,513 INFO Configuration.deprecation: mapred.output.compress 
is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2024-04-08 21:13:35,513 INFO Configuration.deprecation: 
mapred.output.compression.type is deprecated. Instead, use 
mapreduce.output.fileoutputformat.compress.type
2024-04-08 21:13:35,750 INFO util.Utils: Successfully started service 
'sparkDriver' on port 42169.
2024-04-08 21:13:35,789 INFO spark.SparkEnv: Registering MapOutputTracker
2024-04-08 21:13:35,826 INFO spark.SparkEnv: Registering BlockManagerMaster
2024-04-08 21:13:35,848 INFO storage.BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2024-04-08 21:13:35,850 INFO storage.BlockManagerMasterEndpoint: 
BlockManagerMasterEndpoint up
2024-04-08 21:13:35,856 INFO spark.SparkEnv: Registering 
BlockManagerMasterHeartbeat
2024-04-08 21:13:35,879 INFO storage.DiskBlockManager: Created local directory 
at /tmp/blockmgr-2e2fda2c-c1b4-4198-b790-58c00db5af27
2024-04-08 21:13:35,900 INFO memory.MemoryStore: MemoryStore started with 
capacity 366.3 MiB
2024-04-08 21:13:35,915 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2024-04-08 21:13:36,009 INFO util.log: Logging initialized @2972ms to 
org.sparkproject.jetty.util.log.Slf4jLog
2024-04-08 21:13:36,135 INFO server.Server: jetty-9.4.43.v20210629; built: 
2021-06-30T11:07:22.254Z; git: 526006ecfa3af7f1a27ef3a288e2bef7ea9dd7e8; jvm 
1.8.0_212-b04
2024-04-08 21:13:36,162 INFO server.Server: Started @3125ms
2024-04-08 21:13:36,198 INFO server.AbstractConnector: Started 
ServerConnector@3e681bc{HTTP/1.1, (http/1.1)}{0.0.0.0:8090}
2024-04-08 21:13:36,199 INFO util.Utils: Successfully started service 'SparkUI' 
on port 8090.
2024-04-08 21:13:36,241 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@55b62629{/jobs,null,AVAILABLE,@Spark}
2024-04-08 21:13:36,244 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@15f193b8{/jobs/json,null,AVAILABLE,@Spark}
2024-04-08 21:13:36,245 INFO 

Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]

2024-04-08 Thread via GitHub


wkhappy1 commented on issue #10979:
URL: https://github.com/apache/hudi/issues/10979#issuecomment-2043650074

   @ad1happy2go Yes, the table size is 27.1 G; that is the Hudi table in HDFS. 
I also see an RDD cached on disk with a size of 503.8 in the Spark UI. Can the 
cached RDD be smaller? It seems too big.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10965:
URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043606333

   
   ## CI report:
   
   * c8e268903a19c7ecc5cd927fd8afa3332a1c3aea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23133)
 
   * c41af6435281865147967768419da5e4fb688f8b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23153)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


kbuci commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1556395911


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,34 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option<HoodieInstant> instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());
+  try {
+if (!this.heartbeatClient.isHeartbeatExpired(compactionInstantTime)) {
+  throw new HoodieLockException("Cannot compact instant " + 
compactionInstantTime + " due to heartbeat by existing job");
+}
+  } catch (IOException e) {
+throw new HoodieHeartbeatException("Error accessing heartbeat of 
instant to compact " + compactionInstantTime, e);
+  }
+  this.heartbeatClient.start(compactionInstantTime);
+} finally {
+  txnManager.endTransaction(txnManager.getCurrentTransactionOwner());
+}
 preWrite(compactionInstantTime, WriteOperationType.COMPACT, 
table.getMetaClient());
-return tableServiceClient.compact(compactionInstantTime, shouldComplete);
+HoodieWriteMetadata compactMetadata = 
tableServiceClient.compact(compactionInstantTime, shouldComplete);
+this.heartbeatClient.stop(compactionInstantTime, true);

Review Comment:
   I was looking into a UT failure in 
`org.apache.hudi.table.functional.TestHoodieSparkMergeOnReadTableInsertUpdateDelete#testRepeatedRollbackOfCompaction`
 where two compact executions of the same instant time are called back to back 
(my understanding is that this is supposed to verify that the second compact 
does a no-op and succeeds upon seeing that plan is already committed). 
   I realized that with this change, the second compact call was failing due to 
calling `isHeartbeatExpired` and seeing an active heartbeat (from the first 
attempt) still running, despite the fact that here we are stopping the 
heartbeat after successfully completing the compact. The reason that 
`isHeartbeatExpired` was unexpectedly `false` here is that
   1.  `isHeartbeatExpired` will return false if the instant time is too recent, 
even if the heartbeat has been stopped (in the in-memory mapping)
   2. When 
`org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#stop(java.lang.String)` 
is called (by the first compact call in UT) the heartbeat file is deleted and 
the heartbeat in in-memory mapping is stopped (as expected). But this means 
that the heartbeat cannot be started again (even if (1) is resolved), since 
the heartbeat API doesn't allow the caller to start a heartbeat that is present 
in the in-memory mapping and has the heartbeatStopped flag set to true.
   
   In order to get around this issue, I added another API to the heartbeat 
client, similar to stop, except that it removes the desired heartbeat from the 
in-memory mapping (forcing any future compact call in the same job to re-read 
the heartbeat files from DFS and create a new heartbeat in the in-memory 
mapping). Though I'm not sure if there might be a better approach here. I assume 
this existing functionality isn't a bug, as it makes sense for commits that 
cannot be repeatedly re-executed  (like ingestion COMMITs). I wonder if the 
reason that this is causing an issue here stems from the fact that for compact 
we need to potentially repeatedly restart stopped heartbeats, and the 
heartbeat API might not have been intended for this use case?
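   
   A minimal sketch of the workaround described above (names such as 
`stopAndRemove` are hypothetical, not the actual HoodieHeartbeatClient API): 
stop() leaves a tombstone in the in-memory map that blocks a later start() for 
the same instant, so the extra API purges the entry entirely and forces a 
re-read from DFS on the next attempt.
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   
   class HeartbeatClientSketch {
     private static class Heartbeat { volatile boolean stopped; }
     private final Map<String, Heartbeat> heartbeats = new ConcurrentHashMap<>();
   
     void start(String instant) {
       Heartbeat hb = heartbeats.computeIfAbsent(instant, k -> new Heartbeat());
       if (hb.stopped) {
         throw new IllegalStateException("Cannot restart stopped heartbeat for " + instant);
       }
       // ... schedule periodic heartbeat-file updates on DFS ...
     }
   
     void stop(String instant) {
       Heartbeat hb = heartbeats.get(instant);
       if (hb != null) {
         hb.stopped = true;  // tombstone blocks future start() calls
         // ... delete the heartbeat file on DFS ...
       }
     }
   
     // Proposed addition: forget the instant entirely, so a repeated compact
     // attempt in the same process can start a fresh heartbeat.
     void stopAndRemove(String instant) {
       stop(instant);
       heartbeats.remove(instant);
     }
   }
   ```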



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10965:
URL: https://github.com/apache/hudi/pull/10965#issuecomment-2043593538

   
   ## CI report:
   
   * c8e268903a19c7ecc5cd927fd8afa3332a1c3aea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23133)
 
   * c41af6435281865147967768419da5e4fb688f8b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


bvaradar commented on code in PR #10479:
URL: https://github.com/apache/hudi/pull/10479#discussion_r1556399110


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java:
##
@@ -86,7 +86,7 @@ public Option create(String partitionPath, String 
fileName, IOType type, H
   HoodieTimeline pendingReplaceTimeline = 
activeTimeline.filterPendingReplaceTimeline();
   // TODO If current is compact or clustering then create marker directly 
without early conflict detection.
   // Need to support early conflict detection between table service and 
common writers.
-  if (pendingCompactionTimeline.containsInstant(instantTime) || 
pendingReplaceTimeline.containsInstant(instantTime)) {

Review Comment:
   @jonvex : Wouldn't this cause an extra compaction plan read at each writing 
task? Instead, can you see if we can pass this information from the 
driver?
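   
   A rough sketch of that suggestion (illustrative types and names, not the 
actual WriteMarkers signature): the driver resolves once whether the instant 
belongs to a pending compaction or clustering and ships a serializable flag to 
the tasks, so no writing task re-reads the timeline.
   
   ```java
   import java.io.Serializable;
   
   // Computed once on the driver, serialized to every writing task.
   class MarkerContext implements Serializable {
     final boolean isTableServiceInstant;
   
     MarkerContext(boolean isTableServiceInstant) {
       this.isTableServiceInstant = isTableServiceInstant;
     }
   }
   
   class WriteMarkersSketch {
     // Driver side: one timeline read per write, not one per task.
     static MarkerContext resolveOnDriver(boolean pendingCompaction, boolean pendingClustering) {
       return new MarkerContext(pendingCompaction || pendingClustering);
     }
   
     // Task side: decide from the shipped flag, with no timeline access.
     void create(String partitionPath, String fileName, MarkerContext ctx) {
       if (ctx.isTableServiceInstant) {
         // create the marker directly, skipping early conflict detection
       } else {
         // run early conflict detection before creating the marker
       }
     }
   }
   ```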



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


bvaradar commented on code in PR #10479:
URL: https://github.com/apache/hudi/pull/10479#discussion_r1556399888


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##
@@ -516,13 +516,40 @@ public Option<HoodieInstant> getLastClusteringInstant() {
 .findFirst());
   }
 
+  @Override
+  public Option<HoodieInstant> getFirstPendingClusterInstant() {
+return getLastOrFirstPendingClusterInstant(false);
+  }
+
   @Override
   public Option<HoodieInstant> getLastPendingClusterInstant() {
-return  Option.fromJavaOptional(filterPendingReplaceTimeline()
-.getReverseOrderedInstants()
+return getLastOrFirstPendingClusterInstant(true);
+  }
+
+  protected Option<HoodieInstant> getLastOrFirstPendingClusterInstant(boolean 
getLast) {

Review Comment:
   Make this private. 
   
   Rename getLast to isLast



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


kbuci commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1556395911


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,34 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option<HoodieInstant> instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());
+  try {
+if (!this.heartbeatClient.isHeartbeatExpired(compactionInstantTime)) {
+  throw new HoodieLockException("Cannot compact instant " + 
compactionInstantTime + " due to heartbeat by existing job");
+}
+  } catch (IOException e) {
+throw new HoodieHeartbeatException("Error accessing heartbeat of 
instant to compact " + compactionInstantTime, e);
+  }
+  this.heartbeatClient.start(compactionInstantTime);
+} finally {
+  txnManager.endTransaction(txnManager.getCurrentTransactionOwner());
+}
 preWrite(compactionInstantTime, WriteOperationType.COMPACT, 
table.getMetaClient());
-return tableServiceClient.compact(compactionInstantTime, shouldComplete);
+HoodieWriteMetadata compactMetadata = 
tableServiceClient.compact(compactionInstantTime, shouldComplete);
+this.heartbeatClient.stop(compactionInstantTime, true);

Review Comment:
   I was looking into a UT failure in 
`org.apache.hudi.table.functional.TestHoodieSparkMergeOnReadTableInsertUpdateDelete#testRepeatedRollbackOfCompaction`
 where two compact executions of the same instant time are called back to back 
(my understanding is that this is supposed to verify that the second compact 
does a no-op and succeeds upon seeing that plan is already committed). 
   I realized that with this change, the second compact call was failing due to 
calling `isHeartbeatExpired` and seeing an active heartbeat (from the first 
attempt) still running, despite the fact that here we are stopping the 
heartbeat after successfully completing the compact. The reason that 
`isHeartbeatExpired` was unexpectedly `false` here is that
   1.  `isHeartbeatExpired` will return false if the instant time is too recent, 
even if the heartbeat has been stopped (in the in-memory mapping)
   2. When 
`org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#stop(java.lang.String)` 
is called (by the first compact call in UT) the heartbeat file is deleted and 
the heartbeat in in-memory mapping is stopped (as expected). But this means 
that the heartbeat cannot be started again (even if (1) is resolved), since 
the heartbeat API doesn't allow the caller to start a heartbeat that is present 
in the in-memory mapping and has the heartbeatStopped flag set to true.
   
   In order to get around this issue, I added another API to the heartbeat 
client, similar to stop, except that it removes the desired heartbeat from the 
in-memory mapping (forcing any future compact call in the same job to re-read 
the heartbeat files from DFS and create a new heartbeat in the in-memory 
mapping). Though I'm not sure if there might be a better approach here. I assume 
this existing functionality isn't a bug, as it makes sense for commits that 
cannot be repeatedly re-executed  (like ingestion COMMITs), and I assume the 
issue here stems from the fact that for compact we need to potentially 
repeatedly restart stopped heartbeats



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043517085

   
   ## CI report:
   
   * b9b3ae4c3025515e61eca8a7df887eb9fe764b0f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23151)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043429116

   
   ## CI report:
   
   * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23150)
 
   * b9b3ae4c3025515e61eca8a7df887eb9fe764b0f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23151)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043418339

   
   ## CI report:
   
   * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23150)
 
   * b9b3ae4c3025515e61eca8a7df887eb9fe764b0f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]

2024-04-08 Thread via GitHub


ad1happy2go commented on issue #10979:
URL: https://github.com/apache/hudi/issues/10979#issuecomment-2043307873

   @wkhappy1 As you said, the table size is 27.1 G - is it a parquet table?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


jonvex commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043297073

   @bvaradar 
org.apache.hudi.common.table.view.TestHoodieTableFileSystemView#testHoodieTableFileSystemViewWithPendingClustering
 is failing because that test relies on this feature being broken


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-08 Thread via GitHub


kbuci commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1556157237


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,36 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option<HoodieInstant> instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());

Review Comment:
   > 1.2) if the heartbeat does not expire, just cancel the execution of this run 
and log a warning there.
   
   Just to clarify, do you mean throwing an exception in this run? I'm not sure 
if we can make the current run a no-op if a concurrent heartbeat is 
detected, since 
   a) I'm not sure what HoodieWriteMetadata value to return if we make this a 
no-op
   b) If we don't explicitly throw an exception and fail, then the caller will 
assume the compaction happened successfully or already happened. This wouldn't be a 
correct assumption, since the other concurrent writer (that is currently 
executing this compact plan) may either fail or take a long time to finish.
   
   > if the state is still REQUESTED, we can execute it directly?
   Ah that's a good point, it might actually be safe for two jobs to execute a 
compact plan at the same time as long as neither of them is doing a rollback. 
Despite that though, I don't think it's safe to skip acquiring+starting 
heartbeat even if the compact plan only has .requested, since the following 
(unlikely) scenario can still happen:
   1. Table has a compact plan C.requested created in timeline
   2. Job (A) calls compact on C. It starts a heartbeat of C and then starts 
executing C
   3. Job (B) calls compact on C. Although it sees a heartbeat for C, since C 
has no C.inflight it starts executing C
   4. Job (A) and/or Job (B) create a C.inflight
   5. Job (A) fails. 
   6. Heartbeat that Job (A) created expires
   7. Job (C) calls compact on C. It sees that there is a C.inflight and no 
heartbeat (because Job (B) did not start any heartbeat). Therefore, it starts 
executing C, and rolls back the existing C.inflight
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,36 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option<HoodieInstant> instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());

Review Comment:
   > 1.2) if the heartbeat does not expire, just cancel the execution of this run 
and log a warning there.
   
   Just to clarify, do you mean throwing an exception in this run? I'm not sure 
if we can make the current run a no-op if a concurrent heartbeat is 
detected, since 
   a) I'm not sure what HoodieWriteMetadata value to return if we make this a 
no-op
   b) If we don't explicitly throw an exception and fail, then the caller will 
assume the compaction happened successfully or already happened. This wouldn't be a 
correct assumption, since the other concurrent writer (that is currently 
executing this compact plan) may either fail or take a long time to finish.

Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043184042

   
   ## CI report:
   
   * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23150)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6330) Update user document to introduce this feature

2024-04-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-6330:


Assignee: Jing Zhang

> Update user document to introduce this feature
> --
>
> Key: HUDI-6330
> URL: https://issues.apache.org/jira/browse/HUDI-6330
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs, flink
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6330) Update user document to introduce this feature

2024-04-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834990#comment-17834990
 ] 

Raymond Xu commented on HUDI-6330:
--

[~jingzhang] thanks and merged!

> Update user document to introduce this feature
> --
>
> Key: HUDI-6330
> URL: https://issues.apache.org/jira/browse/HUDI-6330
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs, flink
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6330) Update user document to introduce this feature

2024-04-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-6330.

Resolution: Fixed

> Update user document to introduce this feature
> --
>
> Key: HUDI-6330
> URL: https://issues.apache.org/jira/browse/HUDI-6330
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs, flink
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch asf-site updated: [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine (#10977)

2024-04-08 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 72b01a53d3d [HUDI-6330][DOCS] Update user doc to show how to use 
consistent bucket index for Flink engine (#10977)
72b01a53d3d is described below

commit 72b01a53d3d22a51e9210b4b69f368e7388821e4
Author: Jing Zhang 
AuthorDate: Tue Apr 9 00:18:41 2024 +0800

[HUDI-6330][DOCS] Update user doc to show how to use consistent bucket 
index for Flink engine (#10977)
---
 website/docs/sql_dml.md| 80 --
 website/releases/release-0.14.0.md |  4 +-
 2 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
index 90576dcb0e0..edb63730b13 100644
--- a/website/docs/sql_dml.md
+++ b/website/docs/sql_dml.md
@@ -323,12 +323,15 @@ In the below example, we have two streaming ingestion 
pipelines that concurrentl
 pipeline is responsible for the compaction and cleaning table services, while 
the other pipeline is just for data
 ingestion.
 
-```sql
+In order to commit the dataset, the checkpoint needs to be enabled, here is an 
example configuration for a flink-conf.yaml:
+```yaml
 -- set the interval as 30 seconds
 execution.checkpointing.interval: 3
 state.backend: rocksdb
+```
 
--- This is a datagen source that can generates records continuously
+```sql
+-- This is a datagen source that can generate records continuously
 CREATE TABLE sourceT (
 uuid varchar(20),
 name varchar(10),
@@ -349,7 +352,7 @@ CREATE TABLE t1(
 `partition` varchar(20)
 ) WITH (
 'connector' = 'hudi',
-'path' = '/Users/chenyuzhao/workspace/hudi-demo/t1',
+'path' = '${work_path}/hudi-demo/t1',
 'table.type' = 'MERGE_ON_READ',
 'index.type' = 'BUCKET',
 'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
@@ -365,7 +368,7 @@ CREATE TABLE t1_2(
 `partition` varchar(20)
 ) WITH (
 'connector' = 'hudi',
-'path' = '/Users/chenyuzhao/workspace/hudi-demo/t1',
+'path' = '${work_path}/hudi-demo/t1',
 'table.type' = 'MERGE_ON_READ',
 'index.type' = 'BUCKET',
 'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
@@ -390,3 +393,72 @@ and `clean.async.enabled` options are used to disable the 
compaction and cleanin
 This is done to ensure that the compaction and cleaning services are not 
executed twice for the same table.
 
 
+### Consistent hashing index (Experimental)
+
+We have introduced the Consistent Hashing Index since [0.13.0 
release](/releases/release-0.13.0#consistent-hashing-index). In comparison to 
the static hashing index ([Bucket 
Index](/releases/release-0.11.0#bucket-index)), the consistent hashing index 
offers dynamic scalability of data buckets for the writer. 
+You can find the 
[RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) for the 
design of this feature.
+In the 0.13.X release, the Consistent Hashing Index is supported only for 
Spark engine. And since [release 
0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), the index 
is supported for Flink engine.
+
+To utilize this feature, configure the option `index.type` as `BUCKET` and set 
`hoodie.index.bucket.engine` to `CONSISTENT_HASHING`.
+When enabling the consistent hashing index, it's important to enable 
clustering scheduling within the writer. During this process, the writer will 
perform dual writes for both the old and new data buckets while the clustering 
is pending. Although the dual write does not impact correctness, it is strongly 
recommended to execute clustering as quickly as possible.
+
+In the below example, we will create a datagen source and do streaming 
ingestion into Hudi table with consistent bucket index. In order to commit the 
dataset, the checkpoint needs to be enabled, here is an example configuration 
for a flink-conf.yaml:
+```yaml
+-- set the interval as 30 seconds
+execution.checkpointing.interval: 3
+state.backend: rocksdb
+```
+
+```sql
+-- This is a datagen source that can generate records continuously
+CREATE TABLE sourceT (
+uuid varchar(20),
+name varchar(10),
+age int,
+ts timestamp(3),
+`partition` as 'par1'
+) WITH (
+'connector' = 'datagen',
+'rows-per-second' = '200'
+);
+
+-- Create the hudi table with consistent bucket index
+CREATE TABLE t1(
+uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+name VARCHAR(10),
+age INT,
+ts TIMESTAMP(3),
+`partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+'connector'='hudi',
+'path' = '${work_path}/hudi-demo/hudiT',
+'table.type' = 'MERGE_ON_READ',
+'index.type' = 'BUCKET',
+'clustering.schedule.enabled'='true',
+'hoodie.index.bucket.engine'='CONSISTENT_HASHING',
+

Re: [PR] [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine [hudi]

2024-04-08 Thread via GitHub


xushiyan merged PR #10977:
URL: https://github.com/apache/hudi/pull/10977


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043093848

   
   ## CI report:
   
   * 52afba2aa7c6ec4e0f8ca0f50eaf4a0639c53432 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21909)
 
   * 85e5016a10f9908c8116cd950dc46bbf74a8a558 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23149)
 
   * 0a5e5faa01273113cb974e9aa31cfb54d62dff67 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043078181

   
   ## CI report:
   
   * 52afba2aa7c6ec4e0f8ca0f50eaf4a0639c53432 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21909)
 
   * 85e5016a10f9908c8116cd950dc46bbf74a8a558 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23149)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch asf-site updated: [DOCS] Update blogs (#10971)

2024-04-08 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 06eb97ca409 [DOCS] Update blogs (#10971)
06eb97ca409 is described below

commit 06eb97ca4093dc2069eadd008392c71e3a15ef90
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Mon Apr 8 08:35:40 2024 -0700

[DOCS] Update blogs (#10971)
---
 ...olutionary-journey-of-upstoxs-data-platform.mdx |  17 
 ...-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx |  20 +++
 ...3-22-data-lake-cost-optimisation-strategies.mdx |  22 +
 ...able-formats-apache-iceberg-and-apache-hudi.mdx |  19 ++
 ...dexing-apache-hudi-delivers-70-faster-point.mdx |  19 ++
 ...reading-data-from-hudi-tables-joining-delta.mdx |  21 
 ...olutionary-journey-of-upstoxs-data-platform.png | Bin 0 -> 454009 bytes
 ...-Modern-Datalakes-with-Hudi--MinIO--and-HMS.jpg | Bin 0 -> 55802 bytes
 ...3-22-data-lake-cost-optimisation-strategies.png | Bin 0 -> 202437 bytes
 ...able-formats-apache-iceberg-and-apache-hudi.png | Bin 0 -> 488494 bytes
 ...dexing-apache-hudi-delivers-70-faster-point.png | Bin 0 -> 139221 bytes
 ...reading-data-from-hudi-tables-joining-delta.png | Bin 0 -> 92968 bytes
 12 files changed, 118 insertions(+)

diff --git 
a/website/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.mdx
 
b/website/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.mdx
new file mode 100644
index 000..ac2a5a2ad3f
--- /dev/null
+++ 
b/website/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.mdx
@@ -0,0 +1,17 @@
+---
+title: "Navigating the Future: The Evolutionary Journey of Upstox’s Data 
Platform"
+author: Manish Gaurav
+category: blog
+image: 
/assets/images/blog/2024-03-10-navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform.png
+tags:
+- use-case
+- apache hudi
+- upstox-engineering
+---
+
+
+
+import Redirect from '@site/src/components/Redirect';
+
+https://medium.com/upstox-engineering/navigating-the-future-the-evolutionary-journey-of-upstoxs-data-platform-92dc10ff22ae;>Redirecting...
 please wait!! 
+
diff --git 
a/website/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx 
b/website/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx
new file mode 100644
index 000..915b2426f0d
--- /dev/null
+++ b/website/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.mdx
@@ -0,0 +1,20 @@
+---
+title: "Modern Datalakes with Hudi, MinIO, and HMS"
+author: Brenna Buuck
+category: blog
+image: 
/assets/images/blog/2024-03-14-Modern-Datalakes-with-Hudi--MinIO--and-HMS.jpg
+tags:
+- blog
+- apache hudi
+- minio
+- hms
+- hive metastore
+- min
+---
+
+
+
+import Redirect from '@site/src/components/Redirect';
+
+https://blog.min.io/datalakes-with-hudi-and-hms/;>Redirecting... please 
wait!! 
+
diff --git a/website/blog/2024-03-22-data-lake-cost-optimisation-strategies.mdx b/website/blog/2024-03-22-data-lake-cost-optimisation-strategies.mdx
new file mode 100644
index 000..351bf85b25c
--- /dev/null
+++ b/website/blog/2024-03-22-data-lake-cost-optimisation-strategies.mdx
@@ -0,0 +1,22 @@
+---
+title: "Cost Optimization Strategies for scalable Data Lakehouse"
+author: Suresh Hasundi
+category: blog
+image: /assets/images/blog/2024-03-22-data-lake-cost-optimisation-strategies.png
+tags:
+- blog
+- apache hudi
+- amazon s3
+- amazon emr
+- apache spark
+- lakehouse
+- cost optimization
+- halodoc
+---
+
+
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect url="https://blogs.halodoc.io/data-lake-cost-optimisation-strategies/">Redirecting... please wait!!</Redirect>
+
diff --git a/website/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.mdx b/website/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.mdx
new file mode 100644
index 000..0a3e0050139
--- /dev/null
+++ b/website/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.mdx
@@ -0,0 +1,19 @@
+---
+title: "Options on Kafka sink to open table Formats: Apache Iceberg and Apache Hudi"
+author: Albert Wong
+category: blog
+image: /assets/images/blog/2024-03-23-options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi.png
+tags:
+- blog
+- apache hudi
+- apache iceberg
+- apache Kafka
+- kafka connect
+- starrocks
+- devgenius
+---
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect url="https://blog.devgenius.io/options-on-kafka-sink-to-open-table-formats-apache-iceberg-and-apache-hudi-f6839ddad978">Redirecting... please wait!!</Redirect>
+
diff --git 

Re: [PR] [DOCS] Update blogs [hudi]

2024-04-08 Thread via GitHub


xushiyan merged PR #10971:
URL: https://github.com/apache/hudi/pull/10971


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10479:
URL: https://github.com/apache/hudi/pull/10479#issuecomment-2043060978

   
   ## CI report:
   
   * 52afba2aa7c6ec4e0f8ca0f50eaf4a0639c53432 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21909)
 
   * 85e5016a10f9908c8116cd950dc46bbf74a8a558 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] add partitionPath as an instance variable to HoodieBaseFile and HoodieLogFile [hudi]

2024-04-08 Thread via GitHub


the-other-tim-brown commented on PR #10975:
URL: https://github.com/apache/hudi/pull/10975#issuecomment-2042932190

   > > > Can you explain why?
   > > 
   > > 
   > > Because it represents a "File"; the partition notion belongs to the table, and was first introduced by Hive to resolve scalability issues.
   > 
   > OK, then why does it contain the commit and file group? The logic presented here does not seem to apply to the existing class. This class contains metadata relevant to grouping the file with other related files.
   > 
   > If the issue is that this is too big a change to introduce, I can look for other options, but I think there needs to be some consistency in what is added to these classes. @nsivabalan and @yihua, let me know what you would prefer as well.
   
   I can also put up a draft where I limit the changes to the `AbstractTableFileSystemView` and some of the supporting utils, which will decrease the size of the PR, if we want to punt on this discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-08 Thread via GitHub


the-other-tim-brown commented on code in PR #10976:
URL: https://github.com/apache/hudi/pull/10976#discussion_r1555936753


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi
    */
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
+    this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List<String> getPendingReplaceInstants() {
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
   Regarding threading, should we just make this whole method synchronized?
   
   Regarding caching, I'm open to whatever seems best. I noticed that the cache 
in HoodieDefaultTimeline is limited to a single instance and not a global 
cache. It seems like it may make sense long term to return the same instance 
whenever possible to benefit from this cache.
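   For illustration, a minimal sketch of the synchronized, hash-keyed cache under discussion. The field name and javadoc come from the snippet above; the timeline calls (`getTimelineHash`, `filterPendingReplaceTimeline`, `getInstantsAsStream`) are assumptions about the surrounding code, not the final PR:

```java
// Sketch of members inside AbstractTableFileSystemView; imports assumed:
// java.util.List, java.util.stream.Collectors,
// org.apache.hudi.common.table.timeline.HoodieActiveTimeline / HoodieInstant,
// org.apache.hudi.common.util.collection.Pair.

// Cache entry pairs the active timeline's hash with the computed instant list;
// refreshTimeline(...) resets it to null to invalidate it.
private volatile Pair<String, List<String>> timelineHashAndPendingReplaceInstants;

private synchronized List<String> getPendingReplaceInstants() {
  HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
  String timelineHash = activeTimeline.getTimelineHash();
  Pair<String, List<String>> cached = timelineHashAndPendingReplaceInstants;
  // recompute only when the cache is empty or the timeline has changed
  if (cached == null || !cached.getLeft().equals(timelineHash)) {
    List<String> pending = activeTimeline.filterPendingReplaceTimeline()
        .getInstantsAsStream()
        .map(HoodieInstant::getTimestamp)
        .collect(Collectors.toList());
    cached = Pair.of(timelineHash, pending);
    timelineHashAndPendingReplaceInstants = cached;
  }
  return cached.getRight();
}
```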



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]The number of tasks in each distinct stage of building workload profile is always 60 [hudi]

2024-04-08 Thread via GitHub


MrAladdin closed issue #10972: [SUPPORT]The number of tasks in each distinct 
stage of building workload profile is always 60
URL: https://github.com/apache/hudi/issues/10972


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]The number of tasks in each distinct stage of building workload profile is always 60 [hudi]

2024-04-08 Thread via GitHub


MrAladdin commented on issue #10972:
URL: https://github.com/apache/hudi/issues/10972#issuecomment-2042898188

   > @MrAladdin Can you provide the writer configurations you are using?
   
   Sorry, I forgot to turn off "hoodie.metadata.index.async".
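   For reference, `hoodie.metadata.index.async` turns on asynchronous building of metadata table indexes, which is the option the reporter forgot to disable. A minimal sketch of keeping it off on a Spark datasource write; the table name, key fields, and path are placeholders, not taken from the issue:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class MetadataIndexConfigSketch {
  static void write(Dataset<Row> df) {
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")                    // placeholder
        .option("hoodie.datasource.write.recordkey.field", "uuid")  // placeholder
        .option("hoodie.datasource.write.precombine.field", "ts")   // placeholder
        .option("hoodie.metadata.enable", "true")
        .option("hoodie.metadata.index.async", "false")             // the option discussed above
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table");                                // placeholder
  }
}
```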


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042891233

   
   ## CI report:
   
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Nested object support in Hudi Table using Flink [hudi]

2024-04-08 Thread via GitHub


waytoharish closed issue #10895: Nested object support in Hudi Table using Flink
URL: https://github.com/apache/hudi/issues/10895


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Nested object support in Hudi Table using Flink [hudi]

2024-04-08 Thread via GitHub


waytoharish commented on issue #10895:
URL: https://github.com/apache/hudi/issues/10895#issuecomment-2042890195

   Thanks @ad1happy2go @danny0405, it worked for me after using GenericRowData.
   
   I am closing the issue.
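   For future readers, a minimal sketch of the approach, assuming a sink schema like `ROW<uuid STRING, address ROW<street STRING, city STRING>>`; names and values are illustrative:

```java
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;

class NestedRowDataSketch {
  // Builds a RowData whose second field is itself a nested RowData.
  static RowData buildNestedRow() {
    GenericRowData address = new GenericRowData(2);
    address.setField(0, StringData.fromString("221B Baker Street"));
    address.setField(1, StringData.fromString("London"));

    GenericRowData row = new GenericRowData(2);
    row.setField(0, StringData.fromString("uuid-001"));
    row.setField(1, address); // a nested struct is just another RowData
    return row;
  }
}
```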


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] add partitionPath as an instance variable to HoodieBaseFile and HoodieLogFile [hudi]

2024-04-08 Thread via GitHub


the-other-tim-brown commented on PR #10975:
URL: https://github.com/apache/hudi/pull/10975#issuecomment-2042690168

   > > Can you explain why?
   > 
   > Because it represents a "File"; the partition notion belongs to the table, and was first introduced by Hive to resolve scalability issues.
   
   OK, then why does it contain the commit and file group? The logic presented here does not seem to apply to the existing class. This class contains metadata relevant to grouping the file with other related files.
   
   If the issue is that this is too big a change to introduce, I can look for other options, but I think there needs to be some consistency in what is added to these classes. @nsivabalan and @yihua, let me know what you would prefer as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]

2024-04-08 Thread via GitHub


wkhappy1 commented on issue #10979:
URL: https://github.com/apache/hudi/issues/10979#issuecomment-2042672365

   @ad1happy2go The input data is a DataFrame computed from other tables.
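   For context, a minimal sketch of such a pipeline; the SQL, key fields, table name, and path are placeholders, not taken from the issue:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

class InsertOverwriteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("insert-overwrite-demo").getOrCreate();
    // the input is computed from other tables, as described above
    Dataset<Row> df = spark.sql("SELECT a.id, a.ts, b.attr FROM src_a a JOIN src_b b ON a.id = b.id");
    df.write().format("hudi")
        .option("hoodie.datasource.write.operation", "insert_overwrite_table")
        .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
        .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
        .option("hoodie.table.name", "target_table")               // placeholder
        .mode(SaveMode.Append)
        .save("/tmp/hudi/target_table");                           // placeholder
  }
}
```

   Caching or checkpointing the computed DataFrame before the write can help separate upstream compute time from the Hudi write time when diagnosing slowness.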


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042643487

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
 
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042627787

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
 
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1555662848


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   Not exactly.
   The behavior is consistent with the old behavior.
   
   [screenshot: https://github.com/apache/hudi/assets/1525333/e254eab0-9c22-4658-a4a5-cc8faae9d2af]
   
   [screenshot: https://github.com/apache/hudi/assets/1525333/438c9ee9-1189-4928-9c48-e102625c5967]
   
   In the above pictures, if `config.populateMetaFields()` is true for the compaction job, the `oldSchema` equals `writeSchemaWithMetaFields`.
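   To make the cheap path concrete: when the old base file already carries the writer schema with meta fields, only the Hudi bookkeeping columns need to change, so the copy can mutate the record read from the old file instead of rewriting every data field into a new one. A minimal sketch, with a hypothetical helper name and a subset of the meta fields:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieRecord;

class MetaFieldRefreshSketch {
  // Refresh only the Hudi meta columns in place; data columns are carried over untouched.
  static GenericRecord refreshMetaFields(GenericRecord oldRecord, String newCommitTime, String newFileName) {
    oldRecord.put(HoodieRecord.COMMIT_TIME_METADATA_FIELD, newCommitTime);
    oldRecord.put(HoodieRecord.FILENAME_METADATA_FIELD, newFileName);
    return oldRecord;
  }
}
```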



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10977:
URL: https://github.com/apache/hudi/pull/10977#discussion_r1555701329


##
website/docs/sql_dml.md:
##
@@ -390,3 +390,70 @@ and `clean.async.enabled` options are used to disable the compaction and cleanin
 This is done to ensure that the compaction and cleaning services are not executed twice for the same table.
 
 
+### Consistent hashing index (Experimental)
+
+We have introduced the Consistent Hashing Index since the [0.13.0 release](/releases/release-0.13.0#consistent-hashing-index). In comparison to the static hashing index ([Bucket Index](/releases/release-0.11.0#bucket-index)), the consistent hashing index offers dynamic scalability of data buckets for the writer.
+You can find the [RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) for the design of this feature.
+In the 0.13.X releases, the Consistent Hashing Index is supported only for the Spark engine. Since [release 0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), the index is also supported for the Flink engine.
+
+In the example below, we have a streaming ingestion pipeline that writes to a table with the consistent bucket index.
+To utilize this feature, configure the option `index.type` as `BUCKET` and set `hoodie.index.bucket.engine` to `CONSISTENT_HASHING`.
+When enabling the consistent hashing index, it's important to enable clustering scheduling within the writer. During this process, the writer will perform dual writes to both the old and new data buckets while the clustering is pending. Although the dual write does not impact correctness, it is strongly recommended to execute clustering as quickly as possible.
+
+```sql
+-- set the interval as 30 seconds
+execution.checkpointing.interval: 30000
+state.backend: rocksdb
+
+-- This is a datagen source that generates records continuously
+CREATE TABLE sourceT (
+  uuid varchar(20),
+  name varchar(10),
+  age int,
+  ts timestamp(3),
+  `partition` as 'par1'
+) WITH (
+  'connector' = 'datagen',
+  'rows-per-second' = '200'
+);
+
+-- Create the hudi table with consistent bucket index
+CREATE TABLE t1 (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${work_path}/hudi-demo/hudiT',
+  'table.type' = 'MERGE_ON_READ',
+  'index.type' = 'BUCKET',
+  'clustering.schedule.enabled' = 'true',
+  'hoodie.index.bucket.engine' = 'CONSISTENT_HASHING',
+  'hoodie.clustering.plan.strategy.class' = 'org.apache.hudi.client.clustering.plan.strategy.FlinkConsistentBucketClusteringPlanStrategy',
+  'hoodie.clustering.execution.strategy.class' = 'org.apache.hudi.client.clustering.run.strategy.SparkConsistentBucketClusteringExecutionStrategy',
+  'hoodie.bucket.index.num.buckets' = '8',
+  'hoodie.bucket.index.max.num.buckets' = '128',
+  'hoodie.bucket.index.min.num.buckets' = '8',
+  'hoodie.bucket.index.split.threshold' = '1.5',
+  'write.tasks' = '2'
+);
+
+-- submit the pipelines
+insert into t1 select * from sourceT;
+
+select * from t1 limit 20;
+```
+
+:::caution
+Consistent Hashing Index is supported for the Flink engine since [release 0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), and currently there are some limitations to using it as of 0.14.0:
+
+- This index is supported only for MOR tables. This limitation exists even when using the Spark engine.
+- It does not work with the metadata table enabled. This limitation exists even when using the Spark engine.
+- The consistent hashing index does not work with bulk insert using the Flink engine yet; please use the simple bucket index or the Spark engine for bulk-insert pipelines.
+- The resize plan generated by the Flink engine does not support merging small file groups yet; it only supports splitting large file groups.
+- The resize plan should be executed through an offline Spark job.

Review Comment:
   The Flink engine does not support executing the resize plan yet.
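   For readers following along: as the quoted doc says, the pending resize (clustering) plan has to be executed by an offline Spark job. A hedged sketch of such an invocation; the bundle version, paths, and table name are placeholders, and the exact `HoodieClusteringJob` flags may vary by release:

```sh
# Execute the clustering (resize) plan that the Flink writer scheduled.
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --mode execute \
  --base-path /path/to/hudi-demo/hudiT \
  --table-name t1 \
  --spark-memory 4g
```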



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]

2024-04-08 Thread via GitHub


ad1happy2go commented on issue #10979:
URL: https://github.com/apache/hudi/issues/10979#issuecomment-2042560266

   @wkhappy1 What is the format of your input data?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6330][DOCS] Update user doc to show how to use consistent bucket index for Flink engine [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10977:
URL: https://github.com/apache/hudi/pull/10977#discussion_r1555696442


##
website/docs/sql_dml.md:
##
@@ -390,3 +390,70 @@ and `clean.async.enabled` options are used to disable the compaction and cleanin
 This is done to ensure that the compaction and cleaning services are not executed twice for the same table.
 
 
+### Consistent hashing index (Experimental)
+
+We have introduced the Consistent Hashing Index since the [0.13.0 release](/releases/release-0.13.0#consistent-hashing-index). In comparison to the static hashing index ([Bucket Index](/releases/release-0.11.0#bucket-index)), the consistent hashing index offers dynamic scalability of data buckets for the writer.
+You can find the [RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) for the design of this feature.
+In the 0.13.X releases, the Consistent Hashing Index is supported only for the Spark engine. Since [release 0.14.0](/releases/release-0.14.0#consistent-hashing-index-support), the index is also supported for the Flink engine.
+
+In the example below, we have a streaming ingestion pipeline that writes to a table with the consistent bucket index.
+To utilize this feature, configure the option `index.type` as `BUCKET` and set `hoodie.index.bucket.engine` to `CONSISTENT_HASHING`.
+When enabling the consistent hashing index, it's important to enable clustering scheduling within the writer. During this process, the writer will perform dual writes to both the old and new data buckets while the clustering is pending. Although the dual write does not impact correctness, it is strongly recommended to execute clustering as quickly as possible.
+
+```sql
+-- set the interval as 30 seconds
+execution.checkpointing.interval: 30000
+state.backend: rocksdb
+
+-- This is a datagen source that generates records continuously

Review Comment:
   I prefer to add the source table schema here in order to keep the demo complete. Then users can conveniently copy the complete demo and run it in the SQL Client.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


