Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


jonvex merged PR #10915:
URL: https://github.com/apache/hudi/pull/10915


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


jonvex commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2111382582

   https://github.com/apache/hudi/assets/26940621/04586717-9262-41f9-9c1b-bb848e5c184c;>
   azure is actually passing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


bvaradar commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1600707506


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:
##
@@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
 val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2))
 inputDF7.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
+totalInsertedCnt += 7
 
 val records8 = recordsToStrings(dataGen.generateInserts("007", 
3)).asScala.toList
 val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2))
 inputDF8.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
 val instant8 = metaClient.reloadActiveTimeline.lastInstant().get()
 val commitTime8 = instant8.getTimestamp
+totalInsertedCnt += 3
 
 // 8. Upsert Operation With Clean Operation
-val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 
30)).asScala.toList
-val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2))
+val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after 
insert overwrite table. if not for this, updates generated from datagne,
+// could split as inserts and updates from hudi standpoint due to insert 
overwrite table operation.
 inputDF9.write.format("org.apache.hudi")
   .options(options)
   .option("hoodie.clean.automatic", "true")
-  .option("hoodie.keep.min.commits", "4")
-  .option("hoodie.keep.max.commits", "5")
-  .option("hoodie.clean.commits.retained", "3")
+  .option("hoodie.keep.min.commits", "16")

Review Comment:
   Sounds good @nsivabalan . Can you file a jira to track this (if not already)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1600668418


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:
##
@@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
 val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2))
 inputDF7.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
+totalInsertedCnt += 7
 
 val records8 = recordsToStrings(dataGen.generateInserts("007", 
3)).asScala.toList
 val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2))
 inputDF8.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
 val instant8 = metaClient.reloadActiveTimeline.lastInstant().get()
 val commitTime8 = instant8.getTimestamp
+totalInsertedCnt += 3
 
 // 8. Upsert Operation With Clean Operation
-val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 
30)).asScala.toList
-val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2))
+val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after 
insert overwrite table. if not for this, updates generated from datagne,
+// could split as inserts and updates from hudi standpoint due to insert 
overwrite table operation.
 inputDF9.write.format("org.apache.hudi")
   .options(options)
   .option("hoodie.clean.automatic", "true")
-  .option("hoodie.keep.min.commits", "4")
-  .option("hoodie.keep.max.commits", "5")
-  .option("hoodie.clean.commits.retained", "3")
+  .option("hoodie.keep.min.commits", "16")

Review Comment:
   hey @bvaradar : I had a long discussion w/ @danny0405  on this. 
   here is the issue:
   our data table archival polls for compaction commit and does something based 
on that. Before this patch, we had a bug in canSchedule call wrt compaction 
scheduling. This patched fixed the issue as you might know. 
   
   But this specific test started to fail w/ FileNotFound. Apparently in 
HoodieCDCExtractor, we parse all entries in commit metadata and [poll file 
status/listStatus](https://github.com/apache/hudi/blob/c8dec0ef523e998b80838af9d52323e3ae95cebf/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L331)
 for them.
   
   Prior to this patch, due to archival behavior, every commit in active 
timeline is uncleaned and we were good. after this patch, apparently archival 
is trailing, and so we do have some commits in active timeline which are 
cleaned up and hence the fileStatus/listStatus polling results in FileNotFound 
Issue. 
   
   Danny agreed that this is a long pending/known limitation and we could 
definitely improve user exp by throwing some legible exception/error msg. But 
it increases the scope of this patch. so, we agreed to tweak the test so that 
it will not hit FileNotFound issue. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1600668418


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:
##
@@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
 val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2))
 inputDF7.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
+totalInsertedCnt += 7
 
 val records8 = recordsToStrings(dataGen.generateInserts("007", 
3)).asScala.toList
 val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2))
 inputDF8.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
 val instant8 = metaClient.reloadActiveTimeline.lastInstant().get()
 val commitTime8 = instant8.getTimestamp
+totalInsertedCnt += 3
 
 // 8. Upsert Operation With Clean Operation
-val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 
30)).asScala.toList
-val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2))
+val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after 
insert overwrite table. if not for this, updates generated from datagne,
+// could split as inserts and updates from hudi standpoint due to insert 
overwrite table operation.
 inputDF9.write.format("org.apache.hudi")
   .options(options)
   .option("hoodie.clean.automatic", "true")
-  .option("hoodie.keep.min.commits", "4")
-  .option("hoodie.keep.max.commits", "5")
-  .option("hoodie.clean.commits.retained", "3")
+  .option("hoodie.keep.min.commits", "16")

Review Comment:
   hey @bvaradar : I had a long discussion w/ @danny0405  on this. 
   here is the issue:
   our data table archival polls for compaction commit and does something based 
on that. Before this patch, we had a bug no canSchedule call wrt compaction 
scheduling. This patched fixed the issue as you might know. 
   
   But this specific test started to fail w/ FileNotFound. Apparently in 
HoodieCDCExtractor, we parse all entries in commit metadata and [poll file 
status/listStatus](https://github.com/apache/hudi/blob/c8dec0ef523e998b80838af9d52323e3ae95cebf/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L331)
 for them.
   
   Prior to this patch, due to archival behavior, every commit in active 
timeline is uncleaned and we were good. after this patch, apparently archival 
is trailing, and so we do have some commits in active timeline which are 
cleaned up and hence the fileStatus/listStatus polling results in FileNotFound 
Issue. 
   
   Danny agreed that this is a long pending/known limitation and we could 
definitely improve user exp by throwing some legible exception/error msg. But 
it increases the scope of this patch. so, we agreed to tweak the test so that 
it will not hit FileNotFound issue. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


bvaradar commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1600600472


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:
##
@@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
 val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2))
 inputDF7.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
+totalInsertedCnt += 7
 
 val records8 = recordsToStrings(dataGen.generateInserts("007", 
3)).asScala.toList
 val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2))
 inputDF8.write.format("org.apache.hudi")
   .options(options)
+  .option("hoodie.compact.inline", "false")
   .mode(SaveMode.Append)
   .save(basePath)
 val instant8 = metaClient.reloadActiveTimeline.lastInstant().get()
 val commitTime8 = instant8.getTimestamp
+totalInsertedCnt += 3
 
 // 8. Upsert Operation With Clean Operation
-val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 
30)).asScala.toList
-val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2))
+val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after 
insert overwrite table. if not for this, updates generated from datagne,
+// could split as inserts and updates from hudi standpoint due to insert 
overwrite table operation.
 inputDF9.write.format("org.apache.hudi")
   .options(options)
   .option("hoodie.clean.automatic", "true")
-  .option("hoodie.keep.min.commits", "4")
-  .option("hoodie.keep.max.commits", "5")
-  .option("hoodie.clean.commits.retained", "3")
+  .option("hoodie.keep.min.commits", "16")

Review Comment:
   @nsivabalan : Why did we change this test along with this PR ? I am not able 
to follow the comments added inline. 
   cc @yihua 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2111018843

   
   ## CI report:
   
   * 21fca8151efa9b404eac415e7432606f1bfb9658 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23928)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110945729

   
   ## CI report:
   
   * 704439adb77ebe779febb07b529df5ca280f19d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23920)
 
   * 21fca8151efa9b404eac415e7432606f1bfb9658 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23928)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110924294

   
   ## CI report:
   
   * 704439adb77ebe779febb07b529df5ca280f19d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23920)
 
   * 21fca8151efa9b404eac415e7432606f1bfb9658 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110602326

   
   ## CI report:
   
   * 704439adb77ebe779febb07b529df5ca280f19d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23920)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110586327

   
   ## CI report:
   
   * ae80da44ef0b736d26626a961aaac8fd017e80a8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23295)
 
   * 704439adb77ebe779febb07b529df5ca280f19d1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-02 Thread via GitHub


yihua commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2092034933

   @nsivabalan could you rebase the PR on the latest master and address the 
review comments?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   nope. I am saying HoodieCDCExtractor.java has some inherent bug which I am 
trying to fix here. 
   
   lets say timeline is as follows
   
   dc1
   dc2
   rc3
   dc4
   clean5 // cleans up data files from dc1 and dc2 since it was replaced by 
rc3. 
   
   as per master, HoodieCDCExtractor goes over commit metadata in 
activetimeline and tries to deduce base files for log files found. In this 
case, all data files from dc1 and dc2 are already deleted by clean5. but 
HoodieCDCExtractor tries to parse data files from dc1 and dc2 and so we might 
hit file not found issue as per master. 
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   nope. I am saying HoodieCDCExtractor.java has some inherent bug which I am 
trying to fix here. 
   
   lets say timeline is as follows
   
   dc1
   dc2
   rc3
   dc4
   clean5 // cleans up data files from dc1 and dc2 since it was replaced by 
rc3. 
   
   as per master, HoodieCDCExtractor goes over commit metadata in 
activetimeline and tries to deduce base files for log files found. In this 
case, all data files from dc1 and dc2 are already deleted by clean5. And so we 
might hit file not found issue as per master. 
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2060202131

   
   ## CI report:
   
   * ae80da44ef0b736d26626a961aaac8fd017e80a8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23295)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2060162686

   
   ## CI report:
   
   * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996)
 
   * ae80da44ef0b736d26626a961aaac8fd017e80a8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23295)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


danny0405 commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1568087606


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   > we are looking to parse and fetch base files for dc1 and dc2 as well which 
could have been cleaned up by the cleaner and may not exists.
   
   Are you saying the cleaning state has changed because of the archiving? If 
that is true, maybe we should just adjust the cleaning strategy to keep more 
historical commits.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2060153308

   
   ## CI report:
   
   * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996)
 
   * ae80da44ef0b736d26626a961aaac8fd017e80a8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1568072027


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   hey @danny0405 @yihua : Can you folks rope in someone to assist w/ CDC fixes 
here. 
   I am not sure if this is the right fix. 
   
   but let me go over the gist. 
   There was a test failure which I was trying to chase and realized we might 
have some gaps here. 
   
   I am not going go to over exact scenario that the test failed. but the gist 
is this.
   
   dc1
   dc2
   rc3
   dc4
   clean5
   
   in this case, we are looking to parse and fetch base files for dc1 and dc2 
as well which could have been cleaned up by the cleaner and may not exists. 
   
   Before this fix, the test was set up such that, timeline was 
   rc3
   dc4
   clean5
   
   and so we did not hit this issue. When I made the fix to ensure 
getLatestDeltaCommits does the right thing (which archival is using btw), we 
ended up w/ timeline as 
   dc1
   dc2
   rc3
   dc4
   clean5
   
   and   
   ```
   private HoodieCDCFileSplit parseWriteStat(
 HoodieFileGroupId fileGroupId,
 HoodieInstant instant,
 HoodieWriteStat writeStat,
 WriteOperationType operation) {
   ``` 
   in HoodieCDCExtractor is failing. 
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1568072805


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   So, I tried making a fix. 
   first collect all replacedFileIds. and then ignore those while parsing other 
commit metadata. 
   
   I was able to get past the parsing issue. but assertions for record counts 
are failing. probably anyone who has worked on CDC might be better to fix this. 
   
   Test case of interest: 
   TestCDCDataFrameSuite.testMORDataSourceWrite
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-16 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1568072027


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   hey @danny0405 @yihua : Can you folks rope in someone to assist w/ CDC fixes 
here. 
   I am not sure if this is the right fix. 
   
   but let me go over the gist. 
   There was a test failure which I was trying to chase and realized we might 
have some gaps here. 
   
   I am not going go to over exact scenario that the test failed. but the gist 
is this.
   
   dc1
   dc2
   rc3
   dc4
   clean5
   
   in this case, we are looking to parse and fetch base files for dc1 and dc2 
as well which could have been cleaned up by the cleaner and may not exists. 
   
   Before this fix, the test was set up such that, timeline was 
   rc3
   dc4
   clean5
   
   and so we did not hit this issue. When I made the fix to ensure 
getLatestDeltaCommits does the right thing (which archival is using btw), we 
ended up w/ timeline as 
   dc1
   dc2
   rc3
   dc4
   clean5
   
   and   private HoodieCDCFileSplit parseWriteStat(
 HoodieFileGroupId fileGroupId,
 HoodieInstant instant,
 HoodieWriteStat writeStat,
 WriteOperationType operation) {
   in HoodieCDCExtractor is failing. 
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-03 Thread via GitHub


codope commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1549072910


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##
@@ -147,6 +147,12 @@ public HoodieTimeline 
filterCompletedAndCompactionInstants() {
 || s.getAction().equals(HoodieTimeline.COMPACTION_ACTION)), 
details);
   }
 
+  @Override
+  public HoodieTimeline filterCompletedCommitInstants() {

Review Comment:
   Let's unit test this.



##
hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java:
##
@@ -285,8 +285,7 @@ public static List 
getPendingCompactionInstantTimes(HoodieTableMe
*/
   public static Option> 
getDeltaCommitsSinceLatestCompaction(
   HoodieActiveTimeline activeTimeline) {
-Option lastCompaction = activeTimeline.getCommitTimeline()
-.filterCompletedInstants().lastInstant();
+Option lastCompaction = 
activeTimeline.filterCompletedCommitInstants().lastInstant();

Review Comment:
   This change is causing a test failure -  
`TestCDCDataFrameSuite.testMORDataSourceWrite`. Can you please check that?
   Also, I think instead of adding yet another API, we can clean up the 
existing ones.
   For instance, have separate method for each action - `getTimeline`. 
So, `getCommitTimeline` only checks for  COMMIT action.



##
hudi-common/src/test/java/org/apache/hudi/common/util/TestCompactionUtils.java:
##
@@ -272,6 +272,59 @@ public void 
testGetDeltaCommitsSinceLatestCompaction(boolean hasCompletedCompact
 }
   }
 
+  @Test
+  public void 
testGetDeltaCommitsSinceLastCompactionWithCompletedReplaceCommits() {

Review Comment:
   +1 for the test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-03-25 Thread via GitHub


danny0405 commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1538463640


##
hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java:
##
@@ -285,8 +285,7 @@ public static List 
getPendingCompactionInstantTimes(HoodieTableMe
*/
   public static Option> 
getDeltaCommitsSinceLatestCompaction(
   HoodieActiveTimeline activeTimeline) {
-Option lastCompaction = activeTimeline.getCommitTimeline()
-.filterCompletedInstants().lastInstant();
+Option lastCompaction = 
activeTimeline.filterCompletedCommitInstants().lastInstant();

Review Comment:
   And we better not reuse the action name for replace commit and compaction 
commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-03-25 Thread via GitHub


bvaradar commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1538107857


##
hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java:
##
@@ -285,8 +285,7 @@ public static List 
getPendingCompactionInstantTimes(HoodieTableMe
*/
   public static Option> 
getDeltaCommitsSinceLatestCompaction(
   HoodieActiveTimeline activeTimeline) {
-Option lastCompaction = activeTimeline.getCommitTimeline()
-.filterCompletedInstants().lastInstant();
+Option lastCompaction = 
activeTimeline.filterCompletedCommitInstants().lastInstant();

Review Comment:
   @nsivabalan : Can you rename getCommitTimeline() to 
getCommitAndReplaceTimeline() and then add an API getCommitOnlyTimeline(). 
   The current naming of these APIs are very problematic and can easily cause 
bugs. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2016212906

   
   ## CI report:
   
   * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2016112557

   
   ## CI report:
   
   * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2016092933

   
   ## CI report:
   
   * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org