Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
jonvex merged PR #10915: URL: https://github.com/apache/hudi/pull/10915 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
jonvex commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2111382582 https://github.com/apache/hudi/assets/26940621/04586717-9262-41f9-9c1b-bb848e5c184c;> azure is actually passing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
bvaradar commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1600707506 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala: ## @@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase { val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2)) inputDF7.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) +totalInsertedCnt += 7 val records8 = recordsToStrings(dataGen.generateInserts("007", 3)).asScala.toList val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2)) inputDF8.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) val instant8 = metaClient.reloadActiveTimeline.lastInstant().get() val commitTime8 = instant8.getTimestamp +totalInsertedCnt += 3 // 8. Upsert Operation With Clean Operation -val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 30)).asScala.toList -val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2)) +val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after insert overwrite table. if not for this, updates generated from datagne, +// could split as inserts and updates from hudi standpoint due to insert overwrite table operation. inputDF9.write.format("org.apache.hudi") .options(options) .option("hoodie.clean.automatic", "true") - .option("hoodie.keep.min.commits", "4") - .option("hoodie.keep.max.commits", "5") - .option("hoodie.clean.commits.retained", "3") + .option("hoodie.keep.min.commits", "16") Review Comment: Sounds good @nsivabalan . Can you file a jira to track this (if not already) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1600668418 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala: ## @@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase { val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2)) inputDF7.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) +totalInsertedCnt += 7 val records8 = recordsToStrings(dataGen.generateInserts("007", 3)).asScala.toList val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2)) inputDF8.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) val instant8 = metaClient.reloadActiveTimeline.lastInstant().get() val commitTime8 = instant8.getTimestamp +totalInsertedCnt += 3 // 8. Upsert Operation With Clean Operation -val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 30)).asScala.toList -val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2)) +val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after insert overwrite table. if not for this, updates generated from datagne, +// could split as inserts and updates from hudi standpoint due to insert overwrite table operation. inputDF9.write.format("org.apache.hudi") .options(options) .option("hoodie.clean.automatic", "true") - .option("hoodie.keep.min.commits", "4") - .option("hoodie.keep.max.commits", "5") - .option("hoodie.clean.commits.retained", "3") + .option("hoodie.keep.min.commits", "16") Review Comment: hey @bvaradar : I had a long discussion w/ @danny0405 on this. here is the issue: our data table archival polls for compaction commit and does something based on that. Before this patch, we had a bug in canSchedule call wrt compaction scheduling. This patched fixed the issue as you might know. But this specific test started to fail w/ FileNotFound. Apparently in HoodieCDCExtractor, we parse all entries in commit metadata and [poll file status/listStatus](https://github.com/apache/hudi/blob/c8dec0ef523e998b80838af9d52323e3ae95cebf/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L331) for them. Prior to this patch, due to archival behavior, every commit in active timeline is uncleaned and we were good. after this patch, apparently archival is trailing, and so we do have some commits in active timeline which are cleaned up and hence the fileStatus/listStatus polling results in FileNotFound Issue. Danny agreed that this is a long pending/known limitation and we could definitely improve user exp by throwing some legible exception/error msg. But it increases the scope of this patch. so, we agreed to tweak the test so that it will not hit FileNotFound issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1600668418 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala: ## @@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase { val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2)) inputDF7.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) +totalInsertedCnt += 7 val records8 = recordsToStrings(dataGen.generateInserts("007", 3)).asScala.toList val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2)) inputDF8.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) val instant8 = metaClient.reloadActiveTimeline.lastInstant().get() val commitTime8 = instant8.getTimestamp +totalInsertedCnt += 3 // 8. Upsert Operation With Clean Operation -val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 30)).asScala.toList -val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2)) +val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after insert overwrite table. if not for this, updates generated from datagne, +// could split as inserts and updates from hudi standpoint due to insert overwrite table operation. inputDF9.write.format("org.apache.hudi") .options(options) .option("hoodie.clean.automatic", "true") - .option("hoodie.keep.min.commits", "4") - .option("hoodie.keep.max.commits", "5") - .option("hoodie.clean.commits.retained", "3") + .option("hoodie.keep.min.commits", "16") Review Comment: hey @bvaradar : I had a long discussion w/ @danny0405 on this. here is the issue: our data table archival polls for compaction commit and does something based on that. Before this patch, we had a bug no canSchedule call wrt compaction scheduling. This patched fixed the issue as you might know. But this specific test started to fail w/ FileNotFound. Apparently in HoodieCDCExtractor, we parse all entries in commit metadata and [poll file status/listStatus](https://github.com/apache/hudi/blob/c8dec0ef523e998b80838af9d52323e3ae95cebf/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L331) for them. Prior to this patch, due to archival behavior, every commit in active timeline is uncleaned and we were good. after this patch, apparently archival is trailing, and so we do have some commits in active timeline which are cleaned up and hence the fileStatus/listStatus polling results in FileNotFound Issue. Danny agreed that this is a long pending/known limitation and we could definitely improve user exp by throwing some legible exception/error msg. But it increases the scope of this patch. so, we agreed to tweak the test so that it will not hit FileNotFound issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
bvaradar commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1600600472 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala: ## @@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase { val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2)) inputDF7.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) +totalInsertedCnt += 7 val records8 = recordsToStrings(dataGen.generateInserts("007", 3)).asScala.toList val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2)) inputDF8.write.format("org.apache.hudi") .options(options) + .option("hoodie.compact.inline", "false") .mode(SaveMode.Append) .save(basePath) val instant8 = metaClient.reloadActiveTimeline.lastInstant().get() val commitTime8 = instant8.getTimestamp +totalInsertedCnt += 3 // 8. Upsert Operation With Clean Operation -val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 30)).asScala.toList -val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2)) +val inputDF9 = inputDF6.limit(30) // 30 updates to inserts added after insert overwrite table. if not for this, updates generated from datagne, +// could split as inserts and updates from hudi standpoint due to insert overwrite table operation. inputDF9.write.format("org.apache.hudi") .options(options) .option("hoodie.clean.automatic", "true") - .option("hoodie.keep.min.commits", "4") - .option("hoodie.keep.max.commits", "5") - .option("hoodie.clean.commits.retained", "3") + .option("hoodie.keep.min.commits", "16") Review Comment: @nsivabalan : Why did we change this test along with this PR ? I am not able to follow the comments added inline. cc @yihua -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2111018843 ## CI report: * 21fca8151efa9b404eac415e7432606f1bfb9658 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23928) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110945729 ## CI report: * 704439adb77ebe779febb07b529df5ca280f19d1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23920) * 21fca8151efa9b404eac415e7432606f1bfb9658 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23928) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110924294 ## CI report: * 704439adb77ebe779febb07b529df5ca280f19d1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23920) * 21fca8151efa9b404eac415e7432606f1bfb9658 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110602326 ## CI report: * 704439adb77ebe779febb07b529df5ca280f19d1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23920) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2110586327 ## CI report: * ae80da44ef0b736d26626a961aaac8fd017e80a8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23295) * 704439adb77ebe779febb07b529df5ca280f19d1 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
yihua commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2092034933 @nsivabalan could you rebase the PR on the latest master and address the review comments? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: nope. I am saying HoodieCDCExtractor.java has some inherent bug which I am trying to fix here. lets say timeline is as follows dc1 dc2 rc3 dc4 clean5 // cleans up data files from dc1 and dc2 since it was replaced by rc3. as per master, HoodieCDCExtractor goes over commit metadata in activetimeline and tries to deduce base files for log files found. In this case, all data files from dc1 and dc2 are already deleted by clean5. but HoodieCDCExtractor tries to parse data files from dc1 and dc2 and so we might hit file not found issue as per master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: nope. I am saying HoodieCDCExtractor.java has some inherent bug which I am trying to fix here. lets say timeline is as follows dc1 dc2 rc3 dc4 clean5 // cleans up data files from dc1 and dc2 since it was replaced by rc3. as per master, HoodieCDCExtractor goes over commit metadata in activetimeline and tries to deduce base files for log files found. In this case, all data files from dc1 and dc2 are already deleted by clean5. And so we might hit file not found issue as per master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2060202131 ## CI report: * ae80da44ef0b736d26626a961aaac8fd017e80a8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23295) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2060162686 ## CI report: * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996) * ae80da44ef0b736d26626a961aaac8fd017e80a8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23295) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
danny0405 commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1568087606 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: > we are looking to parse and fetch base files for dc1 and dc2 as well which could have been cleaned up by the cleaner and may not exists. Are you saying the cleaning state has changed because of the archiving? If that is true, maybe we should just adjust the cleaning strategy to keep more historical commits. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2060153308 ## CI report: * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996) * ae80da44ef0b736d26626a961aaac8fd017e80a8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1568072027 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: hey @danny0405 @yihua : Can you folks rope in someone to assist w/ CDC fixes here. I am not sure if this is the right fix. but let me go over the gist. There was a test failure which I was trying to chase and realized we might have some gaps here. I am not going go to over exact scenario that the test failed. but the gist is this. dc1 dc2 rc3 dc4 clean5 in this case, we are looking to parse and fetch base files for dc1 and dc2 as well which could have been cleaned up by the cleaner and may not exists. Before this fix, the test was set up such that, timeline was rc3 dc4 clean5 and so we did not hit this issue. When I made the fix to ensure getLatestDeltaCommits does the right thing (which archival is using btw), we ended up w/ timeline as dc1 dc2 rc3 dc4 clean5 and ``` private HoodieCDCFileSplit parseWriteStat( HoodieFileGroupId fileGroupId, HoodieInstant instant, HoodieWriteStat writeStat, WriteOperationType operation) { ``` in HoodieCDCExtractor is failing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1568072805 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: So, I tried making a fix. first collect all replacedFileIds. and then ignore those while parsing other commit metadata. I was able to get past the parsing issue. but assertions for record counts are failing. probably anyone who has worked on CDC might be better to fix this. Test case of interest: TestCDCDataFrameSuite.testMORDataSourceWrite -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1568072027 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: hey @danny0405 @yihua : Can you folks rope in someone to assist w/ CDC fixes here. I am not sure if this is the right fix. but let me go over the gist. There was a test failure which I was trying to chase and realized we might have some gaps here. I am not going go to over exact scenario that the test failed. but the gist is this. dc1 dc2 rc3 dc4 clean5 in this case, we are looking to parse and fetch base files for dc1 and dc2 as well which could have been cleaned up by the cleaner and may not exists. Before this fix, the test was set up such that, timeline was rc3 dc4 clean5 and so we did not hit this issue. When I made the fix to ensure getLatestDeltaCommits does the right thing (which archival is using btw), we ended up w/ timeline as dc1 dc2 rc3 dc4 clean5 and private HoodieCDCFileSplit parseWriteStat( HoodieFileGroupId fileGroupId, HoodieInstant instant, HoodieWriteStat writeStat, WriteOperationType operation) { in HoodieCDCExtractor is failing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
codope commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1549072910 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java: ## @@ -147,6 +147,12 @@ public HoodieTimeline filterCompletedAndCompactionInstants() { || s.getAction().equals(HoodieTimeline.COMPACTION_ACTION)), details); } + @Override + public HoodieTimeline filterCompletedCommitInstants() { Review Comment: Let's unit test this. ## hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java: ## @@ -285,8 +285,7 @@ public static List getPendingCompactionInstantTimes(HoodieTableMe */ public static Option> getDeltaCommitsSinceLatestCompaction( HoodieActiveTimeline activeTimeline) { -Option lastCompaction = activeTimeline.getCommitTimeline() -.filterCompletedInstants().lastInstant(); +Option lastCompaction = activeTimeline.filterCompletedCommitInstants().lastInstant(); Review Comment: This change is causing a test failure - `TestCDCDataFrameSuite.testMORDataSourceWrite`. Can you please check that? Also, I think instead of adding yet another API, we can clean up the existing ones. For instance, have separate method for each action - `getTimeline`. So, `getCommitTimeline` only checks for COMMIT action. ## hudi-common/src/test/java/org/apache/hudi/common/util/TestCompactionUtils.java: ## @@ -272,6 +272,59 @@ public void testGetDeltaCommitsSinceLatestCompaction(boolean hasCompletedCompact } } + @Test + public void testGetDeltaCommitsSinceLastCompactionWithCompletedReplaceCommits() { Review Comment: +1 for the test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
danny0405 commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1538463640 ## hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java: ## @@ -285,8 +285,7 @@ public static List getPendingCompactionInstantTimes(HoodieTableMe */ public static Option> getDeltaCommitsSinceLatestCompaction( HoodieActiveTimeline activeTimeline) { -Option lastCompaction = activeTimeline.getCommitTimeline() -.filterCompletedInstants().lastInstant(); +Option lastCompaction = activeTimeline.filterCompletedCommitInstants().lastInstant(); Review Comment: And we better not reuse the action name for replace commit and compaction commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
bvaradar commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1538107857 ## hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java: ## @@ -285,8 +285,7 @@ public static List getPendingCompactionInstantTimes(HoodieTableMe */ public static Option> getDeltaCommitsSinceLatestCompaction( HoodieActiveTimeline activeTimeline) { -Option lastCompaction = activeTimeline.getCommitTimeline() -.filterCompletedInstants().lastInstant(); +Option lastCompaction = activeTimeline.filterCompletedCommitInstants().lastInstant(); Review Comment: @nsivabalan : Can you rename getCommitTimeline() to getCommitAndReplaceTimeline() and then add an API getCommitOnlyTimeline(). The current naming of these APIs are very problematic and can easily cause bugs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2016212906 ## CI report: * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2016112557 ## CI report: * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22996) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
hudi-bot commented on PR #10915: URL: https://github.com/apache/hudi/pull/10915#issuecomment-2016092933 ## CI report: * acfe81fa3814c77cd04a39eb2bbddb9960bdc437 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org