nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1600668418


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:
##########
@@ -407,27 +409,32 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
     val inputDF7 = spark.read.json(spark.sparkContext.parallelize(records7, 2))
     inputDF7.write.format("org.apache.hudi")
       .options(options)
+      .option("hoodie.compact.inline", "false")
       .mode(SaveMode.Append)
       .save(basePath)
+    totalInsertedCnt += 7
 
     val records8 = recordsToStrings(dataGen.generateInserts("007", 3)).asScala.toList
     val inputDF8 = spark.read.json(spark.sparkContext.parallelize(records8, 2))
     inputDF8.write.format("org.apache.hudi")
       .options(options)
+      .option("hoodie.compact.inline", "false")
       .mode(SaveMode.Append)
       .save(basePath)
     val instant8 = metaClient.reloadActiveTimeline.lastInstant().get()
     val commitTime8 = instant8.getTimestamp
+    totalInsertedCnt += 3
 
     // 8. Upsert Operation With Clean Operation
-    val records9 = recordsToStrings(dataGen.generateUniqueUpdates("008", 30)).asScala.toList
-    val inputDF9 = spark.read.json(spark.sparkContext.parallelize(records9, 2))
+    val inputDF9 = inputDF6.limit(30) // 30 updates to the inserts added after insert overwrite table. Without this, updates generated from the datagen
+    // could split into inserts and updates from Hudi's standpoint due to the insert overwrite table operation.
     inputDF9.write.format("org.apache.hudi")
       .options(options)
       .option("hoodie.clean.automatic", "true")
-      .option("hoodie.keep.min.commits", "4")
-      .option("hoodie.keep.max.commits", "5")
-      .option("hoodie.clean.commits.retained", "3")
+      .option("hoodie.keep.min.commits", "16")

Review Comment:
   hey @bvaradar: I had a long discussion w/ @danny0405 on this.
   Here is the issue:
   Our data table archival polls for the compaction commit and acts based on it. Before this patch, we had a bug: there was no canSchedule check wrt compaction scheduling. This patch fixed that issue, as you might know.
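   
   To make the gap concrete, here is a minimal Scala sketch of the kind of guard that was missing. `CompactionClient`, `canScheduleCompaction`, and `maybeScheduleCompaction` are illustrative names only, not the actual Hudi internals this patch touches:
   
   ```scala
   // Illustrative stand-in for the write client's compaction surface; these
   // names are hypothetical, not the real Hudi APIs changed by this patch.
   trait CompactionClient {
     def canScheduleCompaction: Boolean // e.g. enough delta commits have accrued
     def scheduleCompaction(): String   // returns the new compaction instant time
   }
   
   // Before this patch, scheduling proceeded without the canSchedule check;
   // with the check in place, archival (which polls for compaction commits)
   // only sees compaction instants that were actually legal to schedule.
   def maybeScheduleCompaction(client: CompactionClient): Option[String] =
     if (client.canScheduleCompaction) Some(client.scheduleCompaction()) else None
   ```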
   
   But this specific test started to fail w/ FileNotFound. Apparently in HoodieCDCExtractor, we parse all entries in the commit metadata and [poll file status/listStatus](https://github.com/apache/hudi/blob/c8dec0ef523e998b80838af9d52323e3ae95cebf/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L331) for them.
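   
   To illustrate, here is a simplified Scala sketch of that pattern, written directly against the Hadoop FileSystem API (the real logic is the Java code linked above):
   
   ```scala
   import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
   
   // For every file path recorded in commit metadata, poll its file status.
   // If the cleaner already removed a file that an active-timeline commit
   // still references, getFileStatus throws FileNotFoundException.
   def pollFileStatuses(fs: FileSystem, pathsFromCommitMetadata: Seq[Path]): Seq[FileStatus] =
     pathsFromCommitMetadata.map(fs.getFileStatus) // may throw FileNotFoundException
   ```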
   
   Prior to this patch, due to the archival behavior, every commit in the active timeline was uncleaned and we were good. After this patch, archival is apparently trailing, so we do have some commits in the active timeline whose files have been cleaned up, and hence the fileStatus/listStatus polling results in a FileNotFound issue.
   
   Danny agreed that this is a long-pending/known limitation and that we could definitely improve the user experience by throwing a legible exception/error msg. But that increases the scope of this patch, so we agreed to tweak the test so that it will not hit the FileNotFound issue.
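   
   For the record, the improvement we discussed would look roughly like the sketch below: catch the raw FileNotFoundException and rethrow a legible HoodieException. This only illustrates the idea we deferred; it is not code in this patch:
   
   ```scala
   import java.io.FileNotFoundException
   import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
   import org.apache.hudi.exception.HoodieException
   
   // Wrap the raw FileNotFoundException in a message that tells the user what
   // happened: a commit still on the active timeline was already cleaned up.
   def getFileStatusOrExplain(fs: FileSystem, path: Path): FileStatus =
     try fs.getFileStatus(path)
     catch {
       case e: FileNotFoundException =>
         throw new HoodieException(
           s"CDC read failed: $path is referenced by a commit on the active " +
             "timeline but was already cleaned up; retain more commits for the cleaner", e)
     }
   ```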
   
   


