yihua commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445501705


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+          .findFirst());
+
+      Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s 
-> !s.isCompleted()).firstInstant();
+      if (earliestPendingCommit.isPresent()
+          && (!earliestCommitToRetain.isPresent()
+          || compareTimestamps(earliestPendingCommit.get().requestedTime(), 
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+        earliestCommitToRetain = 
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();

Review Comment:
   If the earliest pending commit precedes all completed commits, 
`findInstantsBefore(...).lastInstant()` returns empty and the by-hours ECTR 
collapses to `Option.empty()`. Please confirm an empty ECTR means "retain 
everything" (not "clean everything") downstream, and add a test for the 
pending-older-than-all-completed case — the new test only covers a completed 
commit existing before the pending one.



##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))

Review Comment:
   This still selects the first instant `>=` the cutoff, but the PR description 
says the fix is to pick the instant *at or before* the cutoff. That change 
isn't here, so the sparse-commit-density bug described in the issue isn't 
actually addressed — only the new pending-instant guard below is. Please 
reconcile the description with the code (or add the intended change).



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/versioning/v2/TimelineArchiverV2.java:
##########
@@ -236,7 +237,11 @@ private List<HoodieInstant> getCommitInstantsToArchive() 
throws IOException {
             config.getCleanerPolicy());
     
earliestInstantToRetainCandidates.add(earliestInstantToRetainForClustering);
 
-    // 4. If metadata table is enabled, do not archive instants which are more 
recent than the last compaction on the
+    // 4. If enabled, block archival based on ECTR from the last completed 
clean to ensure we don't archive
+    // commits that have data files that haven't been cleaned yet.
+    
earliestInstantToRetainCandidates.add(getEarliestInstantToRetainForClean(table, 
completedCommitsTimeline, config));

Review Comment:
   This makes TimelineArchiverV2 (LSM/v9) block on clean ECTR, inverting the 
previously-asserted behavior (the old test documented V1-only). Please confirm 
V1-only was an oversight rather than a deliberate LSM restriction. Note this 
also adds a new failure mode: V2 archival now aborts with `HoodieIOException` 
if the clean-metadata read fails (when the config is enabled).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to