yihua commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445501704


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))

Review Comment:
   This still selects the first instant `>=` the cutoff, but the PR description 
says the fix is to pick the instant at or before the cutoff. Should this be 
selecting the instant at or before the cutoff instead? As written, only the new 
pending-instant guard below changes behavior, so I am not sure the 
sparse-commit-density case from the issue is covered. Could the description and 
code be reconciled here?



##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+          .findFirst());
+
+      Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s 
-> !s.isCompleted()).firstInstant();
+      if (earliestPendingCommit.isPresent()
+          && (!earliestCommitToRetain.isPresent()
+          || compareTimestamps(earliestPendingCommit.get().requestedTime(), 
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+        earliestCommitToRetain = 
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();

Review Comment:
   If the earliest pending commit precedes all completed commits, would 
`findInstantsBefore(...).lastInstant()` return empty here, collapsing the 
by-hours ECTR to `Option.empty()`? If so, does an empty ECTR mean "retain 
everything" rather than "clean everything" downstream? Might it be worth a test 
for the pending-older-than-all-completed case too, since the new test only 
covers a completed commit existing before the pending one?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/versioning/v2/TimelineArchiverV2.java:
##########
@@ -236,7 +237,11 @@ private List<HoodieInstant> getCommitInstantsToArchive() 
throws IOException {
             config.getCleanerPolicy());
     
earliestInstantToRetainCandidates.add(earliestInstantToRetainForClustering);
 
-    // 4. If metadata table is enabled, do not archive instants which are more 
recent than the last compaction on the
+    // 4. If enabled, block archival based on ECTR from the last completed 
clean to ensure we don't archive
+    // commits that have data files that haven't been cleaned yet.
+    
earliestInstantToRetainCandidates.add(getEarliestInstantToRetainForClean(table, 
completedCommitsTimeline, config));

Review Comment:
   This makes TimelineArchiverV2 (LSM/v9) block on clean ECTR, which inverts 
the previously-asserted behavior (the old test documented V1-only). Was V1-only 
an oversight rather than a deliberate LSM restriction? One thing to watch: V2 
archival now also aborts with `HoodieIOException` if the clean-metadata read 
fails (when the config is enabled). Is that an acceptable new failure mode here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to