fhan688 commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3458690958


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+          .findFirst());
+
+      Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s 
-> !s.isCompleted()).firstInstant();
+      if (earliestPendingCommit.isPresent()
+          && (!earliestCommitToRetain.isPresent()
+          || compareTimestamps(earliestPendingCommit.get().requestedTime(), 
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+        earliestCommitToRetain = 
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();

Review Comment:
   > > > Yes, findInstantsBefore(...).lastInstant() can return empty in that 
case. Downstream, an empty ECTR means no partition scan/no cleaning for 
commit/hour based cleanin
   > > 
   > > 
   > > Could you elaborate why the cleaing be blocked by a pending instant 
here? since we have completion time file slicing(pending files always belong to 
the latest one), it is okey to forward the cleaning, and the ECTR should be 
based on completion time too?
   > 
   > Thanks, this is a good point. I will re-check this against the current 
completion-time file slicing behavior. If pending files are always attached to 
the latest slice and the cleaner already preserves that slice, the by- hours 
pending guard may be overly conservative.
   > 
   > I’ll verify this with a clean-plan level test instead of only testing 
`CleanerUtils`. I’ll also check whether the by-hours ECTR should be derived 
from completion time rather than requested time. If the existing file- slice 
protection is sufficient, I’ll remove the pending guard change and update the 
PR scope accordingly.
   
   Additional Information:
   I validated the file-slice protection path. Pending log files with no 
completion time are assigned to the latest file slice, and the cleaner retains 
the latest slice, so forcing a by-hours ECTR after the pending instant
     does not schedule the pending log for deletion. The pending-files check in 
CleanPlanner mainly protects partition deletion.
   
     That said, your point about by-hours ECTR being based on completion time 
looks valid. KEEP_LATEST_BY_HOURS currently computes ECTR by filtering 
completed instants on requestedTime(), while timeline v2 exposes 
completion-time ordering. I think we should avoid relying on a global pending 
guard for file deletion and instead focus this PR on making by-hours ECTR 
completion-time based, plus any archive protection needed around that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to