fhan688 commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3458690958
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+ .findFirst());
+
+ Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s
-> !s.isCompleted()).firstInstant();
+ if (earliestPendingCommit.isPresent()
+ && (!earliestCommitToRetain.isPresent()
+ || compareTimestamps(earliestPendingCommit.get().requestedTime(),
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+ earliestCommitToRetain =
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();
Review Comment:
> > > Yes, findInstantsBefore(...).lastInstant() can return empty in that
case. Downstream, an empty ECTR means no partition scan/no cleaning for
commit/hour based cleanin
> >
> >
> > Could you elaborate why the cleaing be blocked by a pending instant
here? since we have completion time file slicing(pending files always belong to
the latest one), it is okey to forward the cleaning, and the ECTR should be
based on completion time too?
>
> Thanks, this is a good point. I will re-check this against the current
completion-time file slicing behavior. If pending files are always attached to
the latest slice and the cleaner already preserves that slice, the by- hours
pending guard may be overly conservative.
>
> I’ll verify this with a clean-plan level test instead of only testing
`CleanerUtils`. I’ll also check whether the by-hours ECTR should be derived
from completion time rather than requested time. If the existing file- slice
protection is sufficient, I’ll remove the pending guard change and update the
PR scope accordingly.
Additional Information:
I validated the file-slice protection path. Pending log files with no
completion time are assigned to the latest file slice, and the cleaner retains
the latest slice, so forcing a by-hours ECTR after the pending instant
does not schedule the pending log for deletion. The pending-files check in
CleanPlanner mainly protects partition deletion.
That said, your point about by-hours ECTR being based on completion time
looks valid. KEEP_LATEST_BY_HOURS currently computes ECTR by filtering
completed instants on requestedTime(), while timeline v2 exposes
completion-time ordering. I think we should avoid relying on a global pending
guard for file deletion and instead focus this PR on making by-hours ECTR
completion-time based, plus any archive protection needed around that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]