[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit
SteNicholas commented on code in PR #7568: URL: https://github.com/apache/hudi/pull/7568#discussion_r1058910341 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -432,6 +433,11 @@ private Stream getCommitInstantsToArchive() { table.getActiveTimeline(), config.getInlineCompactDeltaCommitMax()) : Option.empty(); + // The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, + // without the replaced files metadata on the timeline, the fs view would expose duplicates for readers. + Option oldestInstantToRetainForClustering = Review Comment: @leesf, refer to the naming of `oldestInstantToRetainForCompaction`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit
SteNicholas commented on code in PR #7568: URL: https://github.com/apache/hudi/pull/7568#discussion_r1058196194 ## hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java: ## @@ -224,4 +225,36 @@ public static List getPendingClusteringInstantTimes(HoodieTableMe public static boolean isPendingClusteringInstant(HoodieTableMetaClient metaClient, HoodieInstant instant) { return getClusteringPlan(metaClient, instant).isPresent(); } + + /** + * Checks whether the latest clustering instant has a subsequent cleaning action. Returns + * the clustering instant if there is such cleaning action or empty. + * + * @param activeTimeline The active timeline + * @return the oldest instant to retain for clustering + */ + public static Option getOldestInstantToRetainForClustering(HoodieActiveTimeline activeTimeline) + throws IOException { +Option cleanInstantOpt = +activeTimeline.getCleanerTimeline().filter(instant -> !instant.isCompleted()).firstInstant(); +if (cleanInstantOpt.isPresent()) { + // The first clustering instant of which timestamp is greater than or equal to the earliest commit to retain of + // the clean metadata. Review Comment: @danny0405, +1, I have added the check whether the completed replace timeline is empty. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit
SteNicholas commented on code in PR #7568: URL: https://github.com/apache/hudi/pull/7568#discussion_r1058192365 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: ## @@ -487,7 +487,24 @@ public Option getEarliestCommitToRetain() { int hoursRetained = config.getCleanerHoursRetained(); if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS && commitTimeline.countInstants() > commitsRetained) { - earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); //15 instants total, 10 commits to retain, this gives 6th instant in the list + Option earliestPendingCommits = hoodieTable.getMetaClient() + .getActiveTimeline() + .getCommitsTimeline() + .filter(s -> !s.isCompleted()).firstInstant(); + if (earliestPendingCommits.isPresent()) { +// Earliest commit to retain must not be later than the earliest pending commit +earliestCommitToRetain = +commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained).map(nthInstant -> { + if (nthInstant.compareTo(earliestPendingCommits.get()) <= 0) { Review Comment: @danny0405, the `getPartitionPathsForFullCleaning` doesn't invoke this method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit
SteNicholas commented on code in PR #7568: URL: https://github.com/apache/hudi/pull/7568#discussion_r1058184448 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: ## @@ -487,7 +487,24 @@ public Option getEarliestCommitToRetain() { int hoursRetained = config.getCleanerHoursRetained(); if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS && commitTimeline.countInstants() > commitsRetained) { - earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); //15 instants total, 10 commits to retain, this gives 6th instant in the list + Option earliestPendingCommits = hoodieTable.getMetaClient() + .getActiveTimeline() + .getCommitsTimeline() + .filter(s -> !s.isCompleted()).firstInstant(); + if (earliestPendingCommits.isPresent()) { +// Earliest commit to retain must not be later than the earliest pending commit +earliestCommitToRetain = +commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained).map(nthInstant -> { + if (nthInstant.compareTo(earliestPendingCommits.get()) <= 0) { Review Comment: @danny0405, this check is to ensure the earliest commit to retain must not be later than the earliest pending commit. The cleaner couldn't clean uncompleted instant and should keep the linear cleaning for incremental clean mode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org