[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit

2022-12-29 Thread GitBox


SteNicholas commented on code in PR #7568:
URL: https://github.com/apache/hudi/pull/7568#discussion_r1058910341


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -432,6 +433,11 @@ private Stream getCommitInstantsToArchive() 
{
   table.getActiveTimeline(), 
config.getInlineCompactDeltaCommitMax())
   : Option.empty();
 
+  // The clustering commit instant can not be archived unless we ensure 
that the replaced files have been cleaned,
+  // without the replaced files metadata on the timeline, the fs view 
would expose duplicates for readers.
+  Option oldestInstantToRetainForClustering =

Review Comment:
   @leesf, refer to the naming of `oldestInstantToRetainForCompaction`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit

2022-12-28 Thread GitBox


SteNicholas commented on code in PR #7568:
URL: https://github.com/apache/hudi/pull/7568#discussion_r1058196194


##
hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java:
##
@@ -224,4 +225,36 @@ public static List 
getPendingClusteringInstantTimes(HoodieTableMe
   public static boolean isPendingClusteringInstant(HoodieTableMetaClient 
metaClient, HoodieInstant instant) {
 return getClusteringPlan(metaClient, instant).isPresent();
   }
+
+  /**
+   * Checks whether the latest clustering instant has a subsequent cleaning 
action. Returns
+   * the clustering instant if there is such cleaning action or empty.
+   *
+   * @param activeTimeline The active timeline
+   * @return the oldest instant to retain for clustering
+   */
+  public static Option 
getOldestInstantToRetainForClustering(HoodieActiveTimeline activeTimeline)
+  throws IOException {
+Option cleanInstantOpt =
+activeTimeline.getCleanerTimeline().filter(instant -> 
!instant.isCompleted()).firstInstant();
+if (cleanInstantOpt.isPresent()) {
+  // The first clustering instant of which timestamp is greater than or 
equal to the earliest commit to retain of
+  // the clean metadata.

Review Comment:
   @danny0405, +1, I have added the check whether the completed replace 
timeline is empty.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit

2022-12-28 Thread GitBox


SteNicholas commented on code in PR #7568:
URL: https://github.com/apache/hudi/pull/7568#discussion_r1058192365


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -487,7 +487,24 @@ public Option getEarliestCommitToRetain() {
 int hoursRetained = config.getCleanerHoursRetained();
 if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS
 && commitTimeline.countInstants() > commitsRetained) {
-  earliestCommitToRetain = 
commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); 
//15 instants total, 10 commits to retain, this gives 6th instant in the list
+  Option earliestPendingCommits = 
hoodieTable.getMetaClient()
+  .getActiveTimeline()
+  .getCommitsTimeline()
+  .filter(s -> !s.isCompleted()).firstInstant();
+  if (earliestPendingCommits.isPresent()) {
+// Earliest commit to retain must not be later than the earliest 
pending commit
+earliestCommitToRetain =
+commitTimeline.nthInstant(commitTimeline.countInstants() - 
commitsRetained).map(nthInstant -> {
+  if (nthInstant.compareTo(earliestPendingCommits.get()) <= 0) {

Review Comment:
   @danny0405, the `getPartitionPathsForFullCleaning` doesn't invoke this 
method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit

2022-12-28 Thread GitBox


SteNicholas commented on code in PR #7568:
URL: https://github.com/apache/hudi/pull/7568#discussion_r1058184448


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -487,7 +487,24 @@ public Option getEarliestCommitToRetain() {
 int hoursRetained = config.getCleanerHoursRetained();
 if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS
 && commitTimeline.countInstants() > commitsRetained) {
-  earliestCommitToRetain = 
commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); 
//15 instants total, 10 commits to retain, this gives 6th instant in the list
+  Option earliestPendingCommits = 
hoodieTable.getMetaClient()
+  .getActiveTimeline()
+  .getCommitsTimeline()
+  .filter(s -> !s.isCompleted()).firstInstant();
+  if (earliestPendingCommits.isPresent()) {
+// Earliest commit to retain must not be later than the earliest 
pending commit
+earliestCommitToRetain =
+commitTimeline.nthInstant(commitTimeline.countInstants() - 
commitsRetained).map(nthInstant -> {
+  if (nthInstant.compareTo(earliestPendingCommits.get()) <= 0) {

Review Comment:
   @danny0405, this check is to ensure the earliest commit to retain must not 
be later than the earliest pending commit. The cleaner couldn't clean 
uncompleted instant and should keep the linear cleaning for incremental clean 
mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org