Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


danny0405 commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1674899742


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java:
##
@@ -437,6 +437,15 @@ static boolean compareTimestamps(String commit1, 
BiPredicate pre
 return predicateToApply.test(commit1, commit2);
   }
 
+  static String minTimestamp(String commit1, String commit2) {
+if (StringUtils.isNullOrEmpty(commit1)) {
+  return commit2;
+} else if (StringUtils.isNullOrEmpty(commit2)) {
+  return commit1;
+}
+return GREATER_THAN.test(commit1, commit2) ? commit2 : commit1;
+  }
+
   /**

Review Comment:
   Should we reuse `minInstant` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


danny0405 commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1674895597


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -149,17 +150,24 @@ HoodieCleanerPlan requestClean(HoodieEngineContext 
context) {
   .map(x -> new HoodieActionInstant(x.getTimestamp(), x.getAction(), 
x.getState().name())).orElse(null),
   planner.getLastCompletedCommitTimestamp(),
   config.getCleanerPolicy().name(), Collections.emptyMap(),
-  CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, 
partitionsToDelete, prepareExtraMetadata(planner.getSavepointedTimestamps()));
+  CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, 
partitionsToDelete, prepareExtraMetadata(planner.getSavepointedTimestamps(), 
planner.getEarliestCommitToNotArchive()));
 } catch (IOException e) {
   throw new HoodieIOException("Failed to schedule clean operation", e);
 }
   }
 
-  private Map prepareExtraMetadata(List 
savepointedTimestamps) {
-if (savepointedTimestamps.isEmpty()) {
+  private Map prepareExtraMetadata(List 
savepointedTimestamps, Option earliestCommitToNotArchive) {
+if (savepointedTimestamps.isEmpty() && 
!earliestCommitToNotArchive.isPresent()) {
   return Collections.emptyMap();
 } else {
-  return Collections.singletonMap(SAVEPOINTED_TIMESTAMPS, 
savepointedTimestamps.stream().collect(Collectors.joining(",")));
+  Map extraMetadata = new HashMap<>();
+  if (!savepointedTimestamps.isEmpty()) {
+extraMetadata.put(SAVEPOINTED_TIMESTAMPS, 
savepointedTimestamps.stream().collect(Collectors.joining(",")));
+  }
+  if (earliestCommitToNotArchive.isPresent()) {
+extraMetadata.put(EARLIEST_COMMIT_TO_NOT_ARCHIVE, 
earliestCommitToNotArchive.get());

Review Comment:
   Can we put the name more straight-forward like `SKIPPED_REPLACED_SAVEPOINTS`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


danny0405 commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1674898261


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -553,13 +552,62 @@ public Pair> 
getDeletePaths(String partitionPath, O
* Returns the earliest commit to retain based on cleaning policy.
*/
   public Option getEarliestCommitToRetain() {
-return CleanerUtils.getEarliestCommitToRetain(
-
hoodieTable.getMetaClient().getActiveTimeline().getCommitsAndCompactionTimeline(),
-config.getCleanerPolicy(),
-config.getCleanerCommitsRetained(),
-Instant.now(),
-config.getCleanerHoursRetained(),
-hoodieTable.getMetaClient().getTableConfig().getTimelineTimezone());
+if (!earliestCommitToRetain.isPresent()) {
+  earliestCommitToRetain = CleanerUtils.getEarliestCommitToRetain(
+  
hoodieTable.getMetaClient().getActiveTimeline().getCommitsAndCompactionTimeline(),
+  config.getCleanerPolicy(),
+  config.getCleanerCommitsRetained(),
+  Instant.now(),
+  config.getCleanerHoursRetained(),
+  hoodieTable.getMetaClient().getTableConfig().getTimelineTimezone());
+}
+return earliestCommitToRetain;
+  }
+
+  /**
+   * Returns earliest commit to not archive to assist with guarding archival.
+   * @return
+   */
+  public Option getEarliestCommitToNotArchive() {
+// compute only if not set
+if (!earliestCommitToNotArchiveOpt.isPresent() && 
config.getCleanerPolicy() != HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS) { 
// earliest commit to retain and earliest commit to not archive
+  // makes sense only when clean policy is num commits based or hours 
based.
+  Option earliestToRetain = getEarliestCommitToRetain();
+  String earliestCommitToNotArchive = null;
+  if (!savepointedTimestamps.isEmpty()) { // if savepoints are found.
+// find the first savepoint timestamp
+String firstSavepoint = 
savepointedTimestamps.stream().sorted().findFirst().get();
+earliestCommitToNotArchive = firstSavepoint;
+
+// earliest commit to not archive = min (earliest commit to retain, 
earliest commit to not archive)
+if (earliestToRetain.isPresent()) {
+  String earliestToRetainTimestamp = 
earliestToRetain.get().getTimestamp();
+  // When earliestToRetainTimestamp is greater than first savepoint 
timestamp but there are no
+  // replace commits between the first savepoint and the 
earliestToRetainTimestamp, we can set the
+  // earliestCommitToNotArchive to earliestToRetainTimestamp
+  if (HoodieTimeline.compareTimestamps(earliestToRetainTimestamp, 
HoodieTimeline.GREATER_THAN, firstSavepoint)) {
+HoodieTimeline replaceTimeline = 
hoodieTable.getActiveTimeline().getCompletedReplaceTimeline().findInstantsInClosedRange(firstSavepoint,
 earliestToRetainTimestamp);
+earliestCommitToNotArchive = replaceTimeline.empty() ? 
earliestToRetainTimestamp : firstSavepoint;
+LOG.info(String.format("Setting Earliest Commit to Not Archive to 
%s since replace timeline is empty, earliestCommitToRetain: %s, total list of 
savepointed timestamps : %s",
+earliestCommitToNotArchive, earliestToRetain, 
Arrays.toString(savepointedTimestamps.toArray(;
+  } else {
+earliestCommitToNotArchive = earliestToRetainTimestamp;
+LOG.info(String.format("Setting Earliest Commit to Not Archive to 
%s, earliestCommitToRetain: %s, total list of savepointed timestamps : %s", 
earliestCommitToNotArchive,
+earliestToRetain, 
Arrays.toString(savepointedTimestamps.toArray(;
+  }
+}
+  } else {
+// no savepoints found.
+// lets set the value regardless. Would help us differentiate older 
versions of hudi where this will not be set vs latest version where this will 
be set to either earliest savepoint
+// or the earliest commit to retain.
+if (earliestToRetain.isPresent()) {
+  LOG.info(String.format("Setting Earliest Commit to Not Archive same 
as Earliest commit to retain to %s", earliestToRetain.get().getTimestamp()));
+  earliestCommitToNotArchive = earliestToRetain.get().getTimestamp();
+}
+  }
+  earliestCommitToNotArchiveOpt = 
Option.ofNullable(earliestCommitToNotArchive);

Review Comment:
   Can we use a separate variable like `skippedReplacedSavepoints` apart from 
the `earliestToRetainTimestamp` ? And let the archiver to decide which one to 
use.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


danny0405 commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1674895597


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -149,17 +150,24 @@ HoodieCleanerPlan requestClean(HoodieEngineContext 
context) {
   .map(x -> new HoodieActionInstant(x.getTimestamp(), x.getAction(), 
x.getState().name())).orElse(null),
   planner.getLastCompletedCommitTimestamp(),
   config.getCleanerPolicy().name(), Collections.emptyMap(),
-  CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, 
partitionsToDelete, prepareExtraMetadata(planner.getSavepointedTimestamps()));
+  CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, 
partitionsToDelete, prepareExtraMetadata(planner.getSavepointedTimestamps(), 
planner.getEarliestCommitToNotArchive()));
 } catch (IOException e) {
   throw new HoodieIOException("Failed to schedule clean operation", e);
 }
   }
 
-  private Map prepareExtraMetadata(List 
savepointedTimestamps) {
-if (savepointedTimestamps.isEmpty()) {
+  private Map prepareExtraMetadata(List 
savepointedTimestamps, Option earliestCommitToNotArchive) {
+if (savepointedTimestamps.isEmpty() && 
!earliestCommitToNotArchive.isPresent()) {
   return Collections.emptyMap();
 } else {
-  return Collections.singletonMap(SAVEPOINTED_TIMESTAMPS, 
savepointedTimestamps.stream().collect(Collectors.joining(",")));
+  Map extraMetadata = new HashMap<>();
+  if (!savepointedTimestamps.isEmpty()) {
+extraMetadata.put(SAVEPOINTED_TIMESTAMPS, 
savepointedTimestamps.stream().collect(Collectors.joining(",")));
+  }
+  if (earliestCommitToNotArchive.isPresent()) {
+extraMetadata.put(EARLIEST_COMMIT_TO_NOT_ARCHIVE, 
earliestCommitToNotArchive.get());

Review Comment:
   Can we put the name more straight-forward like `SKIPPED_SAVEPOINTED_COMMITS`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


danny0405 commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1674893075


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -264,20 +268,46 @@ private List getCommitInstantsToArchive() 
throws IOException {
 .map(Option::get)
 .min(HoodieInstant.COMPARATOR);
 
-// Step2: We cannot archive any commits which are made after the first 
savepoint present,
+// Step2: We cannot archive any commits which are later than 
EarliestCommitToNotArchive.
 // unless HoodieArchivalConfig#ARCHIVE_BEYOND_SAVEPOINT is enabled.
-Option firstSavepoint = 
table.getCompletedSavepointTimeline().firstInstant();
+
+// EarliestCommitToNotArchive is required to block archival when savepoint 
is deleted.
+// This ensures that archival is blocked until clean has cleaned up files 
retained due to savepoint.
+// Since EarliestCommitToNotArchive is advanced by cleaner, it is a way 
for cleaner to notify the archival
+// that cleanup has finished and archival can advance further.
+// If this guard is not present, the archival of commits can lead to 
duplicates. Here is a scenario
+// illustrating the same. This scenario considers a case where 
EarliestCommitToNotArchive guard is not present
+// c1.dc - f1 (c1 deltacommit creates file with id f1)
+// c2.dc - f2 (c2 deltacommit creates file with id f2)
+// c2.sp - Savepoint at c2
+// c3.rc (replacing f2 -> f3) (Replace commit replacing file id f2 with f3)
+// c4.dc
+//
+// Lets say Incremental cleaner moved past the c3.rc without cleaning f2 
since savepoint is created at c2.
+// Archival is blocked at c2 since there is a savepoint at c2.
+// Lets say the savepoint at c2 is now deleted, Archival would archive 
c3.rc since it is unblocked now.
+// Since c3 is archived and f2 has not been cleaned, the table view would 
be considering f2 as a valid
+// file id. This causes duplicates.

Review Comment:
   Can we put the logic inside `getEarliestInstantToRetainForClustering` to 
make the logic around replace_commit more concentrated and there is no need to 
go through the replace_commit timeline multiple times?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-704418

   
   ## CI report:
   
   * 15b2428c2dd7d7980fd98b78643811652e08fd35 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24823)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-595823

   
   ## CI report:
   
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807)
 
   * 15b2428c2dd7d7980fd98b78643811652e08fd35 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24823)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-511800

   
   ## CI report:
   
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807)
 
   * 15b2428c2dd7d7980fd98b78643811652e08fd35 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-11 Thread via GitHub


lokeshj1703 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-500128

   I have added the optimisation. @danny0405 @nsivabalan PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221755217

   > did you guys sync up on that. if not, can we sync up and make forward 
progress.
   
   I thought we have synced up and reached concensus in the last stand-up 
meeting, no?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


nsivabalan commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221624233

   @danny0405 @lokeshj1703 : are we aligned on any optimization we need on top 
of the patch? for eg, we were discussing about replace commit timeline right? 
did you guys sync up on that. if not, can we sync up and make forward progress. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221276113

   
   ## CI report:
   
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221109430

   
   ## CI report:
   
   * 4bb072faa37b2ea398144fbbc24deff966153cfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24471)
 
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221096095

   
   ## CI report:
   
   * 4bb072faa37b2ea398144fbbc24deff966153cfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24471)
 
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


lokeshj1703 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221025516

   I have explained the scenario in the comments. PTAL @nbalajee @danny0405 
   I have also fixed the earliestCommitToNotArchive logic to take into account 
last clean instant and removed separate handling in the cleaner timeline 
archival. The case where earliestCommitToNotArchive > last clean instant will 
not be possible since this timestamp is generated by cleaner but still added a 
check to consider the minimum of last clean instant and 
earliestCommitToNotArchive.
   cc @nsivabalan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-08 Thread via GitHub


nbalajee commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2215519646

   Let's call out that for replacecommit (clustering operation with replace 
fileIDs), archiving a replacecommit before the replaced files are cleaned, 
would result in a data quality issue for the query engines.   Queries would 
include the replaced fileIDs (considering them as valid, as they are archived), 
which may manifest as duplicates in query results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-02 Thread via GitHub


danny0405 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2204767885

   > Whose value will either refer to first savepoint if first savepoint < 
earliest commit to retain.
   
   Does this mean we can not archive beyond any savepoint now?
   
   > Last completed clean instant is required in the timeline to fetch earliest 
commit to not archive.
   
   I don't think this is reasonable in high level, there are some scenarios 
that people will never enable cleaning service, or the cleaning just got stuck 
and the halt of archival would impact the performance of ingestion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-06-17 Thread via GitHub


jonvex commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1642831308


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -362,6 +367,25 @@ private boolean deleteArchivedInstants(List 
activeActions, HoodieE
 return true;
   }
 
+  private Option getEarliestCommitToNotArchive(HoodieActiveTimeline 
activeTimeline, HoodieTableMetaClient metaClient) throws IOException {
+// if clean policy is based on file versions, earliest commit to not 
archive may not be set. So, we have to explicitly check the savepoint timeline
+// and guard against the first one.
+String earliestInstantToNotArchive = 
table.getCompletedSavepointTimeline().firstInstant().map(HoodieInstant::getTimestamp).orElse(null);
+if (config.getCleanerPolicy() != 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS) {
+  Option cleanInstantOpt =
+  
activeTimeline.getCleanerTimeline().filterCompletedInstants().lastInstant();
+  if (cleanInstantOpt.isPresent()) {
+HoodieInstant cleanInstant = cleanInstantOpt.get();
+String cleanerEarliestInstantToNotArchive = 
CleanerUtils.getCleanerPlan(metaClient, cleanInstant.isRequested()
+? cleanInstant
+: 
HoodieTimeline.getCleanRequestedInstant(cleanInstant.getTimestamp()))
+
.getExtraMetadata().getOrDefault(CleanPlanner.EARLIEST_COMMIT_TO_NOT_ARCHIVE, 
null);

Review Comment:
   getExtraMetadata() can return null and will NPE here. Check if the map 
exists before doing getOrDefault



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-06-13 Thread via GitHub


danny0405 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2165026195

   @nsivabalan Is this the fix we talked before for savepoint deletion? Do we 
introduce cyclic dependency for clean and archiving.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-06-12 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2163201030

   
   ## CI report:
   
   * 22d7e90f753b9cd7374312c8ff73851453cf9736 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24366)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-06-12 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2163084989

   
   ## CI report:
   
   * 22d7e90f753b9cd7374312c8ff73851453cf9736 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24366)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-06-12 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2163065510

   
   ## CI report:
   
   * 22d7e90f753b9cd7374312c8ff73851453cf9736 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org