[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-09 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1188382134


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   Done, removed the commitIndex interface. Please review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-09 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1188233833


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   Yeah sure, let me add the code change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-09 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1188233833


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   Yeah sure, let me add the patch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-07 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186796503


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   Yeah, in updateLocation itself we can't ensure atomicity of the clustering operation.
   In updateLocation we are already creating metadata (through saveMetadata); we can treat that metadata as in-flight metadata, and loadMetadata will decide whether that metadata can be committed or not. I think we can use this mechanism to ensure consistency of the metadata.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-06 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186689133


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   Let me know your thoughts on this. The idea is to create a commit marker file once the clustering operation gets committed.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-06 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186717343


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   One thing we can do, if we don't want to use the commitIndexMetadata interface, is to let the loadMetadata function create the commit marker file when the reader asks for updated metadata: if the replace commit is on the active timeline it will create the marker file for it, otherwise the commit marker recovery function (recommitMetadataFile) will decide whether this metadata file is the latest and create the marker file if it is.
   
   The only cost of this is listing all file groups for that partition and comparing them against the metadata file, and this is done only once, when the commit file is not present.
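   A minimal, self-contained sketch of the marker-plus-recovery idea described above. It runs against a plain local directory; the method names (createCommitMarker, recommitIfLatest), the ".commit" suffix and the file-group comparison are illustrative stand-ins, not the actual Hudi implementation.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;

    // Illustrative stand-in for the hashing-metadata partition directory and the proposed
    // <instant>.commit marker; names and layout are hypothetical, not the Hudi classes.
    public class CommitMarkerRecoverySketch {

      // Marker that would be left once the replace commit for <instant> is known to be complete.
      static void createCommitMarker(Path metadataPartitionDir, String instant) throws IOException {
        Path marker = metadataPartitionDir.resolve(instant + ".commit");
        if (!Files.exists(marker)) {
          Files.createFile(marker);
        }
      }

      // Recovery path: when no marker exists and the replace commit is no longer on the active
      // timeline, treat the metadata file as latest only if at least one file group id it
      // references actually exists in storage, and then (re)create the marker.
      static boolean recommitIfLatest(Path metadataPartitionDir, String instant,
                                      Set<String> fileGroupIdsInMetadata,
                                      Set<String> fileGroupIdsOnStorage) throws IOException {
        boolean isLatest = fileGroupIdsInMetadata.stream().anyMatch(fileGroupIdsOnStorage::contains);
        if (isLatest) {
          createCommitMarker(metadataPartitionDir, instant);
        }
        return isLatest;
      }
    }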



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-06 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186717343


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   One thing we can do, if we don't want to use the commitIndexMetadata interface, is to let the loadMetadata function create the commit marker file when the reader asks for updated metadata: the commit marker recovery function (recommitMetadataFile) will decide whether this metadata file is the latest and create the marker file if it is.
   
   The only cost of this is listing all file groups for that partition and comparing them against the metadata file, and this is done only once, when the commit file is not present.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-06 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186689133


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   Let me know your thoughts on this. The idea is to create a commit marker file once the clustering operation gets committed.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-06 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186683224


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   updateLocation is a pre-commit operation; that's the reason I haven't used it. I was thinking we could use the commitIndexMetadata interface for the post-commit operation.
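   A minimal sketch of the pre-commit vs. post-commit split mentioned above. The interface and call sites are illustrative placeholders, not the actual HoodieIndex or write-client API; only the ordering matters here.

    // Hypothetical lifecycle: updateLocation runs before the commit is finalized,
    // commitIndexMetadataIfNeeded runs only after the replace commit completes.
    interface IndexLifecycleSketch {
      void updateLocation(String instantTime);
      void commitIndexMetadataIfNeeded(String instantTime);
    }

    class ClusteringCommitFlowSketch {
      private final IndexLifecycleSketch index;

      ClusteringCommitFlowSketch(IndexLifecycleSketch index) {
        this.index = index;
      }

      void completeClustering(String instantTime) {
        index.updateLocation(instantTime);              // pre-commit work, e.g. writing in-flight hashing metadata
        finalizeReplaceCommit(instantTime);             // the replace commit becomes visible on the timeline here
        index.commitIndexMetadataIfNeeded(instantTime); // the proposed post-commit marker creation
      }

      private void finalizeReplaceCommit(String instantTime) {
        // Placeholder for the timeline transition; intentionally empty in this sketch.
      }
    }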



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-06 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186683224


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -154,6 +154,14 @@ public boolean requiresTagging(WriteOperationType operationType) {
   public void close() {
   }
 
+  /***
+   * Updates index metadata of the given table and instant if needed.
+   * @param table The committed table.
+   * @param hoodieInstant The instant to commit.
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String hoodieInstant) {
+  }
+

Review Comment:
   updateLocation is a pre-commit operation; that's the reason I haven't used it. I was thinking we could use the commitIndexMetadata interface for the post-commit operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-05 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186018082


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -271,8 +327,114 @@ public Option 
getRecordLocation(HoodieKey key) {
   }
 
   LOG.error("Consistent hashing node has no file group, partition: " + 
partitionPath + ", meta: "
-  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
+  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /***
+   * Create commit marker -> hoodieinstant.commit in metadata folder, 
consistent hashing metadata reader will use it to
+   * identify latest commited file which will have updated commit metadata
+   * @param table
+   * @param hoodieInstant
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String 
hoodieInstant) {
+Option> instantPlanPair =
+ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(hoodieInstant));
+if (!instantPlanPair.isPresent()) {
+  return;
+}
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  Path metadataPartitionPath = 
FSUtils.getPartitionPath(table.getMetaClient().getHashingMetadataPath(), 
partition);
+  Path metadataFilePath = new Path(metadataPartitionPath, hoodieInstant + 
HASHING_METADATA_FILE_SUFFIX);
+  try {
+if (table.getMetaClient().getFs().exists(metadataFilePath)) {
+  createCommitMarker(table, metadataFilePath, metadataPartitionPath);
+}
+  } catch (IOException e) {
+throw new HoodieIOException("exception while committing hashing 
metadata for path " + metadataFilePath, e);
+  }
+});
+  }
+
+  /***
+   * Create commit marker corresponding to hashing metadata file after post 
commit clustering operation
+   * @param table

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-05 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186017867


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -271,8 +327,114 @@ public Option 
getRecordLocation(HoodieKey key) {
   }
 
   LOG.error("Consistent hashing node has no file group, partition: " + 
partitionPath + ", meta: "
-  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
+  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /***
+   * Create commit marker -> hoodieinstant.commit in metadata folder, 
consistent hashing metadata reader will use it to
+   * identify latest commited file which will have updated commit metadata
+   * @param table
+   * @param hoodieInstant
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String 
hoodieInstant) {
+Option> instantPlanPair =
+ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(hoodieInstant));
+if (!instantPlanPair.isPresent()) {
+  return;
+}
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  Path metadataPartitionPath = 
FSUtils.getPartitionPath(table.getMetaClient().getHashingMetadataPath(), 
partition);
+  Path metadataFilePath = new Path(metadataPartitionPath, hoodieInstant + 
HASHING_METADATA_FILE_SUFFIX);
+  try {
+if (table.getMetaClient().getFs().exists(metadataFilePath)) {
+  createCommitMarker(table, metadataFilePath, metadataPartitionPath);
+}
+  } catch (IOException e) {
+throw new HoodieIOException("exception while committing hashing 
metadata for path " + metadataFilePath, e);
+  }
+});
+  }
+
+  /***
+   * Create commit marker corresponding to hashing metadata file after post 
commit clustering operation
+   * @param table
+   * @param fileStatus
+   * @param partitionPath
+   * @throws IOException
+   */
+  private static void createCommitMarker(HoodieTable table, Path fileStatus, 
Path partitionPath) throws IOException {
+HoodieWrapperFileSystem fs = table.getMetaClient().getFs();
+Path fullPath = new Path(partitionPath, 
getTimestampFromFile(fileStatus.getName()) + 
HASHING_METADATA_COMMIT_FILE_SUFFIX);
+if (fs.exists(fullPath)) {
+  return;
+}
+String metadata = "";
+FileIOUtils.createFileInPath(fs, fullPath, Option.of(metadata.getBytes()));
+  }
+
+  /***
+   * Load consistent hashing metadata from given file
+   * @param table
+   * @param metaFile
+   * @return
+   */
+  private static Option 
loadMetadataFromGivenFile(HoodieTable table, FileStatus metaFile) {
+try {
+  if (metaFile == null) {
+return Option.empty();
+  }
+  byte[] content = 
FileIOUtils.readAsByteArray(table.getMetaClient().getFs().open(metaFile.getPath()));
+  return Option.of(HoodieConsistentHashingMetadata.fromBytes(content));
+} catch (FileNotFoundException e) {
+  return Option.empty();
+} catch (IOException e) {
+  LOG.error("Error when loading hashing metadata, for path: " + 
metaFile.getPath().getName(), e);
+  throw new HoodieIndexException("Error while loading hashing metadata", 
e);
+}
+  }
+
+  /***
+   * COMMIT MARKER RECOVERY JOB.
+   * If particular hashing metadta file doesn't have commit marker then there 
could be a case where clustering is done but post commit marker
+   * creation operation failed. In this case this method will check file group 
id from consistent hashing metadata against storage base file group ids.
+   * if one of the file group matches then we can conclude that this is the 
latest metadata file.
+   * Note : we will end up calling this method if there is no marker file and 
no replace commit on active timeline, if replace commit is not present on
+   * active timeline that means old file group id's before clustering 
operation got cleaned and only new file group id's  of current clustering 
operation
+   * are present on the disk.
+   * @param table

Review Comment:
   done



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -271,8 +327,114 @@ public Option 
getRecordLocation(HoodieKey key) {
   }
 
   LOG.error("Consistent hashing node has no file group, partition: " + 
partitionPath + ", meta: "
-  + 

[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-05 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1186010748


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -271,8 +327,114 @@ public Option 
getRecordLocation(HoodieKey key) {
   }
 
   LOG.error("Consistent hashing node has no file group, partition: " + 
partitionPath + ", meta: "
-  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
+  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /***
+   * Create commit marker -> hoodieinstant.commit in metadata folder, 
consistent hashing metadata reader will use it to
+   * identify latest commited file which will have updated commit metadata
+   * @param table
+   * @param hoodieInstant
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String 
hoodieInstant) {
+Option> instantPlanPair =
+ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(hoodieInstant));
+if (!instantPlanPair.isPresent()) {
+  return;
+}
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  Path metadataPartitionPath = 
FSUtils.getPartitionPath(table.getMetaClient().getHashingMetadataPath(), 
partition);
+  Path metadataFilePath = new Path(metadataPartitionPath, hoodieInstant + 
HASHING_METADATA_FILE_SUFFIX);
+  try {
+if (table.getMetaClient().getFs().exists(metadataFilePath)) {
+  createCommitMarker(table, metadataFilePath, metadataPartitionPath);
+}
+  } catch (IOException e) {
+throw new HoodieIOException("exception while committing hashing 
metadata for path " + metadataFilePath, e);
+  }
+});
+  }
+
+  /***
+   * Create commit marker corresponding to hashing metadata file after post 
commit clustering operation
+   * @param table
+   * @param fileStatus
+   * @param partitionPath
+   * @throws IOException
+   */
+  private static void createCommitMarker(HoodieTable table, Path fileStatus, 
Path partitionPath) throws IOException {
+HoodieWrapperFileSystem fs = table.getMetaClient().getFs();
+Path fullPath = new Path(partitionPath, 
getTimestampFromFile(fileStatus.getName()) + 
HASHING_METADATA_COMMIT_FILE_SUFFIX);
+if (fs.exists(fullPath)) {
+  return;
+}
+String metadata = "";
+FileIOUtils.createFileInPath(fs, fullPath, Option.of(metadata.getBytes()));
+  }
+
+  /***
+   * Load consistent hashing metadata from given file
+   * @param table

Review Comment:
   sure 



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -271,8 +327,114 @@ public Option 
getRecordLocation(HoodieKey key) {
   }
 
   LOG.error("Consistent hashing node has no file group, partition: " + 
partitionPath + ", meta: "
-  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
+  + 
partitionToIdentifier.get(partitionPath).getMetadata().getFilename() + ", 
record_key: " + key.toString());
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /***
+   * Create commit marker -> hoodieinstant.commit in metadata folder, 
consistent hashing metadata reader will use it to
+   * identify latest commited file which will have updated commit metadata
+   * @param table
+   * @param hoodieInstant
+   */
+  public void commitIndexMetadataIfNeeded(HoodieTable table, String 
hoodieInstant) {
+Option> instantPlanPair =
+ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(hoodieInstant));
+if (!instantPlanPair.isPresent()) {
+  return;
+}
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  Path metadataPartitionPath = 
FSUtils.getPartitionPath(table.getMetaClient().getHashingMetadataPath(), 
partition);
+  Path metadataFilePath = new Path(metadataPartitionPath, hoodieInstant + 

[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-04 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1185678542


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +280,49 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateArchivalDependentIndexMetadata(HoodieTable 
table,List hoodieArchivalInstants) {
+Map partitionVisiteddMap = new HashMap<>();
+// Update metadata for replace commit which are going to get archived.
+Stream hoodieListOfReplacedInstants = 
hoodieArchivalInstants.stream().filter(instane -> 
instane.getAction().equals(REPLACE_COMMIT_ACTION));
+hoodieListOfReplacedInstants.forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {

Review Comment:
   @danny0405 @SteNicholas modified the approach.
   We now create a commit marker file for the completed clustering operation post commit.
   The reader will use the commit marker as the indicator for the latest consistent hashing metadata.
   If creating the commit marker fails post commit, a recovery job will decide whether the metadata file is the valid latest metadata and create the marker file.
   
   Please review.
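   A minimal sketch of the writer-side step described above, against a plain directory layout. The directory structure, the ".hashing_meta"/".commit" suffixes and the partition list are illustrative stand-ins for the real hashing-metadata path and the clustering-plan metadata.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class PostCommitMarkerSketch {

      // After the replace commit completes, leave a marker next to each hashing metadata
      // file the clustering writer produced; idempotent so a recovery job can safely retry it.
      static void commitIndexMetadataIfNeeded(Path hashingMetadataRoot, String instantTime,
                                              List<String> clusteredPartitions) throws IOException {
        for (String partition : clusteredPartitions) {
          Path metadataFile = hashingMetadataRoot.resolve(partition).resolve(instantTime + ".hashing_meta");
          Path marker = hashingMetadataRoot.resolve(partition).resolve(instantTime + ".commit");
          if (Files.exists(metadataFile) && !Files.exists(marker)) {
            Files.createFile(marker);
          }
        }
      }
    }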



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-02 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1182233786


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +280,49 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateArchivalDependentIndexMetadata(HoodieTable 
table,List hoodieArchivalInstants) {
+Map partitionVisiteddMap = new HashMap<>();
+// Update metadata for replace commit which are going to get archived.
+Stream hoodieListOfReplacedInstants = 
hoodieArchivalInstants.stream().filter(instane -> 
instane.getAction().equals(REPLACE_COMMIT_ACTION));
+hoodieListOfReplacedInstants.forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {

Review Comment:
   Okay, I will add the trigger in the clustering operation itself. Let me see how we can do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-02 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1182220542


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +280,49 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateArchivalDependentIndexMetadata(HoodieTable 
table,List hoodieArchivalInstants) {
+Map partitionVisiteddMap = new HashMap<>();
+// Update metadata for replace commit which are going to get archived.
+Stream hoodieListOfReplacedInstants = 
hoodieArchivalInstants.stream().filter(instane -> 
instane.getAction().equals(REPLACE_COMMIT_ACTION));
+hoodieListOfReplacedInstants.forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {

Review Comment:
   @danny0405, understood. One question: let's say the transaction fails while transitioning the state from inflight to complete for the replace commit, does the **update to the metadata file (default file 0*.meta) operation get reverted**?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-02 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1182163517


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +280,49 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateArchivalDependentIndexMetadata(HoodieTable 
table,List hoodieArchivalInstants) {
+Map partitionVisiteddMap = new HashMap<>();
+// Update metadata for replace commit which are going to get archived.
+Stream hoodieListOfReplacedInstants = 
hoodieArchivalInstants.stream().filter(instane -> 
instane.getAction().equals(REPLACE_COMMIT_ACTION));
+hoodieListOfReplacedInstants.forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {

Review Comment:
   Let me give a few examples which can cause inconsistency.
   
   Example 1: updating the metadata before the clustering commit
   1 - write the data files to storage
   2 - update the metadata
   3 - commit the clustering operation on the Hudi timeline by creating the replace commit
   
   Now let's say **the 2nd operation succeeds but the 3rd operation fails**. In this case **the metadata got synced but the clustering failed, so all subsequent write operations will read from the metadata file synced by the failed clustering.**
   
   Example 2: updating the metadata after the clustering commit
   1 - write the data files to storage
   2 - commit the clustering operation on the Hudi timeline by creating the replace commit
   3 - update the metadata
   
   Now let's say the 3rd operation of updating the metadata fails. Then the latest committed metadata file will not be in sync with the default metadata file (000*.meta), and there is no scheduled mechanism to bring it back in sync. Hence, once archival removes the replace commit related to the latest metadata file from the timeline, all writers will start reading from the default metadata file (000*.meta), and this will cause data duplication.
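   As a rough illustration of how a reader could cope with failure modes like the two above, here is a self-contained sketch: a hashing metadata file is trusted only if its instant is a completed replace commit on the active timeline or a commit marker exists for it, otherwise the reader falls back to the previous committed file. Names are hypothetical, not the Hudi API.

    import java.util.List;
    import java.util.Optional;
    import java.util.Set;

    public class MetadataSelectionSketch {

      // Walk metadata instants from newest to oldest and return the first one that is provably committed.
      static Optional<String> pickCommittedMetadata(List<String> metadataInstantsNewestFirst,
                                                    Set<String> completedReplaceCommits,
                                                    Set<String> commitMarkers) {
        return metadataInstantsNewestFirst.stream()
            .filter(instant -> completedReplaceCommits.contains(instant) || commitMarkers.contains(instant))
            .findFirst();
      }
    }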



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-02 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1182163517


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +280,49 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateArchivalDependentIndexMetadata(HoodieTable 
table,List hoodieArchivalInstants) {
+Map partitionVisiteddMap = new HashMap<>();
+// Update metadata for replace commit which are going to get archived.
+Stream hoodieListOfReplacedInstants = 
hoodieArchivalInstants.stream().filter(instane -> 
instane.getAction().equals(REPLACE_COMMIT_ACTION));
+hoodieListOfReplacedInstants.forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {

Review Comment:
   Let me give a few examples which can cause inconsistency.
   
   Example 1: updating the metadata before the clustering commit
   1 - write the data files to storage
   2 - update the metadata
   3 - commit the clustering operation on the Hudi timeline by creating the replace commit
   
   Now let's say **the 2nd operation succeeds but the 3rd operation fails**. In this case **the metadata got synced but the clustering failed, so all subsequent write operations will read the metadata from the failed clustering operation.**
   
   Example 2: updating the metadata after the clustering commit
   1 - write the data files to storage
   2 - commit the clustering operation on the Hudi timeline by creating the replace commit
   3 - update the metadata
   
   Now let's say the 3rd operation of updating the metadata fails. Then the latest committed metadata file will not be in sync with the default metadata file (000*.meta), and there is no scheduled mechanism to bring it back in sync. Hence, once archival removes the replace commit related to the latest metadata file from the timeline, all writers will start reading from the default metadata file (000*.meta), and this will cause data duplication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-05-01 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1181485921


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +280,49 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateArchivalDependentIndexMetadata(HoodieTable 
table,List hoodieArchivalInstants) {
+Map partitionVisiteddMap = new HashMap<>();
+// Update metadata for replace commit which are going to get archived.
+Stream hoodieListOfReplacedInstants = 
hoodieArchivalInstants.stream().filter(instane -> 
instane.getAction().equals(REPLACE_COMMIT_ACTION));
+hoodieListOfReplacedInstants.forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {

Review Comment:
   @danny0405, there are consistency issues we might face if we try updating the metadata after finishing the clustering commit.
   What if the underlying file system is down and updateMetadata fails to sync the metadata? Then there is no mechanism to bring it in sync with the latest committed metadata, archival will eventually remove the replace commit, and we end up in an inconsistent state.
   On the other hand, in the archival process it will eventually be in sync with the committed metadata before the replace commit gets archived.
   I think the consistent hashing metadata has a strong dependency on the archival process, as it depends on the replace commit on the active timeline to load the metadata.
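   A rough sketch of the archival-time sync argued for here: before the replace commit leaves the active timeline, copy the latest committed hashing metadata over the default file so readers that only see the default file stay consistent. The paths, the all-zero default file name and the ".hashing_meta" suffix are illustrative, and the copy is idempotent so the next archival trigger can simply retry it.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class ArchivalSyncSketch {

      static void syncDefaultMetadataBeforeArchival(Path metadataPartitionDir, String latestCommittedInstant)
          throws IOException {
        Path latest = metadataPartitionDir.resolve(latestCommittedInstant + ".hashing_meta");
        Path defaultFile = metadataPartitionDir.resolve("00000000000000.hashing_meta");
        if (Files.exists(latest)) {
          // Overwriting with the same content is harmless, so a failed run can be retried later.
          Files.copy(latest, defaultFile, StandardCopyOption.REPLACE_EXISTING);
        }
      }
    }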
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-28 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1180159210


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   @SteNicholas, done. Please review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-28 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1180014232


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   If we rename the interface method to **updateArchivalDependentIndexMetadata** and call it in the **archiveIfRequired** method, will that be okay?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-27 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1179895882


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   @SteNicholas, yeah, it can be invoked there, but I see a few problems with it.
   What if the underlying file system is down and **updateMetadata** fails to sync the metadata? Then there is no mechanism to bring it in sync with the latest committed metadata, archival will eventually remove the replace commit, and we end up in an inconsistent state.
   On the other hand, in the **archival process it will eventually be in sync with the committed metadata** before the replace commit gets archived.
   I think the consistent hashing metadata has a strong dependency on the archival process, as it depends on the replace commit on the active timeline to load the metadata.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-27 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1178710798


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   @SteNicholas, using postCommit has some flaws and concerns. If the post-commit operation fails for some reason, then the metadata sync will not happen and the metadata will remain in an inconsistent state.
   But in the case of archival, it will get updated on the next archival trigger.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177555612


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   @SteNicholas, understood your concern. Should we run the metadata update in the postCommit of the clustering operation?
   https://github.com/apache/hudi/blob/b690346a700121124e2da8eb0013674ba7a0d719/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java#L522
   
   Would that make sense?
   A new metadata file gets created only by the clustering operation in the consistent hashing engine.
   But there is a concern: what if the post-commit operation fails for some reason? Then the metadata sync will not happen, and the metadata will remain in an inconsistent state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177555612


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   @SteNicholas, understood your concern. Should we run the metadata update in the postCommit of the clustering operation?
   https://github.com/apache/hudi/blob/b690346a700121124e2da8eb0013674ba7a0d719/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java#L522
   
   Would that make sense?
   A new metadata file gets created only by the clustering operation in the consistent hashing engine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177394054


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   **getInstantsToArchive** is where the archival process gets the list of commits to archive. What issues do you see with making the **metadata update call after getInstantsToArchive**? Just wanted to understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177394054


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   **getInstantsToArchive** is where the archival process gets the list of commits to archive. What issue do you see with making the metadata update call after **getInstantsToArchive**?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   **getInstantsToArchive** is where the archival process gets the list of commits to archive. What issue do you see with making the metadata update call after **getInstantsToArchive**? Just wanted to understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177389386


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177386709


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting 
archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, 
Option.of(hoodieOldestInstantToArchive));

Review Comment:
   Okay, so where should this function get called from?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177385951


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = 
Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = 
getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = 
Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = 
commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> 
maxInstant.getTimestamp())).orElse(null);
+/**

Review Comment:
   Sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1176005518


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -441,6 +441,8 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   Option oldestInstantToRetainForClustering =
   
ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient());
 
+  table.getIndex().updateMetadata(table);
+

Review Comment:
   @SteNicholas, makes sense. Updated the code to make the call after **getCommitInstantsToArchive**.
   Please review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175238242


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +279,62 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+Option hoodieOldestReplaceInstantToKeep = 
getOldestInstantToRetain(table);

Review Comment:
   Removed the redundant code and modified the interface.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175237410


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +279,65 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+Option hoodieOldestReplaceInstantToKeep = 
getOldestInstantToRetain(table);
+// Update metadata for replace commit which are going to get archived.
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline().filter(instant ->
+hoodieOldestReplaceInstantToKeep.map(replaceInstantToKeep -> 
HoodieTimeline.compareTimestamps(instant.getTimestamp(), LESSER_THAN, 
replaceInstantToKeep.getTimestamp())).orElse(true));
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {
+  try {
+overWriteMetadata(table, 
hoodieConsistentHashingMetadataOption.get(), HoodieTimeline.INIT_INSTANT_TS + 
HASHING_METADATA_FILE_SUFFIX);
+  } catch (IOException e) {
+throw new RuntimeException(e);
+  }
+}
+partitionVisiteddMap.put(partition, Boolean.TRUE);
+  }
+});
+  }
+});
+  }
+
+  private Option getOldestInstantToRetain(HoodieTable table) {
+try {
+  Option oldestInstantToRetainForClustering =
+  
ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient());
+  return oldestInstantToRetainForClustering;
+} catch (IOException e) {
+  LOG.error("Error while getting oldest instant to retain info: ", e);
+  return Option.empty();
+}
+  }
+
+  private boolean overwriteMetadata(HoodieTable table, 
HoodieConsistentHashingMetadata metadata, String fileName) throws IOException {
+HoodieWrapperFileSystem fs = table.getMetaClient().getFs();
+Path dir = 
FSUtils.getPartitionPath(table.getMetaClient().getHashingMetadataPath(), 
metadata.getPartitionPath());
+Path fullPath = new Path(dir, fileName);
+try (FSDataOutputStream fsOut = fs.create(fullPath, true)) {
+   byte[] bytes = metadata.toBytes();
+   fsOut.write(bytes);
+}
+byte[] bytes = metadata.toBytes();

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175227516


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +279,62 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+Option hoodieOldestReplaceInstantToKeep = 
getOldestInstantToRetain(table);

Review Comment:
   Yeah, I can make the change in the interface. Let me do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175226808


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +279,62 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+Option hoodieOldestReplaceInstantToKeep = 
getOldestInstantToRetain(table);
+// Update metadata for replace commit which are going to get archived.
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline().filter(instant ->
+hoodieOldestReplaceInstantToKeep.map(replaceInstantToKeep -> 
HoodieTimeline.compareTimestamps(instant.getTimestamp(), LESSER_THAN, 
replaceInstantToKeep.getTimestamp())).orElse(true));
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =
+  ClusteringUtils.getClusteringPlan(table.getMetaClient(), 
HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
+  if (instantPlanPair.isPresent()) {
+HoodieClusteringPlan plan = instantPlanPair.get().getRight();
+List> partitionMapList = 
plan.getInputGroups().stream().map(HoodieClusteringGroup::getExtraMetadata).collect(Collectors.toList());
+partitionMapList.stream().forEach(partitionMap -> {
+  String partition = 
partitionMap.get(SparkConsistentBucketClusteringPlanStrategy.METADATA_PARTITION_KEY);
+  if (!partitionVisiteddMap.containsKey(partition)) {
+Option 
hoodieConsistentHashingMetadataOption = loadMetadata(table, partition);
+if (hoodieConsistentHashingMetadataOption.isPresent()) {
+  try {
+overWriteMetadata(table, 
hoodieConsistentHashingMetadataOption.get(), HoodieTimeline.INIT_INSTANT_TS + 
HASHING_METADATA_FILE_SUFFIX);
+  } catch (IOException e) {
+throw new RuntimeException(e);
+  }
+}
+partitionVisiteddMap.put(partition, Boolean.TRUE);
+  }
+});
+  }
+});
+  }
+
+  private Option getOldestInstantToRetain(HoodieTable table) {
+try {
+  Option oldestInstantToRetainForClustering =
+  
ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient());
+  return oldestInstantToRetainForClustering;
+} catch (IOException e) {
+  LOG.error("Error while getting oldest instant to retain info: ", e);
+  return Option.empty();
+}
+  }
+
+  private boolean overWriteMetadata(HoodieTable table, 
HoodieConsistentHashingMetadata metadata, String fileName) throws IOException {
+HoodieWrapperFileSystem fs = table.getMetaClient().getFs();
+Path dir = 
FSUtils.getPartitionPath(table.getMetaClient().getHashingMetadataPath(), 
metadata.getPartitionPath());
+Path fullPath = new Path(dir, fileName);
+FSDataOutputStream fsOut = fs.create(fullPath, true);
+byte[] bytes = metadata.toBytes();
+fsOut.write(bytes);
+fsOut.close();

Review Comment:
   Got it, thanks. Committed the suggestions.
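   For reference, a minimal sketch of the resulting shape of the write after the suggestion, reusing only the helpers already visible in the diff above (the boolean return value is an assumption of this sketch):

```java
// Write the hashing metadata with try-with-resources so the output stream is
// closed even if the write throws, instead of closing it manually.
private boolean overwriteMetadata(HoodieTable table,
                                  HoodieConsistentHashingMetadata metadata,
                                  String fileName) throws IOException {
  HoodieWrapperFileSystem fs = table.getMetaClient().getFs();
  Path dir = FSUtils.getPartitionPath(
      table.getMetaClient().getHashingMetadataPath(), metadata.getPartitionPath());
  Path fullPath = new Path(dir, fileName);
  try (FSDataOutputStream fsOut = fs.create(fullPath, true)) {
    fsOut.write(metadata.toBytes());
  }
  return true;
}
```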



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175181156


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   @SteNicholas, updated the code to update metadata only for the replace commits that are eligible for archival in the current run. This avoids unnecessary metadata updates.
   Please review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175074684


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   @SteNicholas, using all replace commits to update the metadata will not cause any issue. It's just that in some cases it will update the default metadata file for particular partitions multiple times until the commit gets archived.
   
   Ideally it should update metadata only for the replace commits that are going to be archived, i.e. **replace commits before oldestInstantToRetainForClustering**.
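   Restated in readable form, a sketch of that filter with the generics restored; it assumes the same getOldestInstantToRetain helper and HoodieTimeline comparison utilities shown in the diff above:

```java
// Restrict the sync to replace commits the archiver is about to remove from
// the active timeline, i.e. those before oldestInstantToRetainForClustering.
Option<HoodieInstant> oldestToRetain = getOldestInstantToRetain(table);
HoodieTimeline archivableReplaceCommits = table.getActiveTimeline()
    .getCompletedReplaceTimeline()
    .filter(instant -> oldestToRetain
        .map(keep -> HoodieTimeline.compareTimestamps(
            instant.getTimestamp(), HoodieTimeline.LESSER_THAN, keep.getTimestamp()))
        .orElse(true));  // nothing to retain -> every replace commit qualifies
archivableReplaceCommits.getInstants()
    .forEach(instant -> { /* sync hashing metadata for this commit's partitions */ });
```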




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175074684


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   @SteNicholas, using all replace commits to update the metadata will not cause any issue. It's just that in some cases it will update the default metadata file for a particular partition multiple times until the commit gets archived.
   
   Ideally it should update metadata only for the replace commits that are going to be archived, i.e. **replace commits before oldestInstantToRetainForClustering**.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1175074684


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   @SteNicholas, using all replace commits to update the metadata will not cause any issue. It's just that in some cases it will update the default metadata file multiple times until the commit gets archived.
   
   Ideally it should update metadata only for the replace commits that are going to be archived, i.e. **replace commits before oldestInstantToRetainForClustering**.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174966444


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   @SteNicholas, I don't think so. The **loadMetadata** function refers to the active timeline (replace commits) to load metadata for readers and writers, hence we are only interested in replace commits from the active timeline.
   And this code will make sure the metadata is kept in sync before the replace commit is archived.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174966444


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   @SteNicholas, I don't think so. The **loadMetadata** function refers to the active timeline (replace commits) to load metadata for readers and writers, hence we are only interested in replace commits from the active timeline.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174898317


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   Oh okay, got it. I don't see any issue with the insert overwrite action: for an insert overwrite, **ClusteringUtils.getClusteringPlan** will return an empty plan, and I just added a check to ignore the empty plan.
   
   Thanks for pointing it out.
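   For clarity, the guard discussed above, sketched with the generics restored (same logic as the diff, not new behaviour):

```java
// insert_overwrite also produces replace commits, but getClusteringPlan
// returns an empty Option for them, so only real clustering commits are synced.
Option<Pair<HoodieInstant, HoodieClusteringPlan>> instantPlanPair =
    ClusteringUtils.getClusteringPlan(table.getMetaClient(),
        HoodieTimeline.getReplaceCommitRequestedInstant(instant.getTimestamp()));
if (instantPlanPair.isPresent()) {
  HoodieClusteringPlan plan = instantPlanPair.get().getRight();
  // ... update the default hashing metadata for the partitions in this plan ...
}
```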
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174898317


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   Oh okay, got it. I don't see any issue with the insert overwrite action: for an insert overwrite, **ClusteringUtils.getClusteringPlan** will return an empty plan, and I just added a check to ignore the empty plan.
   
   Thanks for pointing it out.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-24 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174790151


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   No, I think every clustering operation on the consistent hashing index engine will create a new metadata file, **<instant>.hashing_meta**; this particular piece of code brings **00.hashing_meta** in sync with **<instant>.hashing_meta** before triggering archival, since the replace commit might get archived off the active timeline.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-23 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174790151


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   No, I think every clustering operation on the consistent hashing index engine will create a new metadata file, **<instant>.hashing_meta**; this particular piece of code brings **00.hashing_meta** in sync with **<instant>.hashing_meta** before triggering archival, since the replace commit might get archived off the active timeline.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-23 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174790151


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   No, I think every clustering operation on the consistent hashing index engine will create a new metadata file, **<instant>.hashing_meta**; this particular piece of code brings **00.hashing_meta** in sync with **<instant>.hashing_meta** before triggering archival, to avoid inconsistency in the bucket index file groups.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-23 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1174790151


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -275,4 +278,46 @@ public Option 
getRecordLocation(HoodieKey key) {
   throw new HoodieIndexException("Failed to getBucket as hashing node has 
no file group");
 }
   }
+
+  /**
+   * Update default metadata file(00.hashing_meta) with the latest 
committed metadata file so that default file will be in sync
+   * with latest commit.
+   *
+   * @param table
+   */
+  public void updateMetadata(HoodieTable table) {
+Map partitionVisiteddMap = new HashMap<>();
+HoodieTimeline hoodieTimeline = 
table.getActiveTimeline().getCompletedReplaceTimeline();
+hoodieTimeline.getInstants().forEach(instant -> {
+  Option> instantPlanPair =

Review Comment:
   No, I think every clustering operation on the consistent hashing index engine will create a new metadata file, **<instant>.hashing_meta**; this particular piece of code brings **00.hashing_meta** in sync with **<instant>.hashing_meta** before triggering archival.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-20 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1172085324


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -441,6 +441,8 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   Option oldestInstantToRetainForClustering =
   
ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient());
 
+  table.getIndex().updateMetadata(table);
+

Review Comment:
   The archival process will archive replace commits from the active timeline; once it does that, all the Hudi writers will start referring to the default metadata index file, **00.hashing_meta** (**check the function loadMetadata(HoodieTable table, String partition)**). That's the reason it is necessary to trigger the update metadata function before archival, so that it brings the 00.hashing_meta file in sync with the latest metadata commit file.
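   To make that fallback explicit, a paraphrased sketch of the lookup described above; this is not the actual implementation, and the helper names are hypothetical:

```java
// Sketch of how the hashing metadata is resolved for a partition.
Option<HoodieConsistentHashingMetadata> loadMetadataSketch(HoodieTable table, String partition) {
  // Timestamp encoded in the newest <instant>.hashing_meta file, if any
  // (findLatestHashingMetaTimestamp / readHashingMetaFile / readDefaultHashingMetaFile
  // are hypothetical helpers standing in for the real file listing and parsing).
  Option<String> latestTs = findLatestHashingMetaTimestamp(table, partition);
  if (latestTs.isPresent()
      && table.getActiveTimeline().getCompletedReplaceTimeline()
              .containsInstant(latestTs.get())) {
    // The clustering commit that produced the file is still on the active
    // timeline, so the committed metadata can be used directly.
    return readHashingMetaFile(table, partition, latestTs.get());
  }
  // Otherwise (e.g. the replace commit was archived) the writer falls back to
  // the default 00.hashing_meta file; if that file was never synced, the writer
  // sees a stale bucket layout, which is what produces the duplicate data.
  return readDefaultHashingMetaFile(table, partition);
}
```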



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-19 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1172085324


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -441,6 +441,8 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   Option oldestInstantToRetainForClustering =
   
ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient());
 
+  table.getIndex().updateMetadata(table);
+

Review Comment:
   The archival process will archive replace commit from active timeline, once 
it does that , all the hudi writer will start referring default metadata index 
file that is **00.hashing_meta** , **check the function 
loadMetadata(HoodieTable table, String partition)**.  That's the reason it is 
necessary to trigger the update metadata function before archival , so that it 
will bring latest metadata commit file in sync with 
00.hashing_meta. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohan-uptycs commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-19 Thread via GitHub


rohan-uptycs commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1172085324


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -441,6 +441,8 @@ private Stream getCommitInstantsToArchive() 
throws IOException {
   Option oldestInstantToRetainForClustering =
   
ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient());
 
+  table.getIndex().updateMetadata(table);
+

Review Comment:
   The archival process will archive replace commit from active timeline, once 
it does that , all the hudi writer will start referring default metadata index 
file that is **00.hashing_meta** , **check the function 
loadMetadata(HoodieTable table, String partition)**.  That's the reason it is 
necessary to trigger the update metadata function , so that it will bring 
latest metadata commit file in sync with 00.hashing_meta. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org