[GitHub] [hudi] nsivabalan commented on a change in pull request #4716: [HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues

nsivabalan commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r797157620

## File path: hudi-common/src/main/avro/HoodieRollbackMetadata.avsc

## @@ -38,14 +38,6 @@
           "type": "long",
           "doc": "Size of this file in bytes"
         }
-      }], "default":null },

Review comment:
hmmm, interesting. Can you try this out explicitly? Write an Avro file using the master branch, and then try to read it using this branch. Or you can just try it out using a standalone Java main class, whichever works. Just wanted to ensure we are good here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
nsivabalan commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r797155269

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/MarkerBasedRollbackStrategy.java

## @@ -90,42 +86,41 @@ public MarkerBasedRollbackStrategy(HoodieTable table, HoodieEngineCo
                 Collections.singletonList(fullDeletePath.toString()),
                 Collections.emptyMap());
           case APPEND:
+            // NOTE: This marker file-path does NOT correspond to a log-file, but rather is a phony
+            //       path serving as a "container" for the following components:
+            //         - Base file's file-id
+            //         - Base file's commit instant
+            //         - Partition path
             return getRollbackRequestForAppend(WriteMarkers.stripMarkerSuffix(markerFilePath));
           default:
             throw new HoodieRollbackException("Unknown marker type, during rollback of " + instantToRollback);
         }
-      }, parallelism).stream().collect(Collectors.toList());
+      }, parallelism);
     } catch (Exception e) {
       throw new HoodieRollbackException("Error rolling back using marker files written for " + instantToRollback, e);
     }
   }

-  protected HoodieRollbackRequest getRollbackRequestForAppend(String appendBaseFilePath) throws IOException {
-    Path baseFilePathForAppend = new Path(basePath, appendBaseFilePath);
+  protected HoodieRollbackRequest getRollbackRequestForAppend(String markerFilePath) throws IOException {
+    Path baseFilePathForAppend = new Path(basePath, markerFilePath);
     String fileId = FSUtils.getFileIdFromFilePath(baseFilePathForAppend);
     String baseCommitTime = FSUtils.getCommitTime(baseFilePathForAppend.getName());
-    String partitionPath = FSUtils.getRelativePartitionPath(new Path(basePath), new Path(basePath, appendBaseFilePath).getParent());
-    Map<FileStatus, Long> writtenLogFileSizeMap = getWrittenLogFileSizeMap(partitionPath, baseCommitTime, fileId);
-    Map<String, Long> writtenLogFileStrSizeMap = new HashMap<>();
-    for (Map.Entry<FileStatus, Long> entry : writtenLogFileSizeMap.entrySet()) {
-      writtenLogFileStrSizeMap.put(entry.getKey().getPath().toString(), entry.getValue());
-    }
-    return new HoodieRollbackRequest(partitionPath, fileId, baseCommitTime, Collections.emptyList(), writtenLogFileStrSizeMap);
+    String relativePartitionPath = FSUtils.getRelativePartitionPath(new Path(basePath), baseFilePathForAppend.getParent());
+    Path partitionPath = FSUtils.getPartitionPath(config.getBasePath(), relativePartitionPath);
+
+    // NOTE: Since we're rolling back incomplete Delta Commit, it only could have appended its
+    //       block to the latest log-file
+    // TODO(HUDI-1517) use provided marker-file's path instead
+    HoodieLogFile latestLogFile = FSUtils.getLatestLogFile(table.getMetaClient().getFs(), partitionPath, fileId,
+        HoodieFileFormat.HOODIE_LOG.getFileExtension(), baseCommitTime).get();
+
+    // NOTE: Markers don't carry information about the cumulative size of the blocks that have been appended,
+    //       therefore we simply stub this value.
+    Map<String, Long> logFilesWithBlocsToRollback =

Review comment:
based on offline discussion, this will cause an issue in storage schemes like HDFS, where a rollback block could be added to an existing log file. Will take it up as a follow-up PR.
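The rewritten getRollbackRequestForAppend above recovers the file-id and base commit instant purely from the marker's file name. As a rough, self-contained illustration of that parsing (assuming Hudi's base-file naming convention `<fileId>_<writeToken>_<instantTime>.<ext>`; this sketch is not Hudi's actual FSUtils implementation):

```java
// Illustrative sketch only: shows how a file-id and commit instant can be read back
// out of a Hudi-style base-file name "<fileId>_<writeToken>_<instantTime>.<ext>".
public class MarkerPathSketch {

  // The file-id is everything before the first '_'.
  public static String getFileId(String fileName) {
    return fileName.substring(0, fileName.indexOf('_'));
  }

  // The commit instant is the third '_'-separated token, with the extension stripped.
  public static String getCommitTime(String fileName) {
    String[] parts = fileName.split("_");
    String last = parts[2];
    int dot = last.indexOf('.');
    return dot >= 0 ? last.substring(0, dot) : last;
  }

  public static void main(String[] args) {
    String name = "4f2b7c21-0d15-4d6c-a35c-0a4a02b2c3de_1-0-1_20220201120000.parquet";
    System.out.println(getFileId(name));      // the file-id (UUID portion)
    System.out.println(getCommitTime(name));  // the base commit instant
  }
}
```

This is why the marker path can serve as a "container" for the file-id, commit instant, and partition path even though it does not point at the actual log-file being rolled back.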
nsivabalan commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r797153919

## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java

## @@ -442,8 +443,11 @@ private void updateTableMetadata(HoodieTable>, JavaRD
           metaClient, config, context, SparkUpgradeDowngradeHelper.getInstance())
           .run(HoodieTableVersion.current(), instantTime);
       metaClient.reloadActiveTimeline();
-      initializeMetadataTable(Option.of(instantTime));
     }
+    // Initialize Metadata Table to make sure it's bootstrapped _before_ the operation,
+    // if it didn't exist before
+    // See https://issues.apache.org/jira/browse/HUDI-3343 for more details
+    initializeMetadataTable(Option.of(instantTime));

Review comment:
based on offline discussion, we are good here. MDT initialization may not kick in if there are any pending operations in the data table.
nsivabalan commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r795095166

## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java

## @@ -442,8 +443,11 @@ private void updateTableMetadata(HoodieTable>, JavaRD
           metaClient, config, context, SparkUpgradeDowngradeHelper.getInstance())
           .run(HoodieTableVersion.current(), instantTime);
       metaClient.reloadActiveTimeline();
-      initializeMetadataTable(Option.of(instantTime));
     }
+    // Initialize Metadata Table to make sure it's bootstrapped _before_ the operation,
+    // if it didn't exist before
+    // See https://issues.apache.org/jira/browse/HUDI-3343 for more details
+    initializeMetadataTable(Option.of(instantTime));

Review comment:
I get the fix here. But let's say there is a failed commit just before this current writer. Wouldn't the MDT bootstrap also include the files from the failed commit? Or are we going to rely on the fact that MDT bootstrap will not happen if there is any pending operation in the data table, so that unless the rollback of the failed commit completes, we won't trigger MDT bootstrap?
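The guard being discussed can be sketched as follows (class and method names here are hypothetical, not Hudi's actual API): metadata-table (MDT) bootstrap is deferred while the data table still has pending instants, so files written by a failed commit are only indexed after that commit has been rolled back.

```java
import java.util.List;

// Hedged sketch of the "no bootstrap while operations are pending" rule discussed
// above; this is an illustration of the idea, not Hudi's real timeline code.
public class MdtBootstrapGuardSketch {

  // A pending instant is one that reached requested/inflight but never completed.
  public static boolean canBootstrapMetadataTable(List<String> pendingInstants) {
    // Bootstrapping while a write is pending could index files from an
    // uncommitted (possibly failed) commit, leading to duplicate records later.
    return pendingInstants.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(canBootstrapMetadataTable(List.of()));           // safe to bootstrap
    System.out.println(canBootstrapMetadataTable(List.of("20220131"))); // defer until rollback completes
  }
}
```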
nsivabalan commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r795093095

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/MarkerBasedRollbackStrategy.java

## @@ -90,42 +86,41 @@ public MarkerBasedRollbackStrategy(HoodieTable table, HoodieEngineCo
                 Collections.singletonList(fullDeletePath.toString()),
                 Collections.emptyMap());
           case APPEND:
+            // NOTE: This marker file-path does NOT correspond to a log-file, but rather is a phony
+            //       path serving as a "container" for the following components:
+            //         - Base file's file-id
+            //         - Base file's commit instant
+            //         - Partition path
             return getRollbackRequestForAppend(WriteMarkers.stripMarkerSuffix(markerFilePath));
           default:
             throw new HoodieRollbackException("Unknown marker type, during rollback of " + instantToRollback);
         }
-      }, parallelism).stream().collect(Collectors.toList());
+      }, parallelism);
     } catch (Exception e) {
       throw new HoodieRollbackException("Error rolling back using marker files written for " + instantToRollback, e);
     }
   }

-  protected HoodieRollbackRequest getRollbackRequestForAppend(String appendBaseFilePath) throws IOException {
-    Path baseFilePathForAppend = new Path(basePath, appendBaseFilePath);
+  protected HoodieRollbackRequest getRollbackRequestForAppend(String markerFilePath) throws IOException {
+    Path baseFilePathForAppend = new Path(basePath, markerFilePath);
     String fileId = FSUtils.getFileIdFromFilePath(baseFilePathForAppend);
     String baseCommitTime = FSUtils.getCommitTime(baseFilePathForAppend.getName());
-    String partitionPath = FSUtils.getRelativePartitionPath(new Path(basePath), new Path(basePath, appendBaseFilePath).getParent());
-    Map<FileStatus, Long> writtenLogFileSizeMap = getWrittenLogFileSizeMap(partitionPath, baseCommitTime, fileId);
-    Map<String, Long> writtenLogFileStrSizeMap = new HashMap<>();
-    for (Map.Entry<FileStatus, Long> entry : writtenLogFileSizeMap.entrySet()) {
-      writtenLogFileStrSizeMap.put(entry.getKey().getPath().toString(), entry.getValue());
-    }
-    return new HoodieRollbackRequest(partitionPath, fileId, baseCommitTime, Collections.emptyList(), writtenLogFileStrSizeMap);
+    String relativePartitionPath = FSUtils.getRelativePartitionPath(new Path(basePath), baseFilePathForAppend.getParent());
+    Path partitionPath = FSUtils.getPartitionPath(config.getBasePath(), relativePartitionPath);
+
+    // NOTE: Since we're rolling back incomplete Delta Commit, it only could have appended its
+    //       block to the latest log-file
+    // TODO(HUDI-1517) use provided marker-file's path instead
+    HoodieLogFile latestLogFile = FSUtils.getLatestLogFile(table.getMetaClient().getFs(), partitionPath, fileId,
+        HoodieFileFormat.HOODIE_LOG.getFileExtension(), baseCommitTime).get();
+
+    // NOTE: Markers don't carry information about the cumulative size of the blocks that have been appended,
+    //       therefore we simply stub this value.
+    Map<String, Long> logFilesWithBlocsToRollback =

Review comment:
I was also thinking if we should fix the way we store the file size in HoodieMetadataPayload:

```
HoodieMetadataFileInfo(long size, boolean isFullSize, boolean isDelete)
```

With CommitMetadata, these will represent delta sizes, and during rollback/restore, these will represent full sizes. We can combine multiple metadata records based on whether each carries a delta or a full size.

## File path: hudi-common/src/main/avro/HoodieRollbackMetadata.avsc

## @@ -38,14 +38,6 @@
           "type": "long",
           "doc": "Size of this file in bytes"
         }
-      }], "default":null },

Review comment:
this may not be backwards compatible while reading rollback metadata written w/ 0.10.0 or previous versions.
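The size-combining rule proposed for HoodieMetadataFileInfo above could be sketched like this (hypothetical names; not Hudi's actual HoodieMetadataPayload merge logic): commit metadata contributes delta sizes that accumulate, while a rollback/restore record carries a full size that resets the running total.

```java
// Sketch of the proposed combine semantics for per-file size records:
// delta records add to the running size, full-size records override it.
public class FileSizeMergeSketch {

  // Mirrors the proposed HoodieMetadataFileInfo(long size, boolean isFullSize, ...).
  public static class FileInfo {
    final long size;
    final boolean isFullSize;

    public FileInfo(long size, boolean isFullSize) {
      this.size = size;
      this.isFullSize = isFullSize;
    }
  }

  // Combine a previously accumulated size with a newer record.
  public static long combine(long previousSize, FileInfo newer) {
    // A full-size record (rollback/restore) overrides; a delta record accumulates.
    return newer.isFullSize ? newer.size : previousSize + newer.size;
  }

  public static void main(String[] args) {
    long size = 0;
    size = combine(size, new FileInfo(100, false)); // delta append -> 100
    size = combine(size, new FileInfo(50, false));  // delta append -> 150
    size = combine(size, new FileInfo(120, true));  // rollback recorded full size -> 120
    System.out.println(size); // prints 120
  }
}
```

Merging this way would let multiple metadata records for the same log file be reconciled deterministically, regardless of whether they came from commits or from rollbacks.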
## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/MarkerBasedRollbackStrategy.java

## @@ -90,42 +86,41 @@ public MarkerBasedRollbackStrategy(HoodieTable table, HoodieEngineCo
                 Collections.singletonList(fullDeletePath.toString()),
                 Collections.emptyMap());
           case APPEND:
+            // NOTE: This marker file-path does NOT correspond to a log-file, but rather is a phony
+            //       path serving as a "container" for the following components:
+            //         - Base file's file-id
+            //         - Base file's commit instant
+            //         - Partition path
             return getRollbackRequestForAppend(WriteMarkers.stripMarkerSuffix(markerFilePath));
           default:
             throw new HoodieRollbackException("Unknown marker type, during rollback of " + instantToRol