satishkotha commented on a change in pull request #2048: URL: https://github.com/apache/hudi/pull/2048#discussion_r488207278
########## File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java ##########

```diff
@@ -301,6 +304,61 @@ private void deleteAnyLeftOverMarkerFiles(JavaSparkContext jsc, HoodieInstant in
     }
   }
 
+  private void deleteReplacedFiles(HoodieInstant instant) {
+    if (!instant.isCompleted()) {
+      // only delete files for completed instants
+      return;
+    }
+
+    TableFileSystemView fileSystemView = this.table.getFileSystemView();
+    ensureReplacedPartitionsLoadedCorrectly(instant, fileSystemView);
+
+    Stream<HoodieFileGroup> fileGroupsToDelete = fileSystemView
```

Review comment:
   Yes, compaction is the primary reason I only recorded the fileId in the replace metadata. When deleting, we can get all file paths that have the same fileId (through the view, or by listing using consolidated metadata) and delete those files. There is a potential race condition, though: compaction might create a new file with a replaced fileId after we have queried for existing files. But because the FileSystemView#get methods do not include replaced file groups, I think this is unlikely to happen. I'm not sure if there are edge cases with long-running compactions. Please suggest any other improvements.
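To make the discussion concrete, here is a minimal, self-contained sketch of the deletion approach described in the comment. All names (`ReplacedFileCleanupSketch`, `VIEW`, `pathsToDelete`) are hypothetical stand-ins, not Hudi APIs: replace metadata records only fileIds, and deletion resolves each fileId to its concrete file paths through a file-system-view-like lookup.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the approach discussed above.
public class ReplacedFileCleanupSketch {

  // Stand-in for the file system view: fileId -> all data files in that file group.
  static final Map<String, List<String>> VIEW = Map.of(
      "fg-1", List.of("part0/fg-1_v1.parquet", "part0/fg-1_v2.parquet"),
      "fg-2", List.of("part0/fg-2_v1.parquet"));

  // Resolve replaced fileIds (as recorded in replace metadata) to the
  // concrete file paths that should be deleted.
  static List<String> pathsToDelete(Set<String> replacedFileIds) {
    return VIEW.entrySet().stream()
        .filter(e -> replacedFileIds.contains(e.getKey()))
        .flatMap(e -> e.getValue().stream())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // All files for fg-1 would be deleted; fg-2 is untouched.
    System.out.println(pathsToDelete(Set.of("fg-1")));
  }
}
```

The race condition mentioned in the comment maps onto this sketch directly: if a compaction commits a new file under a replaced fileId after the lookup but before the deletes run, that file is absent from the snapshot returned by `pathsToDelete` and would survive the cleanup.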