RussellSpitzer edited a comment on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845608044

> Overall looks good to me. I would like to revisit the RewriteStrategy idea a bit. Because we are basically going to rewrite and remove all the delete files in this action along the way, this is what I see as the method for running major compaction.

> We cannot and are not deleting delete files in this action, because finding out which delete files are no longer in use is much more difficult than checking which ones are referred to by the FileScanTasks for the files we are looking at. For example, we may determine that File A should be read in conjunction with Delete File 1, and that it should be split into multiple files. We cannot remove Delete File 1, because it may also apply to Files B and C, which we didn't even consider. To actually determine which delete files can be removed, you must check each delete file against every valid data file (not the other way around), and only if the delete file cannot be applied to any live file can you mark it as non-live (I'm trying to avoid saying "deleted delete file"). Because of that complexity, we were hoping to move DeleteFiles compaction and cleanup to another action entirely.
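
The liveness check described in the comment above can be illustrated with a small sketch. This is a simplified model, not Iceberg's API: the `DataFile` and `DeleteFile` classes, the `applies_to` rule (same partition, data file written no later than the delete), and the assumption that rewritten files receive a newer sequence number are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    partition: str
    sequence_number: int


@dataclass(frozen=True)
class DeleteFile:
    name: str
    partition: str
    sequence_number: int


def applies_to(delete: DeleteFile, data: DataFile) -> bool:
    # Simplifying assumption: a delete file affects data files in the same
    # partition that were written no later than the delete file itself.
    return (delete.partition == data.partition
            and data.sequence_number <= delete.sequence_number)


def removable_deletes(deletes, live_data_files):
    # Check each delete file against EVERY live data file. It is removable
    # only if it applies to none of them.
    return [d for d in deletes
            if not any(applies_to(d, f) for f in live_data_files)]


# Files A, B and C share a partition; Delete File 1 was written after them.
file_b = DataFile("B", "p0", 1)
file_c = DataFile("C", "p0", 1)
delete_1 = DeleteFile("delete-1", "p0", 2)

# Rewriting only A (split into A1 and A2, assumed to get a newer sequence
# number) leaves B and C live, so Delete File 1 must stay.
after_rewriting_a = [DataFile("A1", "p0", 3), DataFile("A2", "p0", 3),
                     file_b, file_c]

# Only once B and C have been rewritten as well does Delete File 1 become
# non-live.
after_rewriting_all = [DataFile("A1", "p0", 3), DataFile("A2", "p0", 3),
                       DataFile("B1", "p0", 3), DataFile("C1", "p0", 3)]

print(removable_deletes([delete_1], after_rewriting_a))
print(removable_deletes([delete_1], after_rewriting_all))
```

The point of the sketch is the direction of the loop: looking only at the delete files referenced by the rewritten File A would suggest Delete File 1 is done with, while iterating from the delete file over all live data files shows it is still needed for B and C.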
