RussellSpitzer edited a comment on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845608044


   > Overall looks good to me. I would like to revisit the RewriteStrategy idea 
a bit. Because we are basically going to rewrite and remove all the delete 
files in this action along the way, this is what I see as the method for 
running major compaction.
   > 
   We cannot delete delete files in this action, and we are not doing so, 
because it's actually much more difficult to find out which delete files are no 
longer in use than just checking which ones are referenced by the FileScanTasks 
for the files we are looking at. For example, we may determine that File A 
should be read in conjunction with Delete File 1, and that it should be split 
into multiple files. We cannot remove Delete File 1 because it may also apply 
to Files B and C, which we didn't even consider.
   
   To actually determine which delete files can be removed, you must check each 
delete file against every live data file (not the other way around), and only 
if the delete file cannot be applied to any live data file can you mark it as 
non-live (trying to avoid saying "deleted delete file").
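
   The check described above can be sketched as a toy model. This is plain 
Python, not the Iceberg API: the `DataFile` and `DeleteFile` classes, the 
`applies_to` rule (same partition, data file written no later than the delete) 
and the sequence numbers are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    sequence_number: int


@dataclass(frozen=True)
class DeleteFile:
    path: str
    partition: str
    sequence_number: int


def applies_to(delete_file, data_file):
    # Assumed, simplified rule: a delete file can affect a data file in the
    # same partition that was written no later than the delete file.
    return (delete_file.partition == data_file.partition
            and data_file.sequence_number <= delete_file.sequence_number)


def removable_delete_files(delete_files, live_data_files):
    # A delete file is removable only if it applies to NO live data file,
    # so each delete file is checked against every live data file.
    return [d for d in delete_files
            if not any(applies_to(d, f) for f in live_data_files)]


# Files A, B and C share a partition; Delete File 1 was written after them.
file_b = DataFile("B", "p0", 1)
file_c = DataFile("C", "p0", 1)
delete_1 = DeleteFile("delete-1", "p0", 2)

# The rewrite only looked at File A and replaced it with A1 and A2, which are
# assumed here to get a newer sequence number than Delete File 1.
rewritten = [DataFile("A1", "p0", 3), DataFile("A2", "p0", 3)]

# B and C are still live, so Delete File 1 must be kept.
removable_now = removable_delete_files([delete_1], rewritten + [file_b, file_c])

# Only once B and C are gone as well does Delete File 1 become removable.
removable_later = removable_delete_files([delete_1], rewritten)
```

   In this model `removable_now` is empty while `removable_later` contains 
Delete File 1, which is why looking only at the files being rewritten is not 
enough to decide that a delete file is dead.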
   
   Because of that complexity, we were hoping to move DeleteFile compaction and 
cleanup to a separate action entirely.




