RussellSpitzer commented on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845628652


   > > We cannot and are not deleting delete files in this action because it's 
actually much more difficult to find out which delete files are no longer in 
use than just checking which ones are referred to by the FileScanTasks for the 
files we are looking at.
   > 
   > Yes I am aware of this condition, but the delete files are applied for 
each file scan task anyway, it's just we cannot remove it because of the 
condition you described, and we have to call another action, do double work to 
fully remove the file. Conversely, say we have another action to only remove 
delete files, then we are reading those delete files anyway, and it also feels 
wasteful to me that we have to do another bin pack after deleting those files 
to make the files more optimized, and potentially cause more commit conflicts.
   
   They are only applied to the files we are looking at. Reading the delete 
files is not the expensive part: to determine that a delete file is no longer 
needed, we have to scan through *every* data file it may apply to. After the 
bin pack, we do remove the rewritten files from the pool of files the delete 
files may apply to, but there is no way to check the rest without reading every 
remaining file a delete file applies to. That's why the two are completely 
independent: rewriting files does no work that benefits the delete file 
situation for other files.
   
   Delete file A may touch data files A, B, C, and D.
   We plan for files A and B.
   After running bin pack, we read A and B and write A' and B'; we never touch C 
and D.
   Delete file A may now only touch C and D, which we have not read, so we 
still cannot tell whether it can be removed.
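
   The example above can be sketched as a toy model. This is not the Iceberg 
API: `remaining_targets`, `removable`, and the `may_apply_to` mapping are 
hypothetical names used only for illustration, and the model assumes we already 
know which data files a delete file *may* apply to. It shows that a rewrite only 
shrinks the candidate pool, and that a delete file becomes removable only once 
that pool is empty.

```python
# Toy model of the example above; NOT the Iceberg API.
# All names here are hypothetical and exist only for illustration.

def remaining_targets(may_apply_to, rewritten):
    """Drop rewritten data files from each delete file's candidate pool."""
    return {
        delete_file: targets - rewritten
        for delete_file, targets in may_apply_to.items()
    }


def removable(may_apply_to):
    """Delete files that have no candidate data files left."""
    return sorted(d for d, targets in may_apply_to.items() if not targets)


# Delete file A may touch data files A, B, C, and D.
may_apply_to = {"delete-A": {"A", "B", "C", "D"}}

# Bin pack rewrites A and B only; C and D are never read.
after = remaining_targets(may_apply_to, rewritten={"A", "B"})

print(sorted(after["delete-A"]))  # candidates that remain unchecked
print(removable(after))           # nothing can be dropped yet
```

   In this model the delete file only becomes removable after C and D have also 
been rewritten (or otherwise checked), which is the extra work the bin pack 
itself never does.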


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


