jackye1995 commented on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845621625


   > We cannot and are not deleting delete files in this action because it's 
actually much more difficult to find out which delete files are no longer in 
use than just checking which ones are referred to by the FileScanTasks for the 
files we are looking at.
   
   Yes, I am aware of this constraint. But the delete files are applied to each 
file scan task anyway; we just cannot remove them because of the constraint 
you described, so we have to call another action and do double work to fully 
remove them. Conversely, say we have another action that only removes delete 
files: that action reads those delete files anyway, and it also feels wasteful 
to me that we then have to do another bin pack after deleting those files to 
re-optimize the data files, which potentially causes more commit conflicts.
   
   I understand that doing them as 2 different actions gives a good separation 
of concerns. But when I try to imagine what the compaction API looks like, it 
seems that I just need a different `selectFilesToRewrite` implementation of 
the rewrite strategy, and most other things can be reused with a little 
branching logic.
   
   So instead of having another totally different action that removes delete 
files, the major compaction can potentially be done as just an extension of 
the existing strategy, or by running a different strategy in its place. For 
example, we can extend the current bin pack strategy with the rule: if there 
are delete files in a file scan task, then the data file must be included for 
rewriting. We can also plug in a strategy that tries to select all data files 
based on a certain delete file threshold, etc.
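
   A rough sketch of that selection rule, to make the idea concrete. Everything 
here is a hypothetical stand-in: `Task` is a simplified placeholder rather than 
the real `FileScanTask`, and `DeleteAwareSelection`, `select`, and the 
`deleteThreshold` parameter are names invented for illustration only. A 
threshold of 1 corresponds to "any delete file forces a rewrite"; a higher 
value corresponds to the threshold-based variant.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a file scan task; NOT Iceberg's FileScanTask.
final class Task {
    final String path;
    final long sizeBytes;
    final int deleteFileCount; // number of delete files attached to this task

    Task(String path, long sizeBytes, int deleteFileCount) {
        this.path = path;
        this.sizeBytes = sizeBytes;
        this.deleteFileCount = deleteFileCount;
    }
}

// Hypothetical selection logic; assumes a size-range rule as the bin pack baseline.
final class DeleteAwareSelection {
    // Baseline rule: a data file is a rewrite candidate if its size is out of range.
    static boolean outsideSizeRange(Task t, long minSize, long maxSize) {
        return t.sizeBytes < minSize || t.sizeBytes > maxSize;
    }

    // Extension: also select any data file with at least deleteThreshold delete files.
    static List<Task> select(List<Task> tasks, long minSize, long maxSize,
                             int deleteThreshold) {
        List<Task> selected = new ArrayList<>();
        for (Task t : tasks) {
            boolean badSize = outsideSizeRange(t, minSize, maxSize);
            boolean tooManyDeletes = t.deleteFileCount >= deleteThreshold;
            if (badSize || tooManyDeletes) {
                selected.add(t);
            }
        }
        return selected;
    }
}
```

   The only difference from a plain size-based selection is the extra 
`tooManyDeletes` branch, which is the kind of small branching I have in mind.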
   
   We can get more clever about that as we evolve, but the general thought I 
have is that having data file rewriting and delete file compaction as one base 
action with different strategies to satisfy different use cases seems to be a 
more efficient and flexible way to go.
   




