RussellSpitzer edited a comment on pull request #2591: URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845628652
> > We cannot and are not deleting delete files in this action because it's actually much more difficult to find out which delete files are no longer in use than just checking which ones are referred to by the FileScanTasks for the files we are looking at.

> Yes I am aware of this condition, but the delete files are applied for each file scan task anyway, it's just we cannot remove it because of the condition you described, and we have to call another action, do double work to fully remove the file. Conversely, say we have another action to only remove delete files, then we are reading those delete files anyway, and it also feels wasteful to me that we have to do another bin pack after deleting those files to make the files more optimized, and potentially cause more commit conflicts.

They are only applied to the files we are looking at. It's not the reading of the delete files that is expensive; to determine whether a delete file is no longer needed, we have to scan through *every* data file it may apply to.

For example, Delete File A may touch data files A, B, C, and D. We plan for files A and B. After running binpack, we have read A and B and written A' and B'; we never touch C and D. Delete File A may now only touch C and D. If we now want to check whether we can remove Delete File A, we only have to read files C and D, so we actually made progress.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
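The Delete File A example in the comment above can be sketched as a toy model. This is only an illustration, not Iceberg code: `remaining_refs`, `removable`, and the `delete-A` name are hypothetical, and the sketch assumes that rewriting a data file applies its deletes, so the rewritten output no longer needs the delete file.

```python
# Toy model of the example in the comment; hypothetical names, not the Iceberg API.

def remaining_refs(delete_refs, rewritten):
    """Return the delete-file references left after rewriting some data files.

    delete_refs maps a delete file name to the set of data files it may
    still apply to. Assumption: a rewritten data file has had its deletes
    applied, so it drops out of every set.
    """
    return {name: files - rewritten for name, files in delete_refs.items()}


def removable(delete_refs):
    """Delete files that no longer apply to any data file (assumed safe to drop)."""
    return sorted(name for name, files in delete_refs.items() if not files)


# Delete File A may touch data files A, B, C, and D.
delete_refs = {"delete-A": {"A", "B", "C", "D"}}

# Binpack rewrites A and B into A' and B'; C and D are never touched.
after_first = remaining_refs(delete_refs, {"A", "B"})

# A later rewrite that happens to cover C and D.
after_second = remaining_refs(after_first, {"C", "D"})
```

After the first rewrite, only C and D remain to be checked for `delete-A`, which is the progress the comment describes. The second rewrite is an extra step added here to show when, under the stated assumption, the delete file stops being referenced at all.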
