RussellSpitzer commented on pull request #2591: URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845628652
> > We cannot and are not deleting delete files in this action because it's actually much more difficult to find out which delete files are no longer in use than just checking which ones are referred to by the FileScanTasks for the files we are looking at.
>
> Yes I am aware of this condition, but the delete files are applied for each file scan task anyway, it's just we cannot remove it because of the condition you described, and we have to call another action, do double work to fully remove the file. Conversely, say we have another action to only remove delete files, then we are reading those delete files anyway, and it also feels wasteful to me that we have to do another bin pack after deleting those files to make the files more optimized, and potentially cause more commit conflicts.

They are only applied to the files we are looking at. Reading the delete files is not the expensive part: to determine whether a delete file is no longer needed, we have to scan through *every* data file it may apply to. After the bin pack we do remove the rewritten files from the pool of files the delete file may apply to, but there is no way to check the rest without reading every file the delete file applies to. That's why the two are completely independent: while rewriting files, you don't do any work that benefits the delete file situation for other files.

For example:

Delete file A may touch data files A, B, C, and D.
We plan for files A and B.
Running the bin pack, we read A and B and write A' and B'. We never touch C and D.
Delete file A may now only touch C and D.
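
The bookkeeping in the example can be sketched as a toy model. This is only an illustration, assuming the data files a delete file may apply to can be represented as a plain set; `rewrite` and `is_droppable` are hypothetical names and are not Iceberg APIs.

```python
# Toy model only: plain sets stand in for table metadata.
# None of these names are Iceberg APIs.

def rewrite(candidates, planned):
    """Return the data files a delete file may still apply to after a bin pack.

    candidates: data files the delete file *may* apply to.
    planned: data files the bin pack read (with deletes applied) and replaced.
    """
    # The rewritten outputs already have the deletes applied, so only the
    # planned inputs leave the candidate pool. Nothing else changes.
    return candidates - planned


def is_droppable(candidates):
    # A delete file can be dropped only when no data file could still need it.
    return not candidates


delete_a_candidates = {"A", "B", "C", "D"}
planned = {"A", "B"}  # rewritten to A' and B'

remaining = rewrite(delete_a_candidates, planned)
print(sorted(remaining))        # ['C', 'D']
print(is_droppable(remaining))  # False
```

In this model the bin pack only shrinks the candidate pool by the files it planned. Deciding whether C and D still need delete file A would mean reading C and D, which the rewrite never does.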
