szehon-ho opened a new pull request, #3457:
URL: https://github.com/apache/iceberg/pull/3457
Expiring snapshots can take a long time for large tables with millions of
files and thousands of snapshots/manifests.
One cause is the calculation of files to be deleted. The current algorithm
is:
- Find the reachability graph of all snapshots before expiration
- Find the reachability graph of all snapshots after expiration
- Subtract the second from the first; these are the files to delete
But this explores every retained snapshot twice. For example, a periodic
expire-snapshots job that expires a single snapshot still has to explore all
n-1 retained snapshots twice (see the sketch below).
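A minimal sketch of the current set difference, assuming a hypothetical `reachableFiles` helper that walks manifest lists, manifests, and data files for a set of snapshot ids (the names here are illustrative, not the PR's actual code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Current approach: walk the reachability graph of all snapshots before
// expiration and of all retained snapshots, then take the set difference.
class CurrentExpireSketch {
  static Set<String> filesToDelete(
      List<Long> allSnapshotsBeforeExpiration,   // every snapshot, retained and expired
      List<Long> retainedSnapshots,              // snapshots kept after expiration
      Function<List<Long>, Set<String>> reachableFiles) {
    Set<String> before = reachableFiles.apply(allSnapshotsBeforeExpiration); // walks all n snapshots
    Set<String> after = reachableFiles.apply(retainedSnapshots);             // walks the retained snapshots again
    Set<String> toDelete = new HashSet<>(before);
    toDelete.removeAll(after);                   // reachable only before expiration => safe to delete
    return toDelete;
  }
}
```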
Proposal:
- Find the reachability graph of all snapshots retained after expiration
- Find the reachability graph of the expired snapshots (if only a few
snapshots are expired, this should be a much smaller set)
- Subtract the first from the second; these are the files to delete (sketched below)
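The same sketch under the proposed ordering, again using the hypothetical `reachableFiles` helper; the savings come from replacing the full pre-expiration walk with a walk of just the expired snapshots:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Proposed approach: only the expired snapshots' (typically small)
// reachability graph is walked in addition to the retained one, instead of
// walking the full pre-expiration graph as well.
class ProposedExpireSketch {
  static Set<String> filesToDelete(
      List<Long> retainedSnapshots,              // snapshots kept after expiration
      List<Long> expiredSnapshots,               // usually only a handful of snapshots
      Function<List<Long>, Set<String>> reachableFiles) {
    Set<String> retainedReachable = reachableFiles.apply(retainedSnapshots);
    Set<String> expiredReachable = reachableFiles.apply(expiredSnapshots);   // small graph to walk
    Set<String> toDelete = new HashSet<>(expiredReachable);
    toDelete.removeAll(retainedReachable);       // reachable only from expired snapshots => delete
    return toDelete;
  }
}
```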
Implementation: For the expired-snapshot scan, replace the original Spark
query over the metadata tables with custom Spark jobs that only explore from
the expired snapshot(s).
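A hedged sketch of what such a job could look like, assuming a hypothetical `readManifest` helper in place of Iceberg's actual manifest reader and a list of manifest paths collected from the expired snapshots' manifest lists (names are illustrative, not the PR's actual code):

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: instead of querying metadata tables for the whole table, parallelize
// only the manifests reachable from the expired snapshots and flat-map each
// one to the data file paths it contains.
class ExpiredSnapshotScanSketch {
  static JavaRDD<String> expiredDataFiles(
      JavaSparkContext sparkContext,
      List<String> expiredManifestPaths) {       // gathered from expired snapshots' manifest lists
    return sparkContext
        .parallelize(expiredManifestPaths)
        .flatMap(path -> readManifest(path).iterator());  // one manifest read per task
  }

  // Hypothetical stand-in for Iceberg's manifest reader.
  static List<String> readManifest(String manifestPath) {
    throw new UnsupportedOperationException("placeholder: read data file paths from one manifest");
  }
}
```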
Note: The new expired-snapshot scan duplicates the manifest-list scan logic
in order to handle the "write.manifest-lists.enabled"="false" flag, but
unfortunately that functionality appears to be broken even without this
change, so it is not currently possible to test. A test has been added for
demonstration purposes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]