szehon-ho opened a new pull request, #3457:
URL: https://github.com/apache/iceberg/pull/3457
Expiring snapshots can take a long time for large tables with millions of
files and thousands of snapshots/manifests.
One cause is the calculation of files to be deleted. The current algorithm
is:
- Find the reachability graph of all snapshots before expiration
- Find the reachability graph of all snapshots after expiration
- Subtract the second from the first; these are the files to delete
But this explores every retained snapshot twice. For example, a periodic
expire-snapshots job that expires a single snapshot still has to explore all
n-1 retained snapshots twice (see the sketch below).
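A minimal sketch of the current set difference, assuming a hypothetical `reachableFiles` helper that walks manifest lists, manifests, and data files for a set of snapshot ids (the names here are illustrative, not the PR's actual code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Current approach: walk the reachability graph of all snapshots before
// expiration and of all retained snapshots, then take the set difference.
class CurrentExpireSketch {
  static Set<String> filesToDelete(
      List<Long> allSnapshotsBeforeExpiration,   // every snapshot, retained and expired
      List<Long> retainedSnapshots,              // snapshots kept after expiration
      Function<List<Long>, Set<String>> reachableFiles) {
    Set<String> before = reachableFiles.apply(allSnapshotsBeforeExpiration); // walks all n snapshots
    Set<String> after = reachableFiles.apply(retainedSnapshots);             // walks the retained snapshots again
    Set<String> toDelete = new HashSet<>(before);
    toDelete.removeAll(after);                   // reachable only before expiration => safe to delete
    return toDelete;
  }
}
```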
Proposal:
- Find the reachability graph of all snapshots retained after expiration
- Find the reachability graph of the expired snapshots (if only a few
snapshots are expired, this should be a much smaller set)
- Subtract the first from the second; these are the files to delete (sketched below)
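The same sketch under the proposed ordering, again using the hypothetical `reachableFiles` helper; the savings come from replacing the full pre-expiration walk with a walk of just the expired snapshots:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Proposed approach: only the expired snapshots' (typically small)
// reachability graph is walked in addition to the retained one, instead of
// walking the full pre-expiration graph as well.
class ProposedExpireSketch {
  static Set<String> filesToDelete(
      List<Long> retainedSnapshots,              // snapshots kept after expiration
      List<Long> expiredSnapshots,               // usually only a handful of snapshots
      Function<List<Long>, Set<String>> reachableFiles) {
    Set<String> retainedReachable = reachableFiles.apply(retainedSnapshots);
    Set<String> expiredReachable = reachableFiles.apply(expiredSnapshots);   // small graph to walk
    Set<String> toDelete = new HashSet<>(expiredReachable);
    toDelete.removeAll(retainedReachable);       // reachable only from expired snapshots => delete
    return toDelete;
  }
}
```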
Implementation: For the expired-snapshot scan, replace the original Spark
query over the metadata tables with custom Spark jobs that only explore from
the expired snapshot(s).
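A hedged sketch of what such a job could look like, assuming a hypothetical `readManifest` helper in place of Iceberg's actual manifest reader and a list of manifest paths collected from the expired snapshots' manifest lists (names are illustrative, not the PR's actual code):

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: instead of querying metadata tables for the whole table, parallelize
// only the manifests reachable from the expired snapshots and flat-map each
// one to the data file paths it contains.
class ExpiredSnapshotScanSketch {
  static JavaRDD<String> expiredDataFiles(
      JavaSparkContext sparkContext,
      List<String> expiredManifestPaths) {       // gathered from expired snapshots' manifest lists
    return sparkContext
        .parallelize(expiredManifestPaths)
        .flatMap(path -> readManifest(path).iterator());  // one manifest read per task
  }

  // Hypothetical stand-in for Iceberg's manifest reader.
  static List<String> readManifest(String manifestPath) {
    throw new UnsupportedOperationException("placeholder: read data file paths from one manifest");
  }
}
```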
Note: The new expired-snapshot scan duplicates the manifest-list scan logic
in order to handle the "write.manifest-lists.enabled"="false" flag, but
unfortunately that functionality appears to be broken even without this
change, so it is not currently possible to test. A test has been added for
demonstration purposes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]