RussellSpitzer opened a new pull request #1211: URL: https://github.com/apache/iceberg/pull/1211
Hi everyone,

This is a work in progress on an idea to speed up and scale manifest scanning during snapshot expiration. I would be glad to hear any thoughts anyone has on the concept or implementation. Thanks for your feedback!

This PR adds a Spark Action that parallelizes the manifest scanning portion of snapshot expiration. Previously there was only a single method for expiring old data files associated with expired snapshots, and it required scanning all affected manifests locally. To take advantage of systems that can handle more simultaneous requests and IO, we move the manifest scanning portion of the expiration to Spark.

The new functionality is accessible through a new Spark Action, ExpireSnapshotsAction, which has a similar API to the local task but is instead executed on Spark.

The new action uses the existing local code to determine which manifest files are affected by snapshot expiration. It then parallelizes the manifest file names and scans the manifest files remotely. The actual deletion of unneeded data files is still performed locally.

To get the information required for performing the deletes, the RemoveSnapshots class is refactored so that the methods for discovering affected manifests can be called by other modules.
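As a rough illustration of the flow described above, here is a minimal sketch in plain Java. It is not the code in this PR: a parallel stream stands in for the Spark job, and `ExpireSketch`, `scanManifests`, `filesToDelete`, and the `readManifest` function are hypothetical names made up for this example. It also assumes, as a simplification, that a data file can be deleted when it is listed only in manifests of expired snapshots.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names are Iceberg or Spark APIs.
final class ExpireSketch {

  // Scan manifests in parallel and collect every data file path they list.
  // readManifest is a hypothetical reader: manifest path -> data file paths.
  // In the PR this step runs on Spark; a parallel stream is a local stand-in.
  static Set<String> scanManifests(
      Collection<String> manifests,
      Function<String, List<String>> readManifest) {
    return manifests.parallelStream()
        .flatMap(manifest -> readManifest.apply(manifest).stream())
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // Assumption for this sketch: a data file is deletable when it appears in
  // an expired manifest and in no retained manifest. The resulting set is
  // computed on the caller's side, mirroring the "delete locally" step.
  static Set<String> filesToDelete(
      Collection<String> expiredManifests,
      Collection<String> retainedManifests,
      Function<String, List<String>> readManifest) {
    Set<String> candidates = scanManifests(expiredManifests, readManifest);
    candidates.removeAll(scanManifests(retainedManifests, readManifest));
    return candidates;
  }

  public static void main(String[] args) {
    // Made-up manifest contents, purely for illustration.
    Map<String, List<String>> contents = Map.of(
        "m1.avro", List.of("a.parquet", "b.parquet"),
        "m2.avro", List.of("b.parquet", "c.parquet"));

    Set<String> toDelete =
        filesToDelete(List.of("m1.avro"), List.of("m2.avro"), contents::get);
    System.out.println(toDelete); // prints [a.parquet]
  }
}
```

Only the manifest file names cross the parallel boundary here, and the set of files to delete comes back to the caller. That is the same shape as the split described above, with scanning done remotely and deletion done locally.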
