RussellSpitzer opened a new pull request #1211:
URL: https://github.com/apache/iceberg/pull/1211


   Hi everyone, this is a work in progress on an idea to speed up and scale
   manifest scanning during snapshot expiration. I would be glad to hear any
   thoughts anyone has on the concept or implementation.
   
   Thanks for your feedback!
   
   
   Adds a Spark Action with the aim of parallelizing the manifest scanning
   portion of Snapshot expiration.
   
   Previously there was only a single method for expiring old data files
   associated with expired Snapshots, which required scanning all affected
   manifests locally. In order to take advantage of systems which can handle
   more simultaneous requests and IO, we move the manifest scanning portion of
   the expiration to Spark. The new functionality is accessible in a new Spark
   Action, ExpireSnapshotsAction, which has a similar API to the local task but
   is instead executed on Spark.
   
   The new action is implemented by utilizing the local code to determine which
   manifest files are affected by Snapshot expiration, then parallelizing the
   file names and performing the scanning of the manifest files remotely. The
   actual deletion of unneeded data files is still performed locally.
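
   As a rough illustration of the flow described above (not the code in this
   PR), the sketch below uses a plain Java parallel stream as a stand-in for
   the Spark job. The class ExpireSketch, its methods, and the in-memory
   manifestStore map are all hypothetical names invented for the example; a
   real manifest would be read from storage rather than looked up in a map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Consumer;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names come from Iceberg or from this PR.
public class ExpireSketch {

  // "Remote" step: scan each affected manifest in parallel and collect the
  // data files it lists. manifestStore is an assumed stand-in for reading a
  // manifest file; unknown manifests contribute no files.
  static List<String> scanInParallel(List<String> affectedManifests,
                                     Map<String, List<String>> manifestStore) {
    TreeSet<String> files = affectedManifests.parallelStream()
        .flatMap(m -> manifestStore
            .getOrDefault(m, Collections.<String>emptyList()).stream())
        .collect(Collectors.toCollection(TreeSet::new)); // sorted, de-duplicated
    return new ArrayList<>(files);
  }

  // Local steps: the list of affected manifests is determined by the caller,
  // and the deletes run in the calling process once the scan has finished.
  static List<String> expire(List<String> affectedManifests,
                             Map<String, List<String>> manifestStore,
                             Consumer<String> deleteFunc) {
    List<String> toDelete = scanInParallel(affectedManifests, manifestStore);
    toDelete.forEach(deleteFunc);
    return toDelete;
  }

  public static void main(String[] args) {
    Map<String, List<String>> store = new HashMap<>();
    store.put("m1.avro", Arrays.asList("data-a.parquet", "data-b.parquet"));
    store.put("m2.avro", Arrays.asList("data-b.parquet", "data-c.parquet"));

    List<String> deleted = expire(Arrays.asList("m1.avro", "m2.avro"), store,
        f -> System.out.println("delete " + f));
    System.out.println(deleted.size() + " files deleted");
  }
}
```

   Only the scan is fanned out in this sketch; collecting the results back
   before deleting mirrors the choice to keep the deletes on the local side.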
   
   To get the information required for performing the deletes, the Remove
   Snapshot class is refactored so that the methods relating to discovering
   affected manifests can be called by other modules.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



