amogh-jahagirdar opened a new issue, #5653:
URL: https://github.com/apache/iceberg/issues/5653

   ### Feature Request / Improvement
   
   After the change in https://github.com/apache/iceberg/pull/4578, which updated 
the expire snapshots procedure to respect retention policies for branching and 
tagging, one significant limitation is that the procedure cannot perform 
incremental file deletion. This is because, with branching, there is no direct 
visibility into which files can be removed; a reference set of "reachable" 
files has to be built from the metadata tree. 
   
   This issue has come up in previous community syncs, and I wanted to discuss 
the approach for it:
   
   1.) Update the remove snapshots API implementation to build an in-memory 
reference set of reachable files across the retained branch snapshots and tags. 
This poses a problem for large tables, where the list of files would be too 
large to hold in memory on a single node, which brings us to point 2.
   
   2.) For users with really large tables, as discussed in a previous community 
sync, it can reasonably be assumed that they have Spark infrastructure for 
running an effective distributed procedure. Currently the Spark procedure 
performs the metadata removal for expired snapshots (see 
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java#L185),
 and the Spark action itself is responsible for the anti-join of the files 
reachable before and after the expiration, and for the subsequent deletion.
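
   To make the reachable-set idea in points 1 and 2 concrete, here is a minimal 
single-node sketch in Python. It is a toy model, not Iceberg code: the 
`SNAPSHOT_FILES` mapping, the file names, and both helper functions are 
hypothetical, and a real implementation would have to walk manifest lists and 
manifests rather than a flat dictionary.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy table: snapshot id -> data files that snapshot references.
SNAPSHOT_FILES: Dict[int, FrozenSet[str]] = {
    1: frozenset({"a.parquet", "b.parquet"}),    # pinned by a tag below
    2: frozenset({"b.parquet", "tmp.parquet"}),  # intermediate snapshot on main
    3: frozenset({"c.parquet"}),                 # current head of main
}


def reachable_files(
    snapshot_files: Dict[int, FrozenSet[str]], retained_ids: Set[int]
) -> Set[str]:
    """Union of the files referenced by every retained snapshot."""
    reachable: Set[str] = set()
    for snapshot_id in retained_ids:
        reachable |= snapshot_files[snapshot_id]
    return reachable


def files_to_delete(
    snapshot_files: Dict[int, FrozenSet[str]],
    retained_before: Set[int],
    retained_after: Set[int],
) -> Set[str]:
    """Anti-join: files reachable before expiration but not after it."""
    before = reachable_files(snapshot_files, retained_before)
    after = reachable_files(snapshot_files, retained_after)
    return before - after


if __name__ == "__main__":
    # Assumed refs: branch "main" at snapshot 3, tag "v1" pinning snapshot 1.
    refs = {"main": 3, "v1": 1}
    expired = files_to_delete(SNAPSHOT_FILES, {1, 2, 3}, set(refs.values()))
    print(sorted(expired))  # prints ['tmp.parquet']
```

In this toy case `a.parquet` and `b.parquet` survive only because the tag still 
reaches snapshot 1, which is the information a reachable set captures. The 
`before` and `after` sets are also what would outgrow a single node on large 
tables, which is the motivation for doing the same anti-join in Spark.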
   
   The Spark procedure could also be updated to distribute this work better in 
the context of branching and tagging. We could refer (conceptually) to what 
Nessie does 
https://github.com/projectnessie/nessie/blob/main/gc/gc-base/src/main/java/org/projectnessie/gc/base/GCImpl.java#L58
 in its garbage collection implementation.
   
   If there is consensus in the community on this plan, I'll start the 
implementation.
   
   CC: @rdblue @jackye1995 @namrathamyske @aokolnychyi @RussellSpitzer 
   
   ### Query engine
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

