amogh-jahagirdar opened a new issue, #5653: URL: https://github.com/apache/iceberg/issues/5653
### Feature Request / Improvement

After the change in https://github.com/apache/iceberg/pull/4578, which updated the expire snapshots procedure to respect retention policies for branching and tagging, one significant limitation is that incremental file deletion can no longer be performed as part of the procedure. This is because branching by itself gives no visibility into which files can be removed; a reference set of "reachable" files has to be built from the metadata tree.

This issue has come up in previous community syncs, and I wanted to discuss the approach for it:

1. Update the remove snapshots API implementation to build an in-memory reference set of reachable files across the retained branch snapshots and tags. This poses a problem for large tables, where the list of files would be too large to hold in memory on a single node, which brings us to point 2.

2. For users with really large tables, as discussed in a previous community sync, it can reasonably be assumed that they have Spark infrastructure for running an effective distributed procedure. Currently the Spark procedure performs the metadata removal for expired snapshots (https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java#L185), and the Spark action itself takes responsibility for the anti-join of the reachable files before and after the expiration, and for the subsequent deletion. The Spark procedure could also be updated to be a better distributed procedure in the context of branching and tagging. We could refer (conceptually) to what Nessie is doing for its garbage collection implementation: https://github.com/projectnessie/nessie/blob/main/gc/gc-base/src/main/java/org/projectnessie/gc/base/GCImpl.java#L58
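The reference-set idea in point 1 can be illustrated with a small, language-neutral sketch in Python. It is not Iceberg code: `Snapshot`, `retained_snapshot_ids`, and `files_to_delete` are hypothetical names, each snapshot is assumed to already carry the flattened set of files reachable through its metadata tree, and retention is reduced to "keep the last N snapshots per ref" (a tag is modeled as a ref that keeps exactly one snapshot). The point is only that a file is safe to delete when an expired snapshot references it and no retained snapshot does.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified snapshot: id, parent link, and reachable files."""
    snapshot_id: int
    parent_id: Optional[int]
    files: FrozenSet[str]


def retained_snapshot_ids(
    snapshots: Dict[int, Snapshot],
    refs: Dict[str, Tuple[int, int]],
) -> Set[int]:
    """Collect snapshot ids kept by each ref.

    refs maps a ref name to (head snapshot id, number of snapshots to keep).
    This is an assumed, simplified retention policy for illustration only.
    """
    retained: Set[int] = set()
    for head_id, keep in refs.values():
        current = head_id
        # Walk parent links from the head until the keep budget is used up.
        while keep > 0 and current in snapshots:
            retained.add(current)
            current = snapshots[current].parent_id
            keep -= 1
    return retained


def files_to_delete(
    snapshots: Dict[int, Snapshot],
    refs: Dict[str, Tuple[int, int]],
) -> Set[str]:
    """Files referenced only by expired snapshots (set difference)."""
    retained = retained_snapshot_ids(snapshots, refs)
    # Reference set: every file still reachable from a retained snapshot.
    reachable: Set[str] = set()
    for sid in retained:
        reachable |= snapshots[sid].files
    # Candidates: every file referenced by a snapshot that will expire.
    candidates: Set[str] = set()
    for sid, snap in snapshots.items():
        if sid not in retained:
            candidates |= snap.files
    return candidates - reachable


SNAPSHOTS = {
    1: Snapshot(1, None, frozenset({"a.parquet", "b.parquet"})),
    2: Snapshot(2, 1, frozenset({"b.parquet", "c.parquet"})),
    3: Snapshot(3, 2, frozenset({"c.parquet", "d.parquet"})),
}

# A branch keeping one snapshot plus a tag pinned to snapshot 2.
REFS_WITH_TAG = {"main": (3, 1), "release-tag": (2, 1)}
print(sorted(files_to_delete(SNAPSHOTS, REFS_WITH_TAG)))  # ['a.parquet']
```

With the tag in place, `b.parquet` survives even though snapshot 1 expires, because the tagged snapshot still reaches it. The same set difference is what the anti-join in point 2 computes, only over distributed datasets rather than in-memory sets, which is why the in-memory version stops scaling once the reachable set no longer fits on one node.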
If there is consensus in the community on this plan, I'll start the implementation.

CC: @rdblue @jackye1995 @namrathamyske @aokolnychyi @RussellSpitzer

### Query engine

_No response_
