szehon-ho commented on issue #4159: URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1059409098
> I think the correct way to run remove orphan files is to do it for the entire warehouse.

> I hear your point and I think both use cases are valid. You are right, if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is by getting a list of all files under that prefix and querying system tables for all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common. I can be convinced otherwise.

This is an interesting idea that I also considered at one point: having the option to provide the whole listing for an S3-based storage bucket, filter it for all locations owned by known tables, and run remove orphans on those locations. If you know ahead of time that your 'bucket' holds mostly Iceberg table files, it might be less expensive overall (though this sounds like a single monster job if we are bottlenecked on the speed of physically deleting the files). Another consideration is that a system like S3 Inventory can be stale on the order of hours or days.

Though it still seems to go back to the main question: how do we define the "owned locations" of a table? (Sorry, please let me know if there is already a discussion on it that I missed.) That would be great to have. One thing I struggled with in the past is "alter table location"; if the previous locations could somehow be saved in a growing list, it would allow running more complete orphan removal jobs.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
