szehon-ho commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1059409098


   > I think the correct way to run remove orphan files is to do it for the 
entire warehouse.
   
   > I hear your point and I think both use cases are valid. You are right, if 
a single prefix is shared by many tables and object store locations are 
enabled, the only way to remove orphans is by getting a list of all files under 
that prefix and querying system tables for all Iceberg tables. However, I am 
not sure that's always possible. You have to assume no extra jobs use that 
location, you know all metastores/catalogs, etc. I'd say having a short 
per-table prefix or a set of prefixes would also be quite common. I can be 
convinced otherwise.
   
   This is an interesting idea that I also considered at one point: having the 
option to provide the whole listing of an S3-based storage bucket, filter it 
down to the locations owned by known tables, and run remove orphans on those 
locations. If you know ahead of time that your 'bucket' holds mostly Iceberg 
table files, it might be less expensive overall (though this sounds like a 
single monster job if we are bottlenecked on the speed of physically deleting 
the files). Another consideration is that a system like S3 Inventory will be 
stale on the order of hours or days.
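
   To make the idea concrete, here is a rough sketch of the filtering step, 
assuming the bucket listing, the table locations, and the set of files each 
table still references are already available as plain Python collections. The 
function name `find_orphans` and its parameters are hypothetical and not part 
of any Iceberg API; in practice the referenced files would come from each 
table's metadata.

```python
from typing import Dict, Iterable, List, Set


def find_orphans(
    listing: Iterable[str],
    table_locations: Dict[str, str],
    referenced_files: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Hypothetical helper: group listed paths by owning table and return
    the paths that the owning table does not reference."""
    # Normalize each location to end with "/" and try longer prefixes first,
    # so a nested location resolves to the most specific table.
    prefixes = sorted(
        ((loc.rstrip("/") + "/", name) for name, loc in table_locations.items()),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    orphans: Dict[str, List[str]] = {name: [] for name in table_locations}
    for path in listing:
        for prefix, name in prefixes:
            if path.startswith(prefix):
                if path not in referenced_files.get(name, set()):
                    orphans[name].append(path)
                break  # stop at the first (most specific) owning table
        # Paths under no known table location are left alone.
    return orphans
```

   Note that this only reports candidates. Because the listing may be stale, a 
real job would still need some age cutoff before deleting anything.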
   
   Though it still seems to go back to the main question: how do we define the 
"owned locations" of a table? (Sorry, please let me know if there is already a 
discussion on it that I missed.) That would be great to have. One thing I 
struggled with in the past is "alter table location"; if the previous locations 
could somehow be saved in a growing list, it would allow more complete orphan 
removal jobs.
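
   As a minimal illustration of the "growing list" idea, assuming nothing about 
how Iceberg would actually store it, the bookkeeping could look roughly like 
this. The class `OwnedLocations` and its methods are hypothetical names made up 
for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OwnedLocations:
    """Hypothetical record of every location a table has ever used."""

    current: str
    previous: List[str] = field(default_factory=list)

    def relocate(self, new_location: str) -> None:
        # Mimics an "alter table location": remember the old location
        # instead of forgetting it.
        if new_location == self.current:
            return
        if self.current not in self.previous:
            self.previous.append(self.current)
        self.current = new_location

    def all_locations(self) -> List[str]:
        # Current location first, then earlier ones, without duplicates.
        return [self.current] + [p for p in self.previous if p != self.current]
```

   An orphan removal job could then scan `all_locations()` rather than only the 
current location.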


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
