aokolnychyi commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1048440808


   > The current behavior in CatalogUtil of keeping data while removing 
metadata feels quite odd to me. If metadata is removed maybe the result data is 
still useful and can be reconstructed as a Hive table, but when object storage 
mode is enabled, it's basically not possible to track down the file locations, 
making everything just orphan files.
   
   That's generally true, but there is a valid use case for keeping data files: 
the SNAPSHOT command in Spark creates an Iceberg table that points to 
non-Iceberg files. Those referenced data files may belong to other prod Hive 
tables. We allow creating Iceberg metadata to experiment with that data, but we 
can't remove the imported data files, since that may corrupt the original 
tables.
   
   > I think the correct way to run remove orphan files is to do it for the 
entire warehouse.
   
   I hear your point and I think both use cases are valid. You are right: if a 
single prefix is shared by many tables and object store locations are enabled, 
the only way to remove orphans is to list all files under that prefix and 
query the system tables for all Iceberg tables. However, I am not sure that's 
always possible. You have to assume that no extra jobs use that location, that 
you know all metastores/catalogs, etc. I'd say having a short per-table prefix 
or a set of prefixes would also be quite common. I can be convinced otherwise.
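   
   To make the warehouse-wide approach concrete, here is a rough sketch of the 
set difference it boils down to. It is not Iceberg's implementation: the 
function name `find_orphans`, the paths, and the table names are all 
hypothetical, and it assumes the file listing and the per-table references have 
already been collected.

```python
from typing import Dict, Iterable, Set


def find_orphans(
    listed_files: Iterable[str],
    referenced_by_table: Dict[str, Iterable[str]],
) -> Set[str]:
    """Return listed files that none of the known tables reference."""
    referenced: Set[str] = set()
    for files in referenced_by_table.values():
        referenced.update(files)
    return set(listed_files) - referenced


# Hypothetical listing of a shared prefix with object store locations enabled.
listed = [
    "s3://bucket/wh/a1b2/f1.parquet",
    "s3://bucket/wh/c3d4/f2.parquet",
    "s3://bucket/wh/e5f6/f3.parquet",
]

# Hypothetical references gathered from every table we know about.
referenced = {
    "db.t1": ["s3://bucket/wh/a1b2/f1.parquet"],
    "db.t2": ["s3://bucket/wh/c3d4/f2.parquet"],
}

orphans = find_orphans(listed, referenced)
```

   The result is only as good as `referenced`: a file owned by a table in a 
catalog we did not query, or written by an unrelated job, would be reported as 
an orphan too.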
   
   I'd be interested to hear more from other folks too.




