aokolnychyi commented on issue #4159: URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1048440808
> The current behavior in CatalogUtil of keeping data while removing metadata feels quite odd to me. If metadata is removed maybe the result data is still useful and can be reconstructed as a Hive table, but when object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files.

That's generally true, but there is a valid use case for keeping data files: you may use the SNAPSHOT command in Spark, which creates an Iceberg table pointing to non-Iceberg files. Those referenced data files may belong to other prod Hive tables. We allow creating Iceberg metadata to play with that data, but we can't remove the imported data files. Doing so may corrupt the original tables.

> I think the correct way to run remove orphan files is to do it for the entire warehouse.

I hear your point and I think both use cases are valid. You are right: if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is to list all files under that prefix and query the system tables of all Iceberg tables. However, I am not sure that's always possible. You have to assume that no extra jobs use that location, that you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common.

I can be convinced otherwise. I'd be interested to hear more from other folks too.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
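To make the warehouse-wide approach discussed in the comment concrete, here is a rough sketch in plain Python. It is not Iceberg's implementation and uses no Iceberg API: `find_orphans`, the file listing, and the per-table reference sets are all hypothetical stand-ins for "list everything under the shared prefix" and "query the system tables of every known table".

```python
from typing import Iterable, Mapping, Set


def find_orphans(
    listed_files: Iterable[str],
    referenced_by_table: Mapping[str, Iterable[str]],
) -> Set[str]:
    """Return listed files that no known table references.

    listed_files: every path found under the shared prefix (hypothetical input).
    referenced_by_table: table name -> paths that table's metadata references
    (hypothetical input, e.g. gathered from each table's system tables).
    """
    referenced: Set[str] = set()
    for files in referenced_by_table.values():
        referenced.update(files)
    # Anything listed but not referenced by any known table is treated as an orphan.
    return set(listed_files) - referenced


# Made-up paths under a shared prefix with object-store-style hashed locations.
listed = [
    "s3://bucket/data/0a1b/t1-file-a.parquet",
    "s3://bucket/data/7f3c/t2-file-b.parquet",
    "s3://bucket/data/9d2e/leftover.parquet",
]
known_tables = {
    "db.t1": ["s3://bucket/data/0a1b/t1-file-a.parquet"],
    "db.t2": ["s3://bucket/data/7f3c/t2-file-b.parquet"],
}

orphans = find_orphans(listed, known_tables)
```

The sketch also shows why the assumptions matter: if `db.t2` lived in a catalog the job did not know about, its file would be missing from `known_tables` and would be misclassified as an orphan.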
