aokolnychyi opened a new issue #4159: URL: https://github.com/apache/iceberg/issues/4159
The question about location ownership and file removal comes up in a lot of discussions. See [here](https://github.com/apache/iceberg/pull/3056#discussion_r804826159) for an example.

Right now, our interpretation of `gc.enabled` is not consistent.
- `HadoopTables` ignores it while purging files.
- `CatalogUtil` keeps data files if `gc.enabled` is false while purging files, but still removes all metadata files.
- `DeleteReachableFiles` and other actions throw an exception if `gc.enabled` is false.

I think we should make the behavior consistent. My initial thoughts are below.

1. One way to interpret disabled garbage collection is to disallow removal of data and delete files. For instance, it should not be possible to expire snapshots.
2. We should add a list of location prefixes that are owned by the table to our metadata. Until that is done, we can use `gc.enabled` to prohibit dangerous actions. For example, `DeleteOrphanFiles` should throw an exception if garbage collection is disabled. Once we know which locations are owned by the table, we can reconsider that check.

Here is a list of places that may physically remove files.

**Expire snapshots**

It should not be possible to expire snapshots if `gc.enabled` is false.

**Delete orphan files**

For now, we should continue to throw an exception if `gc.enabled` is false. Once we know which prefixes are owned by the table, we can allow removal of orphan files in locations that are owned by the table.

**Delete reachable files**

It shouldn't be possible to delete data and delete files if garbage collection is disabled. However, we may consider allowing removal of metadata when `gc.enabled` is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data files or only metadata files.

**Drop and purge tables**

I think it should match the removal of reachable files and be consistent across all APIs. Once we know the locations owned by the table, we may drop them too.
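To make the proposal concrete, here is a rough sketch of what a single shared check could look like. It is only an illustration: `GcPolicy`, `Operation`, `FileKind`, `mayRemove`, and `check` are hypothetical names that do not exist in Iceberg, and the sketch assumes that `gc.enabled` defaults to true when the property is absent. Allowing metadata removal with GC disabled is the "may consider" option from the list above, not a settled decision.

```java
import java.util.Map;

/** Hypothetical helper (not an Iceberg class) that centralizes the gc.enabled decision. */
final class GcPolicy {
  static final String GC_ENABLED = "gc.enabled";
  // Assumption: garbage collection is enabled when the property is not set.
  static final boolean GC_ENABLED_DEFAULT = true;

  /** Places that may physically remove files, as listed in this issue. */
  enum Operation { EXPIRE_SNAPSHOTS, DELETE_ORPHAN_FILES, DELETE_REACHABLE_FILES, DROP_AND_PURGE }

  /** Kinds of files an operation may try to remove. */
  enum FileKind { DATA, DELETE, METADATA }

  private GcPolicy() {}

  /** Reads gc.enabled from table properties, falling back to the assumed default. */
  static boolean gcEnabled(Map<String, String> props) {
    String value = props.get(GC_ENABLED);
    return value == null ? GC_ENABLED_DEFAULT : Boolean.parseBoolean(value);
  }

  /** Returns true if the operation may remove files of the given kind. */
  static boolean mayRemove(Map<String, String> props, Operation op, FileKind kind) {
    if (gcEnabled(props)) {
      return true;
    }
    switch (op) {
      case DELETE_REACHABLE_FILES:
      case DROP_AND_PURGE:
        // With GC disabled, only metadata is treated as owned by the table.
        return kind == FileKind.METADATA;
      default:
        // Expiring snapshots and deleting orphan files are prohibited outright.
        return false;
    }
  }

  /** Fails fast, so every API reports a disabled GC in the same way. */
  static void check(Map<String, String> props, Operation op, FileKind kind) {
    if (!mayRemove(props, op, kind)) {
      throw new IllegalStateException(
          "Cannot remove " + kind + " files in " + op + ": GC is disabled");
    }
  }
}
```

With something like this, `HadoopTables`, `CatalogUtil`, and the actions would all consult the same method instead of interpreting the property on their own. Once owned location prefixes are part of table metadata, the `DELETE_ORPHAN_FILES` branch is where the check could be relaxed.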
cc @jackye1995 @rdblue @pvary @RussellSpitzer @flyrain @szehon-ho @danielcweeks @karuppayya
