aokolnychyi opened a new issue #4159:
URL: https://github.com/apache/iceberg/issues/4159


   The question about location ownership and file removal comes up in a lot of 
discussions. See 
[here](https://github.com/apache/iceberg/pull/3056#discussion_r804826159) for 
an example.
   
   Right now, our interpretation of `gc.enabled` is not consistent.
   - `HadoopTables` ignores it while purging files.
   - `CatalogUtil` keeps data files if `gc.enabled` is false while purging files 
but still removes all metadata files.
   - `DeleteReachableFiles` and other actions throw an exception if 
`gc.enabled` is false.
   
   I think we should make the behavior consistent. My initial thoughts are 
below.
   
   1. One way to interpret disabled garbage collection is to disallow removal 
of data and delete files. For instance, it should not be possible to expire 
snapshots.
   2. We should add a list of location prefixes that are owned by the table to 
our metadata. Until that is done, we can use `gc.enabled` to prohibit dangerous 
actions. For example, `DeleteOrphanFiles` should throw an exception if garbage 
collection is disabled. Once we know which locations are owned by the table, we 
can reconsider that check.
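
   To make the two points above concrete, here is a minimal sketch of a single shared guard that every destructive action would call. It is plain Python that only models the policy, not Iceberg code: `check_gc_enabled`, `expire_snapshots` and `delete_orphan_files` are hypothetical names, and the only facts taken from Iceberg are the property name `gc.enabled` and its default of true.

```python
GC_ENABLED = "gc.enabled"
GC_ENABLED_DEFAULT = True  # Iceberg treats a missing gc.enabled as true


def gc_enabled(properties: dict) -> bool:
    """Read gc.enabled from a table's string-valued properties."""
    value = properties.get(GC_ENABLED)
    if value is None:
        return GC_ENABLED_DEFAULT
    return value.strip().lower() == "true"


def check_gc_enabled(properties: dict, action: str) -> None:
    """Hypothetical shared guard: refuse a destructive action when GC is disabled."""
    if not gc_enabled(properties):
        raise ValueError(f"Cannot {action}: GC is disabled ({GC_ENABLED}=false)")


def expire_snapshots(properties: dict) -> str:
    # Stand-in for the real action; only the guard matters in this sketch.
    check_gc_enabled(properties, "expire snapshots")
    return "expired"


def delete_orphan_files(properties: dict) -> str:
    # Same guard, so both actions interpret the flag identically.
    check_gc_enabled(properties, "delete orphan files")
    return "deleted"
```

   Routing every code path through one guard is what would make the interpretation consistent; the exception type and message here are placeholders.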
   
   Here is a list of places that may physically remove files.
   
   **Expire snapshots**
   
   It should not be possible to expire snapshots if `gc.enabled` is false.
   
   **Delete orphan files**
   
   For now, we should continue to throw an exception if `gc.enabled` is false. 
Once we know what prefixes are owned by the table, we can allow removal of 
orphan files in locations that are owned by the table.
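
   A rough sketch of how the orphan-file check could evolve, assuming a future metadata field that lists owned location prefixes. No such field exists today, so `owned_prefixes`, `is_owned` and `removable_orphans` are hypothetical. Restricting removal to owned prefixes even when garbage collection is enabled is one possible reading, not something decided above.

```python
from typing import List, Optional


def is_owned(path: str, owned_prefixes: List[str]) -> bool:
    """True if path falls under one of the table-owned location prefixes."""
    for prefix in owned_prefixes:
        # Force a trailing slash so "s3://bucket/tbl" does not match "s3://bucket/tbl2/...".
        normalized = prefix.rstrip("/") + "/"
        if path.startswith(normalized):
            return True
    return False


def removable_orphans(
    orphans: List[str],
    owned_prefixes: Optional[List[str]],
    gc_enabled: bool,
) -> List[str]:
    """Select which orphan files may be physically removed."""
    if owned_prefixes is None:
        # Ownership is unknown: keep the current behavior and fail if GC is disabled.
        if not gc_enabled:
            raise ValueError("Cannot delete orphan files: GC is disabled")
        return list(orphans)
    # Ownership is known: only touch files inside locations the table owns.
    return [path for path in orphans if is_owned(path, owned_prefixes)]
```

   For example, with the owned prefix `s3://bucket/tbl`, a file under `s3://bucket/tbl/data/` is removable while one under `s3://bucket/tbl2/data/` is left alone, regardless of `gc.enabled`.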
   
   **Delete reachable files**
   
   It shouldn't be possible to delete data and delete files if garbage 
collection is disabled. However, we may consider allowing removal of metadata 
when `gc.enabled` is false. One may argue that metadata files are always owned 
by the table. We should also make our action configurable so that it can delete 
only data or metadata files.
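
   One way the configurable action could select files is sketched below. It assumes we adopt the option of still removing metadata when garbage collection is disabled, which is only floated above, not settled. `Scope` and `files_to_delete` are hypothetical names, and the `"data"` key stands for both data files and delete files.

```python
from enum import Enum
from typing import Dict, List


class Scope(Enum):
    ALL = "all"
    DATA_ONLY = "data"
    METADATA_ONLY = "metadata"


def files_to_delete(
    reachable: Dict[str, List[str]],
    gc_enabled: bool,
    scope: Scope = Scope.ALL,
) -> List[str]:
    """Pick reachable files to remove; `reachable` maps "data" and "metadata" to paths."""
    if scope is Scope.DATA_ONLY and not gc_enabled:
        # The caller asked for exactly what a disabled GC forbids.
        raise ValueError("Cannot delete data files: GC is disabled")

    selected: List[str] = []
    if scope in (Scope.ALL, Scope.DATA_ONLY) and gc_enabled:
        # Data and delete files are removed only when GC is enabled.
        selected += reachable.get("data", [])
    if scope in (Scope.ALL, Scope.METADATA_ONLY):
        # Metadata files are treated as always owned by the table.
        selected += reachable.get("metadata", [])
    return selected
```

   With `Scope.ALL` and garbage collection disabled, this returns only the metadata files instead of failing. A drop-and-purge path could reuse the same selection so that both APIs behave identically.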
   
   **Drop and purge tables**
   
   I think this should match the removal of reachable files and be consistent 
across all APIs. Once we know which locations are owned by the table, we may 
drop them too.
   
   cc @jackye1995 @rdblue @pvary @RussellSpitzer @flyrain @szehon-ho 
@danielcweeks @karuppayya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


