ConeyLiu commented on issue #4159: URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1048551526
As @aokolnychyi suggested in [3056](https://github.com/apache/iceberg/pull/3056), we can use `DeleteReachableFiles` to purge table data, which could scale and perform much better. However, there are still some drawbacks to consider:

1. Different catalogs implement drop table differently. For example, `HadoopCatalog`/`HadoopTables` delete the whole warehouse directly and ignore the purge argument. In this case, we could not use `DeleteReachableFiles`.
2. User-defined catalogs may have customized features, such as sending events/metrics when purging data. With `DeleteReachableFiles`, those operations would be skipped.

> I think it should match the removal of reachable files and be consistent in all APIs. Once we know locations owned by the table, we may drop them too.

I think this is necessary. We should unify the drop table [purge] behavior of the built-in catalogs. We may also need to define an interface that supports parallel operations (by leveraging a distributed engine such as Spark, Flink, and others).
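To make the interface idea from the comment a bit more concrete, here is a minimal sketch of what a pluggable purge strategy might look like. Every name in it (`PurgeSketch`, `FilePurger`, `sequentialPurger`, `parallelPurger`, `dropTable`) is hypothetical and not part of Iceberg. It uses only the JDK, and a local thread pool stands in for a distributed engine. The delete callback is passed in by the caller, so a custom catalog could still emit its own events or metrics per file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: none of these types or methods exist in Iceberg.
final class PurgeSketch {

  /** Assumed strategy a catalog could delegate to when drop table is called with purge. */
  interface FilePurger {
    void purge(List<String> reachableFiles, Consumer<String> deleteFunc);
  }

  /** Deletes files one by one on the calling thread. */
  static FilePurger sequentialPurger() {
    return (files, deleteFunc) -> files.forEach(deleteFunc);
  }

  /** Deletes files on a local thread pool; an engine-backed purger could replace this. */
  static FilePurger parallelPurger(int threads) {
    return (files, deleteFunc) -> {
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      try {
        List<Future<?>> pending = new ArrayList<>();
        for (String file : files) {
          pending.add(pool.submit(() -> deleteFunc.accept(file)));
        }
        // Wait for every delete so failures surface to the caller.
        for (Future<?> result : pending) {
          try {
            result.get();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while purging", e);
          } catch (ExecutionException e) {
            throw new RuntimeException("Failed to delete a file", e.getCause());
          }
        }
      } finally {
        pool.shutdown();
      }
    };
  }

  /** Assumed unified drop path: every catalog honors the purge flag the same way. */
  static void dropTable(boolean purge, List<String> reachableFiles,
                        FilePurger purger, Consumer<String> deleteFunc) {
    if (purge) {
      purger.purge(reachableFiles, deleteFunc);
    }
    // Removing the table entry from the catalog would follow here.
  }
}
```

With something like this, a built-in catalog could pass a plain file delete as `deleteFunc`, while a custom catalog could wrap the delete with its own event or metric reporting and still share the same purge path.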
