Hi folks,

I'd like to share recent progress on adding actions to copy tables across
different places.

There is a constant need to copy tables across different places for
purposes such as disaster recovery and testing. Because Iceberg metadata
stores absolute file paths, a table whose files are simply copied to a new
location does not work automatically. There are three generic solutions:
1. Rebuild the metadata: This is a proven approach widely used across
various companies.
2. S3 access point: Effective when both the source and target locations are
in S3, but not applicable to other storage systems.
3. Relative path: It requires changes to the table specification.
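
To illustrate what the first approach amounts to, here is a minimal sketch
of the path rewriting involved. It is not the code in the PRs; the function
name, the prefixes, and the example path are hypothetical, and a real
action would have to apply this to every path stored in the table
metadata.

```python
def rewrite_path(path: str, source_prefix: str, target_prefix: str) -> str:
    """Replace source_prefix with target_prefix at the start of path.

    Hypothetical helper for illustration only. Both prefixes are
    normalized to end with "/" so that "s3://a/tbl" does not match a
    sibling location such as "s3://a/tbl2".
    """
    src = source_prefix.rstrip("/") + "/"
    dst = target_prefix.rstrip("/") + "/"
    if not path.startswith(src):
        raise ValueError(f"{path!r} is not under {src!r}")
    return dst + path[len(src):]


# Hypothetical absolute path as it might appear in table metadata.
old_path = "s3://bucket-a/db/tbl/data/file-1.parquet"
new_path = rewrite_path(old_path, "s3://bucket-a/db/tbl", "s3://bucket-b/db/tbl")
```

Paths outside the source prefix raise an error here instead of being
copied through silently, since a stale absolute path in the target table
would still point at the source location.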

We focus on the first approach in this thread. Although the code was
shared two years ago here <https://github.com/apache/iceberg/pull/4705>, it
was never merged. We picked it up again recently. Here are the active PRs
related to this action. We would really appreciate any feedback and review:

   - PR to add CopyTable action:
   https://github.com/apache/iceberg/pull/10024
   - PR to add CheckSnapshotIntegrity action:
   https://github.com/apache/iceberg/pull/10642
   - PR to add RemoveExpiredFiles action:
   https://github.com/apache/iceberg/pull/10643

Here is a Google Doc with more details to clarify the goals and approach:
https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing

Yufei
