Agreed with Péter. I will bring up the relative path changes at the next community sync and help drive this.
~ Anurag Mantripragada

> On Jul 8, 2024, at 10:50 PM, Péter Váry <[email protected]> wrote:
>
> I think in most cases the copy table action doesn't require a query engine to
> read and generate the new metadata files. This means, that it would be nice
> to provide a pure Java implementation in the core, and it could be
> extended/reused by different engines, like Spark, to execute it in a
> distributed manner, when distributed execution is needed.
>
> About the copy vs. relative path debate:
> - I have seen the relative path requirement coming up multiple times in the
> past. Seems like a feature requested by multiple users, so I think it would
> be the best to discuss it in a different thread. The Copy Table Action might
> be used to move absolute path tables to relative path tables when migration
> is needed.
>
> On Mon, Jul 8, 2024, 21:52 Anurag Mantripragada
> <[email protected]> wrote:
>> Hi Yufei.
>>
>> Thanks for the proposal. While the actions are great, they still need to do
>> a lot of work which can be reduced if we have the relative path changes. I
>> still support adding these actions as moving data was out of scope for the
>> relative path design and we can use these actions as helpers when the spec
>> change is done.
>>
>> Anurag Mantripragada
>>
>>> On Jul 8, 2024, at 10:55 AM, Pucheng Yang <[email protected]> wrote:
>>>
>>> Thanks for picking this up, I think this is a very valuable addition.
>>>
>>> On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu <[email protected]> wrote:
>>>> Hi folks,
>>>>
>>>> I'd like to share a recent progress of adding actions to copy tables
>>>> across different places.
>>>>
>>>> There is a constant need to copy tables across different places for
>>>> purposes such as disaster recovery and testing. Due to the absolute file
>>>> paths in Iceberg metadata, it doesn't work automatically. There are three
>>>> generic solutions:
>>>> 1. Rebuild the metadata: This is a proven approach widely used across
>>>> various companies.
>>>> 2. S3 access point: Effective when both the source and target locations
>>>> are in S3, but not applicable to other storage systems.
>>>> 3. Relative path: It requires changes to the table specification.
>>>>
>>>> We focus on the first approach in this thread. While the code has been
>>>> shared 2 years ago here <https://github.com/apache/iceberg/pull/4705>, it
>>>> has never been merged. We picked it up recently. Here are the active PRs
>>>> related to this action. Would really appreciate any feedback and review:
>>>> PR to add CopyTable action: https://github.com/apache/iceberg/pull/10024
>>>> PR to add CheckSnapshotIntegrity action:
>>>> https://github.com/apache/iceberg/pull/10642
>>>> PR to add RemoveExpiredFiles action:
>>>> https://github.com/apache/iceberg/pull/10643
>>>> Here is a google doc with more details to clarify the goals and approach:
>>>> https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing
>>>>
>>>> Yufei
>>
