> > For RemoveExpiredFiles, I'm admittedly a bit skeptical if it's required > since orphan file removal should be able to cleanup the files in the > copied table. Are we able to elaborate why there's a concern with removing > snapshots on the copied table and subsequently relying on orphan file > removal on the copied table to remove the actual files? Is it around > listing?
I have the same concern as Amogh. I already mentioned the same thing in the PR yesterday <https://github.com/apache/iceberg/pull/10643#discussion_r1669739401>. I suggested renaming it as *RemoveTableCopyOrphanFiles. *Thinking more on this today, I think we should atomically (implicitly) handle cleaning up of orphan files as part of copy table action instead of a separate action. Also, very happy to see the progress on this one. This will help users to move the data from one location to another seamlessly. - Ajantha On Wed, Jul 10, 2024 at 7:35 AM Amogh Jahagirdar <2am...@gmail.com> wrote: > Thanks Yufei! > > +1 on having a copy table action, I think that's pretty valuable. I have > some ideas on interfaces based on previous work I've done for > region/multi-cloud replication of Iceberg tables. The absolute vs relative > path discussion is interesting, I have some questions on how relative > pathing would look like but I'll wait for Anurag's input. > > On CheckSnapshotIntegrity, I think I'd probably advocate for having a more > general "Repair Metadata" procedure. Currently, it looks like > CheckSnapshotIntegrity just tells a user what files are missing in its > output. I think we could go a step further and attempt to handle cases > where a manifest entry refers to a file which no longer exists. We could > attempt a recovery of that file if the fileIO implementation supports that > via some sort of a SupportsRecovery mixin. There's also another corruption > case where duplicate file entries end up in manifests, we can define an > approach on reconciling that and write out new manifests. > There's actually been two attempts on this, one from Szehon quite a while > back https://github.com/apache/iceberg/pull/2608 and another more > recently from Matt https://github.com/apache/iceberg/pull/10445 . Perhaps > we could review both of these and figure out a path forward for this? > For just verifying the integrity of the copy table, we could have a dry > run option for the repair metadata operation which would output any missing > files, or manifests with duplicates without performing any recovery/fixing > up. > > For RemoveExpiredFiles, I'm admittedly a bit skeptical if it's required > since orphan file removal should be able to cleanup the files in the > copied table. Are we able to elaborate why there's a concern with removing > snapshots on the copied table and subsequently relying on orphan file > removal on the copied table to remove the actual files? Is it around > listing? > > Overall this is great to see. > > Thanks, > Amogh Jahagirdar > > > > > On Tue, Jul 9, 2024 at 10:59 AM Anurag Mantripragada > <amantriprag...@apple.com.invalid> wrote: > >> Agreed with Peter. I will bring relative paths changes up in the next >> community sync. I will help drive this. >> >> >> ~ Anurag Mantripragada >> >> >> >> >> >> >> On Jul 8, 2024, at 10:50 PM, Péter Váry <peter.vary.apa...@gmail.com> >> wrote: >> >> I think in most cases the copy table action doesn't require a query >> engine to read and generate the new metadata files. This means, that it >> would be nice to provide a pure Java implementation in the core, and it >> could be extended/reused by different engines, like Spark, to execute it in >> a distributed manner, when distributed execution is needed. >> >> About the copy vs. relative path debate: >> - I have seen the relative path requirement coming up multiple times in >> the past. Seems like a feature requested by multiple users, so I think it >> would be the best to discuss it in a different thread. The Copy Table >> Action might be used to move absolute path tables to relative path tables >> when migration is needed. >> >> On Mon, Jul 8, 2024, 21:52 Anurag Mantripragada >> <amantriprag...@apple.com.invalid> wrote: >> >>> Hi Yufei. >>> >>> Thanks for the proposal. While the actions are great, they still need to >>> do a lot of work which can be reduced if we have the relative path changes. >>> I still support adding these actions as moving data was out of scope for >>> the relative path design and we can use these actions as helpers when the >>> spec change is done. >>> >>> Anurag Mantripragada >>> >>> On Jul 8, 2024, at 10:55 AM, Pucheng Yang <pucheng.yo...@gmail.com> >>> wrote: >>> >>> Thanks for picking this up, I think this is a very valuable addition. >>> >>> On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu <flyrain...@gmail.com> wrote: >>> >>>> Hi folks, >>>> >>>> I'd like to share a recent progress of adding actions to copy tables >>>> across different places. >>>> >>>> There is a constant need to copy tables across different places for >>>> purposes such as disaster recovery and testing. Due to the absolute file >>>> paths in Iceberg metadata, it doesn't work automatically. There are three >>>> generic solutions: >>>> 1. Rebuild the metadata: This is a proven approach widely used across >>>> various companies. >>>> 2. S3 access point: Effective when both the source and target locations >>>> are in S3, but not applicable to other storage systems. >>>> 3. Relative path: It requires changes to the table specification. >>>> >>>> We focus on the first approach in this thread. While the code has been >>>> shared 2 years ago here <https://github.com/apache/iceberg/pull/4705>, >>>> it has never been merged. We picked it up recently. Here are the active PRs >>>> related to this action. Would really appreciate any feedback and review: >>>> >>>> - PR to add CopyTable action: >>>> https://github.com/apache/iceberg/pull/10024 >>>> - PR to add CheckSnapshotIntegrity action: >>>> https://github.com/apache/iceberg/pull/10642 >>>> - PR to add RemoveExpiredFiles action: >>>> https://github.com/apache/iceberg/pull/10643 >>>> >>>> Here is a google doc with more details to clarify the goals and >>>> approach: >>>> https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing >>>> >>>> Yufei >>>> >>> >>> >>