>
> For RemoveExpiredFiles, I'm admittedly a bit skeptical if it's required
> since orphan file removal should be able to cleanup the files in the
> copied table. Are we able to elaborate why there's a concern with removing
> snapshots on the copied table and subsequently relying on orphan file
> removal on the copied table to remove the actual files? Is it around
> listing?


I have the same concern as Amogh. I already mentioned the same thing in the PR
yesterday
<https://github.com/apache/iceberg/pull/10643#discussion_r1669739401>.
I suggested renaming it as *RemoveTableCopyOrphanFiles. *Thinking more on
this today, I think we should atomically (implicitly) handle cleaning up of
orphan files as part of copy table action instead of a separate action.

Also, very happy to see the progress on this one. This will help users to
move the data from one location to another seamlessly.

- Ajantha


On Wed, Jul 10, 2024 at 7:35 AM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Thanks Yufei!
>
> +1 on having a copy table action, I think that's pretty valuable. I have
> some ideas on interfaces based on previous work I've done for
> region/multi-cloud replication of Iceberg tables. The absolute vs relative
> path discussion is interesting, I have some questions on how relative
> pathing would look like but I'll wait for Anurag's input.
>
> On CheckSnapshotIntegrity, I think I'd probably advocate for having a more
> general "Repair Metadata" procedure. Currently, it looks like
> CheckSnapshotIntegrity just tells a user what files are missing in its
> output. I think we could go a step further and attempt to handle cases
> where a manifest entry refers to a file which no longer exists. We could
> attempt a recovery of that file if the fileIO implementation supports that
> via some sort of a SupportsRecovery mixin. There's also another corruption
> case where duplicate file entries end up in manifests, we can define an
> approach on reconciling that and write out new manifests.
> There's actually been two attempts on this, one from Szehon quite a while
> back https://github.com/apache/iceberg/pull/2608 and another more
> recently from Matt https://github.com/apache/iceberg/pull/10445 . Perhaps
> we could review both of these and figure out a path forward for this?
> For just verifying the integrity of the copy table, we could have a dry
> run option for the repair metadata operation which would output any missing
> files, or manifests with duplicates without performing any recovery/fixing
> up.
>
> For RemoveExpiredFiles, I'm admittedly a bit skeptical if it's required
> since orphan file removal should be able to cleanup the files in the
> copied table. Are we able to elaborate why there's a concern with removing
> snapshots on the copied table and subsequently relying on orphan file
> removal on the copied table to remove the actual files? Is it around
> listing?
>
> Overall this is great to see.
>
> Thanks,
> Amogh Jahagirdar
>
>
>
>
> On Tue, Jul 9, 2024 at 10:59 AM Anurag Mantripragada
> <amantriprag...@apple.com.invalid> wrote:
>
>> Agreed with Peter. I will bring relative paths changes up in the next
>> community sync. I will help drive this.
>>
>>
>> ~ Anurag Mantripragada
>>
>>
>>
>>
>>
>>
>> On Jul 8, 2024, at 10:50 PM, Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>> I think in most cases the copy table action doesn't require a query
>> engine to read and generate the new metadata files. This means, that it
>> would be nice to provide a pure Java implementation in the core, and it
>> could be extended/reused by different engines, like Spark, to execute it in
>> a distributed manner, when distributed execution is needed.
>>
>> About the copy vs. relative path debate:
>> - I have seen the relative path requirement coming up multiple times in
>> the past. Seems like a feature requested by multiple users, so I think it
>> would be the best to discuss it in a different thread. The Copy Table
>> Action might be used to move absolute path tables to relative path tables
>> when migration is needed.
>>
>> On Mon, Jul 8, 2024, 21:52 Anurag Mantripragada
>> <amantriprag...@apple.com.invalid> wrote:
>>
>>> Hi Yufei.
>>>
>>> Thanks for the proposal. While the actions are great, they still need to
>>> do a lot of work which can be reduced if we have the relative path changes.
>>> I still support adding these actions as moving data was out of scope for
>>> the relative path design and we can use these actions as helpers when the
>>> spec change is done.
>>>
>>> Anurag Mantripragada
>>>
>>> On Jul 8, 2024, at 10:55 AM, Pucheng Yang <pucheng.yo...@gmail.com>
>>> wrote:
>>>
>>> Thanks for picking this up, I think this is a very valuable addition.
>>>
>>> On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I'd like to share a recent progress of adding actions to copy tables
>>>> across different places.
>>>>
>>>> There is a constant need to copy tables across different places for
>>>> purposes such as disaster recovery and testing. Due to the absolute file
>>>> paths in Iceberg metadata, it doesn't work automatically. There are three
>>>> generic solutions:
>>>> 1. Rebuild the metadata: This is a proven approach widely used across
>>>> various companies.
>>>> 2. S3 access point: Effective when both the source and target locations
>>>> are in S3, but not applicable to other storage systems.
>>>> 3. Relative path: It requires changes to the table specification.
>>>>
>>>> We focus on the first approach in this thread. While the code has been
>>>> shared 2 years ago here <https://github.com/apache/iceberg/pull/4705>,
>>>> it has never been merged. We picked it up recently. Here are the active PRs
>>>> related to this action. Would really appreciate any feedback and review:
>>>>
>>>>    - PR to add CopyTable action:
>>>>    https://github.com/apache/iceberg/pull/10024
>>>>    - PR to add CheckSnapshotIntegrity action:
>>>>    https://github.com/apache/iceberg/pull/10642
>>>>    - PR to add RemoveExpiredFiles action:
>>>>    https://github.com/apache/iceberg/pull/10643
>>>>
>>>> Here is a google doc with more details to clarify the goals and
>>>> approach:
>>>> https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing
>>>>
>>>> Yufei
>>>>
>>>
>>>
>>

Reply via email to