Thanks Yufei!

+1 on having a copy table action, I think that's pretty valuable. I have
some ideas on interfaces based on previous work I've done for
region/multi-cloud replication of Iceberg tables. The absolute vs relative
path discussion is interesting, I have some questions on how relative
pathing would look like but I'll wait for Anurag's input.

On CheckSnapshotIntegrity, I think I'd probably advocate for having a more
general "Repair Metadata" procedure. Currently, it looks like
CheckSnapshotIntegrity just tells a user what files are missing in its
output. I think we could go a step further and attempt to handle cases
where a manifest entry refers to a file which no longer exists. We could
attempt a recovery of that file if the fileIO implementation supports that
via some sort of a SupportsRecovery mixin. There's also another corruption
case where duplicate file entries end up in manifests, we can define an
approach on reconciling that and write out new manifests.
There's actually been two attempts on this, one from Szehon quite a while
back https://github.com/apache/iceberg/pull/2608 and another more recently
from Matt https://github.com/apache/iceberg/pull/10445 . Perhaps we could
review both of these and figure out a path forward for this?
For just verifying the integrity of the copy table, we could have a dry run
option for the repair metadata operation which would output any missing
files, or manifests with duplicates without performing any recovery/fixing
up.

For RemoveExpiredFiles, I'm admittedly a bit skeptical if it's required
since orphan file removal should be able to cleanup the files in the
copied table. Are we able to elaborate why there's a concern with removing
snapshots on the copied table and subsequently relying on orphan file
removal on the copied table to remove the actual files? Is it around
listing?

Overall this is great to see.

Thanks,
Amogh Jahagirdar




On Tue, Jul 9, 2024 at 10:59 AM Anurag Mantripragada
<amantriprag...@apple.com.invalid> wrote:

> Agreed with Peter. I will bring relative paths changes up in the next
> community sync. I will help drive this.
>
>
> ~ Anurag Mantripragada
>
>
>
>
>
>
> On Jul 8, 2024, at 10:50 PM, Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
> I think in most cases the copy table action doesn't require a query engine
> to read and generate the new metadata files. This means, that it would be
> nice to provide a pure Java implementation in the core, and it could be
> extended/reused by different engines, like Spark, to execute it in a
> distributed manner, when distributed execution is needed.
>
> About the copy vs. relative path debate:
> - I have seen the relative path requirement coming up multiple times in
> the past. Seems like a feature requested by multiple users, so I think it
> would be the best to discuss it in a different thread. The Copy Table
> Action might be used to move absolute path tables to relative path tables
> when migration is needed.
>
> On Mon, Jul 8, 2024, 21:52 Anurag Mantripragada
> <amantriprag...@apple.com.invalid> wrote:
>
>> Hi Yufei.
>>
>> Thanks for the proposal. While the actions are great, they still need to
>> do a lot of work which can be reduced if we have the relative path changes.
>> I still support adding these actions as moving data was out of scope for
>> the relative path design and we can use these actions as helpers when the
>> spec change is done.
>>
>> Anurag Mantripragada
>>
>> On Jul 8, 2024, at 10:55 AM, Pucheng Yang <pucheng.yo...@gmail.com>
>> wrote:
>>
>> Thanks for picking this up, I think this is a very valuable addition.
>>
>> On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi folks,
>>>
>>> I'd like to share a recent progress of adding actions to copy tables
>>> across different places.
>>>
>>> There is a constant need to copy tables across different places for
>>> purposes such as disaster recovery and testing. Due to the absolute file
>>> paths in Iceberg metadata, it doesn't work automatically. There are three
>>> generic solutions:
>>> 1. Rebuild the metadata: This is a proven approach widely used across
>>> various companies.
>>> 2. S3 access point: Effective when both the source and target locations
>>> are in S3, but not applicable to other storage systems.
>>> 3. Relative path: It requires changes to the table specification.
>>>
>>> We focus on the first approach in this thread. While the code has been
>>> shared 2 years ago here <https://github.com/apache/iceberg/pull/4705>,
>>> it has never been merged. We picked it up recently. Here are the active PRs
>>> related to this action. Would really appreciate any feedback and review:
>>>
>>>    - PR to add CopyTable action:
>>>    https://github.com/apache/iceberg/pull/10024
>>>    - PR to add CheckSnapshotIntegrity action:
>>>    https://github.com/apache/iceberg/pull/10642
>>>    - PR to add RemoveExpiredFiles action:
>>>    https://github.com/apache/iceberg/pull/10643
>>>
>>> Here is a google doc with more details to clarify the goals and
>>> approach:
>>> https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing
>>>
>>> Yufei
>>>
>>
>>
>

Reply via email to