Hi Dan,
>
> The paths are not unique across snapshots 1/2, but within each snapshot
> they are.


This is exactly the contract I was trying to determine.  It seems like it
could be clarified in the specification (again, apologies if I missed it).
I opened a PR [1] to add language around these requirements; hopefully it
is a helpful contribution.
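
To make the contract concrete, here is a minimal Python sketch of the check
I have in mind. It is only an illustration: the snapshot/manifest model
(lists of (status, path) tuples) and the helper names duplicate_paths and
live_files are my own assumptions, not the Iceberg API or the spec's wording.

```python
from collections import Counter

# Hypothetical in-memory model (NOT the Iceberg API): a snapshot is a list of
# manifests, and each manifest is a list of (status, file_path) entries.
ADDED, DELETED = "ADDED", "DELETED"

# Snapshots 1 and 2 from Dan's first example.
snapshot_1 = [[(ADDED, "A")], [(ADDED, "B")]]
snapshot_2 = [[(DELETED, "A")], [(ADDED, "B")]]

# The case I was being pedantic about: the same path in two manifests
# of a single snapshot.
ambiguous = [[(ADDED, "A")], [(DELETED, "A")]]


def duplicate_paths(snapshot):
    """Return paths with more than one entry across a snapshot's manifests."""
    counts = Counter(path for manifest in snapshot for _, path in manifest)
    return sorted(path for path, n in counts.items() if n > 1)


def live_files(snapshot):
    """Return paths of entries that are not DELETED, without reconciling
    entries for the same path across manifests."""
    return sorted(
        path
        for manifest in snapshot
        for status, path in manifest
        if status != DELETED
    )


# Paths repeat across snapshots 1 and 2, but never within one snapshot.
print(duplicate_paths(snapshot_1))  # -> []
print(duplicate_paths(snapshot_2))  # -> []
print(live_files(snapshot_2))       # -> ['B']

# Within a single snapshot the duplicate is detectable, and a reader that
# does not reconcile entries would still treat file A as live.
print(duplicate_paths(ambiguous))   # -> ['A']
print(live_files(ambiguous))        # -> ['A']
```

If the per-snapshot uniqueness requirement holds, duplicate_paths is always
empty and readers never need to reconcile entries for the same path.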

Thanks,
Micah

[1] https://github.com/apache/iceberg/pull/4272

On Fri, Mar 4, 2022 at 4:41 PM Daniel Weeks <daniel.c.we...@gmail.com>
wrote:

> I think in the situation you're demonstrating, the manifests are separated
> across two separate snapshots.
>
> Here's an example:
>
> create table t1 (s string);
> insert into t1 values ('foo');
>   -- snapshot 0, manifest-list with 1 manifest pointing to file A (ADDED)
> insert into t1 values ('bar');
>   -- snapshot 1, manifest-list with 2 manifests pointing to file A (ADDED),
>   -- file B (ADDED)
> delete from t1 where s = 'foo';
>   -- snapshot 2, manifest-list with 2 manifests pointing to file A (DELETED),
>   -- file B (ADDED)
>
> The paths are not unique across snapshots 1/2, but within each snapshot
> they are.
>
> Now in the same case if the data was in the same file, you would have a
> rewrite of the datafile like this (assuming no row-level deletes):
>
> create table t1 (s string);
> insert into t1 values ('foo'), ('bar');
>   -- snapshot 0, manifest-list with 1 manifest pointing to file A (ADDED)
> delete from t1 where s = 'foo';
>   -- snapshot 1, manifest-list with 1 manifest pointing to file A (DELETED)
>   -- + file B (ADDED)
>
> I hope I'm understanding your example correctly, but let me know if I'm
> off track here.
>
> Thanks,
> Dan
>
>
>
> On Fri, Mar 4, 2022 at 2:23 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Dan,
>> Thanks for the quick reply.
>>
>>
>>> For #2, the answer follows mostly because if the answer to #1 holds,
>>> then yes the pairwise intersection of entries in the manifest files of a
>>> given snapshot is empty.
>>
>>
>> Just to be pedantic: even with unique file names, it seems one could
>> construct a snapshot as:
>> Manifest 1: Add File A
>> Manifest 2: Delete File A
>>
>> From your answer it sounds like this is unexpected and readers generally
>> don't try to reconcile Deletes and Adds?
>>
>> Thanks,
>> Micah
>>
>> On Fri, Mar 4, 2022 at 2:10 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Hey Micah,
>>>
>>> For #1, I don't believe the spec clearly calls out that all data/delete
>>> files must be unique, but the requirements for cleanup would be violated in
>>> certain cases if you had the same file referenced in multiple manifests.
>>> In practice, the best way to ensure data correctness and metadata
>>> consistency is to ensure that all referenced files have unique locations
>>> and that those locations do not get overwritten.
>>>
>>> For #2, the answer follows mostly because if the answer to #1 holds,
>>> then yes the pairwise intersection of entries in the manifest files of a
>>> given snapshot is empty.
>>>
>>> The Java library does perform some checks to prevent a file from being
>>> added to the same manifest multiple times, but I don't think that
>>> extends to all possible ways of adding files.  So it may be possible, but
>>> not a good idea.
>>>
>>> Sam might know if there's a way to add a nav for the format page (it is
>>> a little difficult to navigate at the moment).
>>>
>>> -Dan
>>>
>>> On Thu, Mar 3, 2022 at 4:49 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Iceberg Dev,
>>>> I tried searching for it in the specification but couldn't find
>>>> anything explicit:
>>>>
>>>> 1.  Is it assumed that all data files and delete files will always have
>>>> globally unique names in a table?
>>>> 2.  Is it expected that the pairwise intersection of all manifest files
>>>> in a snapshot is empty (i.e., any given data file has at most one
>>>> entry across all manifest files in a snapshot)?
>>>>
>>>> I think the uniqueness of both can maybe be inferred from this sentence
>>>> (but I'm not 100% sure):
>>>>
>>>>> When a file is replaced or deleted from the dataset, its manifest
>>>>> entry fields store the snapshot ID in which the file was deleted and
>>>>> status 2 (deleted). The file may be deleted from the file system when
>>>>> the snapshot in which it was deleted is garbage collected, assuming
>>>>> that older snapshots have also been garbage collected [1].
>>>>
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>> P.S. Is there a way to add a table of contents to the specification?  I
>>>> might be missing it but I don't see one rendered at:
>>>> https://iceberg.apache.org/spec/
>>>>
>>>
