Yeah, I follow what you're saying now and it's an excellent observation.  I
agree that a clarification would help because there are a number of places
where the behavior is somewhat implied, but not clearly stated.

I can't think of a case where you would want to have the same file either
duplicated in a manifest or across manifests with the same status (and I
can think of some problematic situations that can arise from that).
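
One such situation, as a minimal sketch only: the cleanup model below is a
deliberate simplification and an assumption, not the library's actual
expiration logic, and `Entry`, `removable_on_expiry`, `live_paths`, and the
file paths are hypothetical names. If the same path is DELETED in an expired
manifest but still live in a current one, a cleanup that trusts the DELETED
entry would remove a file that readers still need.

```python
from typing import Iterable, NamedTuple, Set

# Manifest entry status codes (0 = existing, 1 = added, 2 = deleted).
EXISTING, ADDED, DELETED = 0, 1, 2


class Entry(NamedTuple):
    """Hypothetical, simplified stand-in for a manifest entry."""
    status: int
    file_path: str


def removable_on_expiry(expired: Iterable[Iterable[Entry]]) -> Set[str]:
    """Paths a naive cleanup would delete: DELETED entries in expired manifests."""
    return {e.file_path for m in expired for e in m if e.status == DELETED}


def live_paths(current: Iterable[Iterable[Entry]]) -> Set[str]:
    """Paths still referenced as live (ADDED or EXISTING) by current manifests."""
    return {e.file_path for m in current for e in m if e.status != DELETED}


# The same (hypothetical) path appears as DELETED in an expired manifest
# and as ADDED in a current one.
expired = [[Entry(DELETED, "data/A.parquet")]]
current = [[Entry(ADDED, "data/A.parquet")], [Entry(EXISTING, "data/B.parquet")]]

# Any overlap is a file that cleanup would remove while it is still live.
at_risk = removable_on_expiry(expired) & live_paths(current)
print(at_risk)
```

With unique, never-reused file locations the intersection is always empty,
which is the property the cleanup requirements rely on.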

The duplication we have seen has typically been caused by implementation
issues, which is why there are some checks in place.
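
To illustrate the kind of check meant here, a minimal sketch follows. It is
not the Java library's implementation; `Entry` and `duplicate_entries` are
hypothetical names, and a manifest is modelled simply as a list of entries.
It reports any file path that appears more than once with the same status
across the manifests of a single snapshot.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple

# Manifest entry status codes (0 = existing, 1 = added, 2 = deleted).
EXISTING, ADDED, DELETED = 0, 1, 2


class Entry(NamedTuple):
    """Hypothetical, simplified stand-in for a manifest entry."""
    status: int
    file_path: str


def duplicate_entries(manifests: Iterable[Iterable[Entry]]) -> List[Tuple[str, int]]:
    """Return (file_path, status) pairs that occur more than once
    within or across the given manifests of one snapshot."""
    counts = Counter((e.file_path, e.status) for m in manifests for e in m)
    return sorted(key for key, n in counts.items() if n > 1)


# File A is (incorrectly) ADDED in two manifests of the same snapshot.
m1 = [Entry(ADDED, "data/A.parquet")]
m2 = [Entry(ADDED, "data/A.parquet"), Entry(ADDED, "data/B.parquet")]
print(duplicate_entries([m1, m2]))
```

Keying on the (path, status) pair matches the case described above: the same
file with the same status, either within one manifest or across several.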

I'll follow up on this,

Thanks Micah!



On Fri, Mar 4, 2022 at 10:28 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Dan,
>>
>> The paths are not unique across snapshots 1/2, but within each snapshot
>> they are.
>
>
> This is exactly the contract I was trying to determine.  It seems like it
> could potentially be clarified in the specification (again, apologies if I
> missed it). I opened a PR [1] to try to add language around these
> requirements; hopefully this is a helpful contribution.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/iceberg/pull/4272
>
> On Fri, Mar 4, 2022 at 4:41 PM Daniel Weeks <daniel.c.we...@gmail.com>
> wrote:
>
>> I think in the situation you're demonstrating, the manifests are
>> separated across two separate snapshots.
>>
>> Here's an example:
>>
>> create table t1 (s string);
>> insert into t1 values ('foo');
>>   -- snapshot 0: manifest-list with 1 manifest pointing to
>>   -- file A (ADDED)
>> insert into t1 values ('bar');
>>   -- snapshot 1: manifest-list with 2 manifests pointing to
>>   -- file A (ADDED), file B (ADDED)
>> delete from t1 where s = 'foo';
>>   -- snapshot 2: manifest-list with 2 manifests pointing to
>>   -- file A (DELETED), file B (ADDED)
>>
>> The paths are not unique across snapshots 1/2, but within each snapshot
>> they are.
>>
>> Now in the same case if the data was in the same file, you would have a
>> rewrite of the datafile like this (assuming no row-level deletes):
>>
>> create table t1 (s string);
>> insert into t1 values ('foo'), ('bar');
>>   -- snapshot 0: manifest-list with 1 manifest pointing to
>>   -- file A (ADDED)
>> delete from t1 where s = 'foo';
>>   -- snapshot 1: manifest-list with 1 manifest pointing to
>>   -- file A (DELETED) + file B (ADDED)
>>
>> I hope I'm understanding your example correctly, but let me know if I'm
>> off track here.
>>
>> Thanks,
>> Dan
>>
>>
>>
>> On Fri, Mar 4, 2022 at 2:23 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Dan,
>>> Thanks for the quick reply.
>>>
>>>
>>>> For #2, the answer follows mostly because if the answer to #1 holds,
>>>> then yes the pairwise intersection of entries in the manifest files of a
>>>> given snapshot is empty.
>>>
>>>
>>> Just to be pedantic: even with unique file names, it seems one could
>>> construct a snapshot as:
>>> Manifest 1: Add File A
>>> Manifest 2: Delete File A
>>>
>>> From your answer it sounds like this is unexpected and readers generally
>>> don't try to reconcile Deletes and Adds?
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Fri, Mar 4, 2022 at 2:10 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Hey Micah,
>>>>
>>>> For #1, I don't believe the spec clearly calls out that all data/delete
>>>> files must be unique, but the requirements for cleanup would be violated in
>>>> certain cases if you had the same file referenced in multiple manifests.
>>>> In practice, the best way to ensure data correctness and metadata
>>>> consistency is to ensure that all referenced files have unique locations
>>>> and that those locations do not get overwritten.
>>>>
>>>> For #2, the answer follows mostly because if the answer to #1 holds,
>>>> then yes the pairwise intersection of entries in the manifest files of a
>>>> given snapshot is empty.
>>>>
>>>> The java library does perform some checks to prevent a file from being
>>>> added to the same manifest multiple times, but I don't think that
>>>> extends to all possible ways of adding files.  So it may be possible, but
>>>> not a good idea.
>>>>
>>>> Sam might know if there's a way to add a nav for the format page (it is
>>>> a little difficult to navigate at the moment).
>>>>
>>>> -Dan
>>>>
>>>> On Thu, Mar 3, 2022 at 4:49 PM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Iceberg Dev,
>>>>> I tried searching for it in the specification but couldn't find
>>>>> anything explicit:
>>>>>
>>>>> 1.  Is it assumed that all data files and delete files will always
>>>>> have globally unique names in a table?
>>>>> 2.  Is it expected that the pairwise intersection of all manifest
>>>>> files in a snapshot is empty (i.e., any given data file has zero or
>>>>> one entries across all manifest files in a snapshot)?
>>>>>
>>>>> I think the uniqueness of both can maybe be inferred by this sentence
>>>>> (but I'm not 100% sure):
>>>>>
>>>>>> When a file is replaced or deleted from the dataset, its manifest
>>>>>> entry fields store the snapshot ID in which the file was deleted and
>>>>>> status 2 (deleted). The file may be deleted from the file system when
>>>>>> the snapshot in which it was deleted is garbage collected, assuming
>>>>>> that older snapshots have also been garbage collected [1].
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>>
>>>>> P.S. Is there a way to add a table of contents to the specification?
>>>>> I might be missing it but I don't see one rendered at:
>>>>> https://iceberg.apache.org/spec/
>>>>>
>>>>
