Hi Dan,

> The paths are not unique across snapshots 1/2, but within each snapshot
> they are.

This is exactly the contract I was trying to determine. It seems like it
could potentially be clarified in the specification (again, apologies if I
missed it). I opened a PR [1] to try to add language around these
requirements; hopefully this is a helpful contribution.

Thanks,
Micah

[1] https://github.com/apache/iceberg/pull/4272

On Fri, Mar 4, 2022 at 4:41 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

> I think in the situation you're demonstrating, the manifests are separated
> across two separate snapshots.
>
> Here's an example:
>
> create table t1 (s string);
> insert into t1 values ('foo');   -- snapshot 0, manifest-list with 1
>                                  -- manifest pointing to file A (ADDED)
> insert into t1 values ('bar');   -- snapshot 1, manifest-list with 2
>                                  -- manifests pointing to file A (ADDED), file B (ADDED)
> delete from t1 where s = 'foo';  -- snapshot 2, manifest-list with 2
>                                  -- manifests pointing to file A (DELETED), file B (ADDED)
>
> The paths are not unique across snapshots 1/2, but within each snapshot
> they are.
>
> Now in the same case if the data was in the same file, you would have a
> rewrite of the datafile like this (assuming no row-level deletes):
>
> create table t1 (s string);
> insert into t1 values ('foo'), ('bar');  -- snapshot 0, manifest-list with
>                                          -- 1 manifest pointing to file A (ADDED)
> delete from t1 where s = 'foo';          -- snapshot 1, manifest-list with 1
>                                          -- manifest pointing to file A (DELETED) + file B (ADDED)
>
> I hope I'm understanding your example correctly, but let me know if I'm
> off track here.
>
> Thanks,
> Dan
>
> On Fri, Mar 4, 2022 at 2:23 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Dan,
>> Thanks for the quick reply.
>>
>>> For #2, the answer follows mostly because if the answer to #1 holds,
>>> then yes the pairwise intersection of entries in the manifest files of a
>>> given snapshot is empty.
>>
>> Just to be pedantic, even with unique file names.
>> It seems one could construct a snapshot as:
>> Manifest 1: Add File A
>> Manifest 2: Delete File A
>>
>> From your answer it sounds like this is unexpected and readers generally
>> don't try to reconcile Deletes and Adds?
>>
>> Thanks,
>> Micah
>>
>> On Fri, Mar 4, 2022 at 2:10 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Hey Micah,
>>>
>>> For #1, I don't believe the spec clearly calls out that all data/delete
>>> files must be unique, but the requirements for cleanup would be violated in
>>> certain cases if you had the same file referenced in multiple manifests.
>>> In practice, the best way to ensure data correctness and metadata
>>> consistency is to ensure that all referenced files have unique locations
>>> and that those locations do not get overwritten.
>>>
>>> For #2, the answer follows mostly because if the answer to #1 holds,
>>> then yes the pairwise intersection of entries in the manifest files of a
>>> given snapshot is empty.
>>>
>>> The Java library does perform some checks to prevent a file from being
>>> added to the same manifest multiple times, but I don't think that
>>> extends to all possible ways of adding files. So it may be possible, but
>>> not a good idea.
>>>
>>> Sam might know if there's a way to add a nav for the format page (it is
>>> a little difficult to navigate at the moment).
>>>
>>> -Dan
>>>
>>> On Thu, Mar 3, 2022 at 4:49 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Iceberg Dev,
>>>> I tried searching for it in the specification but couldn't find
>>>> anything explicit:
>>>>
>>>> 1. Is it assumed that all data files and delete files will always have
>>>> globally unique names in a table?
>>>> 2. Is it expected that the pairwise intersection of all manifest files
>>>> in a snapshot is empty (i.e. for any given data file it has exactly zero or
>>>> 1 entries across all manifest files in a snapshot)?
>>>>
>>>> I think the uniqueness of both can maybe be inferred by this sentence
>>>> (but I'm not 100% sure):
>>>>
>>>>> When a file is replaced or deleted from the dataset, its manifest
>>>>> entry fields store the snapshot ID in which the file was deleted and
>>>>> status 2 (deleted). The file may be deleted from the file system when
>>>>> the snapshot in which it was deleted is garbage collected, assuming
>>>>> that older snapshots have also been garbage collected [1].
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> P.S. Is there a way to add a table of contents to the specification? I
>>>> might be missing it but I don't see one rendered at:
>>>> https://iceberg.apache.org/spec/
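
The contract discussed in this thread (a file path may recur across
snapshots, but should appear at most once across the manifests of any single
snapshot) can be sketched as a small check. This is only an illustrative
model, assuming a snapshot is a list of manifests and a manifest is a list of
entries; `Entry`, `duplicate_paths`, and that layout are hypothetical names
made up for this sketch, not part of the Iceberg spec or the Java library.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical, simplified model for illustration only; these are not
# Iceberg library classes.
@dataclass(frozen=True)
class Entry:
    path: str
    status: str  # "ADDED", "EXISTING", or "DELETED"


def duplicate_paths(manifests):
    """Return the paths that have more than one entry across a snapshot's manifests."""
    counts = Counter(entry.path for manifest in manifests for entry in manifest)
    return sorted(path for path, n in counts.items() if n > 1)


# Dan's first example: each insert produced its own manifest.
snapshot_1 = [[Entry("A", "ADDED")], [Entry("B", "ADDED")]]
snapshot_2 = [[Entry("A", "DELETED")], [Entry("B", "ADDED")]]

# The shape Micah asked about: file A added and deleted within one snapshot.
suspect = [[Entry("A", "ADDED")], [Entry("A", "DELETED")]]

print(duplicate_paths(snapshot_1))  # []
print(duplicate_paths(snapshot_2))  # []
print(duplicate_paths(suspect))     # ['A']
```

Paths A and B both occur in snapshots 1 and 2, which is fine under this
reading; only the last case, where one snapshot carries two entries for the
same path, would violate the per-snapshot uniqueness Dan describes.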