Replies inline.

On Mon, Jun 17, 2019 at 10:59 AM Erik Wright <erik.wri...@shopify.com.invalid> wrote:

>> Because snapshots and versions are basically the same idea, we don’t need
>> both. If we were to add versions, they should replace snapshots.
>>
>
> I'm a little confused by this. You mention sequence numbers quite a bit,
> but then say that we should not introduce versions. From my understanding,
> sequence numbers and versions are essentially identical. What has changed
> is where they are encoded (in file metadata vs. in snapshot metadata). It
> would also seem necessary to keep them distinct from snapshots in order to
> be able to collapse/compact files older than N (oldest valid version)
> without losing the ability to incrementally apply the changes from N + 1.
>
Snapshots are versions of the table, but Iceberg does not use a
monotonically increasing identifier to track them. Your proposal added
table versions in addition to snapshots to get a monotonically increasing
ID, but my point is that we don’t need to duplicate the “table state at
some time” idea by adding a layer of versions inside snapshots. We just
need a monotonically increasing ID associated with each snapshot. That ID is
what I’m calling a sequence number. Each snapshot will have a new sequence
number used to track the order of files for applying changes that are
introduced by that snapshot.
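
Roughly, as a sketch (the names here are illustrative, not a spec for the
metadata change):

    // Illustrative only: not the actual metadata classes.
    class Snapshot {
      long snapshotId;       // unique ID, as today
      long sequenceNumber;   // new: monotonically increasing, assigned at commit time
      String operation;      // append / delete / overwrite / replace
    }

    class DataFileEntry {
      String path;
      long sequenceNumber;   // sequence of the snapshot that added or rewrote the file
    }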

With sequence numbers embedded in metadata, you can still compact. If you
have a data file with sequence number N that is then updated by delete
files at sequence numbers N+2 and N+3, you can compact the changes from N
and N+2 by writing the compacted file with sequence number N+2. Rewriting
files in N+2 is still possible, just like in the scheme you suggested,
where the version N+2 would be rewritten.
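
To make that concrete (the strict comparison here is how I read the
example above, not a settled rule):

    // A delete file applies to a data file iff the delete's sequence number
    // is greater than the data file's sequence number.
    static boolean applies(long deleteSeq, long dataSeq) {
      return deleteSeq > dataSeq;
    }

    // Data file at N, delete files at N+2 and N+3. Compact data(N) with
    // delete(N+2) and write the result with sequence number N+2:
    //   applies(N+2, N+2) -> false  (already folded into the new file)
    //   applies(N+3, N+2) -> true   (still applied at read time)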

A key difference is that because the approach using sequence numbers
doesn’t require keeping state around in the root metadata file, tables are
not required to rewrite or compact data files just to keep the table
functioning. If the data won’t be read, then why bother changing it?

> Because I am having a really hard time seeing how things would work without
> these sequence numbers (versions), and given their prominence in your
> reply, my remaining comments assume they are present, distinct from
> snapshots, and monotonically increasing.
>
Yes, a sequence number would be added to the existing metadata. Each
snapshot would produce a new sequence number.

>> Another way of thinking about it: if versions are stored as data/delete
>> file metadata and not in the table metadata file, then the complete history
>> can be kept.
>
>
> Sure. Although version expiry would still be an important process to
> improve read performance and reduce storage overhead.
>
Maybe. If it is unlikely that the data will be read, the best option may be
to leave the original data file and associated delete file.

> Based on your proposal, it would seem that we can basically collapse any
> given data file with its corresponding deletions. It's important to
> consider how this will affect incremental readers. The following make sense
> to me:
>
>    1. We should track an oldest-valid-version in the table metadata. No
>    changes newer than this version should be collapsed, as a reader who has
>    consumed up to that version must be able to continue reading.
>
Table versions are tracked by snapshots, which are kept for a window of
time. The oldest valid version is the oldest snapshot in the table. New
snapshots can always compact or rewrite older data, and incremental
consumers ignore those changes by skipping snapshots whose operation is
replace, because the data has not changed.

Incremental consumers operate on append, delete, and overwrite snapshots
and consume data files that are added or deleted. With delete files, they
would also consume the delete files that were added in a snapshot.
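
As a rough sketch of that consumer loop (accessor names are illustrative,
not the exact API):

    // Illustrative only: process the changes from each new snapshot,
    // skipping replace snapshots because they don't change the data.
    for (Snapshot s : snapshotsSince(lastProcessedSnapshotId)) {
      if ("replace".equals(s.operation)) {
        continue;  // compaction/rewrite only, nothing new to consume
      }
      // append / delete / overwrite
      process(addedDataFiles(s), deletedDataFiles(s), addedDeleteFiles(s));
    }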

It is unlikely that sequence numbers would be used by incremental consumers
because the changes are already available for each snapshot. This
distinction is the main reason why I’m calling these sequence numbers and
not “versions”. The sequence number for a file indicates which delete files
need to be applied; when a file is rewritten, the new copy gets a different
sequence number, as you describe just below.


>    1. Any data file strictly older than the oldest-valid-version is
>    eligible for delete collapsing. During delete collapsing, all deletions up
>    to version N (any version <= the oldest-valid-version) are applied to the
>    file. *The newly created file should be assigned a sequence number of
>    N.*
>
First, a data file is kept until the last snapshot that references it is
expired. That enables incremental consumers without a restriction on
rewriting the file in a later replace snapshot.

Second, I agree with your description of rewriting the file. That’s what I
was trying to say above.
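
A small worked example of what I mean (numbers are illustrative): snapshot
S5 (sequence 5) adds data file A; snapshot S7 (sequence 7) adds a delete
file that applies to A because 7 > 5; a later replace snapshot rewrites A
with that delete applied and assigns the new copy sequence number 7, so the
sequence-7 delete no longer applies to it. A itself is only removed from
storage once every snapshot that references it has expired, so incremental
consumers can still read the older changes until then.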


>    1. Any delete file that has been completely applied is eligible for
>    expiry. A delete file is completely applied if its sequence number is <=
>    all data files to which it may apply (based on partitioning data).
>
Yes, I agree.
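
For example, the check could look something like this (names are
illustrative, not an API proposal):

    // A delete file is fully applied, and safe to drop, once no live data
    // file it could apply to has a smaller sequence number.
    static boolean fullyApplied(long deleteSeq, Iterable<Long> candidateDataFileSeqs) {
      for (long dataSeq : candidateDataFileSeqs) {
        if (dataSeq < deleteSeq) {
          return false;  // an older data file still needs this delete
        }
      }
      return true;
    }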


>    1. Any data file equal to or older than the oldest-valid-version is
>    eligible for compaction. During compaction, some or all data files up to
>    version N (any version <= the oldest-valid version) are selected. There may
>    not be any deletions applicable to any of the selected data files. The
>    selected data files are read, rewritten as desired (partitioning,
>    compaction, sorting, etc.). The new files are included in the new
>    snapshot (*with sequence number N*) while the old files are dropped.
>
Yes, I agree.


>    1. Delete collapsing, delete expiry, and compaction may be performed
>    in a single commit. The oldest-valid-version may be advanced during this
>    process as well. The outcome must logically be consistent with applying
>    these steps independently.
>
Because snapshots (table versions) are independent, they can always be
expired. Incremental consumers must process a new snapshot before it is old
enough to expire. The time period after which snapshots are expired is up
to the process running ExpireSnapshots.
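
For example, with the existing API (the 7-day window is just an
illustration, the retention period is up to the operator):

    // Remove snapshots older than the chosen retention window; data files
    // no longer referenced by any remaining snapshot are cleaned up.
    table.expireSnapshots()
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
        .commit();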

> Important to note in the above is that, during collapsing/compaction, new
> files may be emitted with a version number N that are older than the most
> recent version. This is important to ensure that all deltas newer than N
> are appropriately applied, and that incremental readers are able to
> continue processing the dataset.
>
Because incremental consumers operate using snapshots and not sequence
numbers, I think this is decoupled in the approach I’m proposing.

>> It is also compatible with file-level deletes used to age off old data
>> (e.g. delete where day(ts) < '2019-01-01').
>
>
> This particular operation is not compatible with incremental consumption.
>
I disagree. This just encodes deletes for an entire file’s worth of rows in
the manifest file instead of in a delete file. Current metadata tracks when
files are deleted and we would also associate a sequence number with those
changes, if we needed to. Incremental consumers will work like they do
today: they will consume snapshots that contain changes
(append/delete/overwrite) and will get the files that were changed in that
snapshot.
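
As a sketch using the existing delete API (the column name and literal are
illustrative; this stays a metadata-only operation only when the predicate
matches whole files, e.g. when it lines up with day partitions):

    // Age off whole data files whose rows all match the filter. No delete
    // file is written; the manifests just record the files as deleted.
    table.newDelete()
        .deleteFromRowFilter(Expressions.lessThan("ts", "2019-01-01T00:00:00"))
        .commit();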

Also, being able to cleanly age off data is a hard requirement for Iceberg.
We all need to comply with data policies with age limits and we need to
ensure that we can cleanly apply those changes.

> One must still generate a deletion that is associated with a new sequence
> number. I could see a reasonable workaround where, in order to delete an
> entire file (insertion file `x.dat`, sequence number X) one would reference
> the same file a second time (deletion file `x.dat`, sequence number Y > X).
> Since an insertion file and a deletion file have compatible formats, there
> is no need to rewrite this file simply to mark each row in it as deleted. A
> savvy consumer would be able to tell that this is a whole-file deletion
> while a naïve consumer would be able to apply them using the basic
> algorithm.
>
With file-level deletes, the files are no longer processed by readers,
unless an incremental reader needs to read all of the deletes from the file.

> If it seems like we are roughly on the same page I will take a stab at
> updating that document to go to the same level of detail that it does now
> but use the sequence-number approach.
>
Sure, that sounds good.
-- 
Ryan Blue
Software Engineer
Netflix
