Hi Ryan,

“Tables must be manually upgraded to version 2 in order to use any of the 
metadata changes we are making.” If I understand correctly, for an existing 
Iceberg table in v1, we have to run some CLI/script to rewrite the metadata.

“Next, we've added sequence numbers and the proposed inheritance scheme to v2, 
along with tests to ensure that v1 is written without sequence numbers and that 
when reading v1 metadata, the sequence numbers are all 0.” To me, this means a 
v2 reader should be able to read v1 table metadata. If so, the manual upgrade 
above would not be required; we would only need to use a v2 reader on the v1 
table.

However, if a table has been written in v1 and we want to save it as v2, I 
expect that only the metadata will be rewritten into v2, and that the v1 
metadata will be vacuumed once the v2 rewrite succeeds.

Is my understanding correct?

Thanks!

Miao
From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, 
"rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, May 5, 2020 at 5:03 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: [DISCUSS] Changes for row-level deletes

Hi, everyone,

I know several people that are planning to attend the sync tomorrow are 
interested in the row-level delete work, so I wanted to share some of the 
progress and my current thinking ahead of time.

The codebase now supports a new version number, 2. Tables must be manually 
upgraded to version 2 in order to use any of the metadata changes we are 
making; v1 readers cannot read v2 tables. When a write takes place, the version 
number is now passed to the manifest writer, manifest list writer, etc. and the 
right schema for the table's current version is used. We've also frozen the v1 
schemas and added wrappers to ensure that even as the internal classes, like 
DataFile, evolve, the exact same data is written to v1.
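
To make the version dispatch concrete, here is a minimal sketch in Java of the
pattern (class and method names are illustrative, not the actual Iceberg
internals): the format version selects the schema used at write time, and a
view type pins v1 output to the frozen v1 fields.

interface DataFile {
  String path();
  long recordCount();
  // the internal model can grow new fields over time
}

// Exposes only the frozen v1 fields, no matter how DataFile evolves.
class V1DataFileView implements DataFile {
  private final DataFile internal;

  V1DataFileView(DataFile internal) {
    this.internal = internal;
  }

  @Override
  public String path() {
    return internal.path();
  }

  @Override
  public long recordCount() {
    return internal.recordCount();
  }
}

class ManifestWriters {
  static void write(int formatVersion, DataFile file) {
    if (formatVersion == 1) {
      writeWithV1Schema(new V1DataFileView(file)); // frozen v1 schema
    } else if (formatVersion == 2) {
      writeWithV2Schema(file); // v2 schema, including the new metadata
    } else {
      throw new IllegalArgumentException("Unsupported version: " + formatVersion);
    }
  }

  private static void writeWithV1Schema(DataFile file) { /* v1 Avro schema */ }

  private static void writeWithV2Schema(DataFile file) { /* v2 Avro schema */ }
}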

Next, we've added sequence numbers and the proposed inheritance scheme to v2, 
along with tests to ensure that v1 is written without sequence numbers and that 
when reading v1 metadata, the sequence numbers are all 0. This gives us the 
ability to track "when" a row-level delete occurred in a v2 table.
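
As a rough illustration of that defaulting rule, here is a sketch in Java (the
class and parameter names are hypothetical):

class SequenceNumbers {
  // v2 entries may carry an explicit sequence number; null means "inherit"
  static long effective(int formatVersion, Long entrySeq, long manifestSeq) {
    if (formatVersion < 2) {
      // v1 metadata has no sequence numbers, so every entry reads as 0
      return 0L;
    }
    // v2: an explicit value wins; otherwise inherit from the manifest
    return entrySeq != null ? entrySeq : manifestSeq;
  }
}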

The next steps are to start making larger changes to metadata files.

One change that I've been considering is getting rid of manifest_entry. In v1, 
a manifest stored a manifest_entry that wrapped a data_file. The intent was to 
separate data that API users needed to supply -- fields in data_file -- from 
data that was tracked internally by Iceberg -- the snapshot_id and status 
fields of manifest_entry. If we want to combine these so that a manifest stores 
one top-level data_file struct, then now is the time to make that change. I've 
prototyped this in 
#963 <https://github.com/apache/incubator-iceberg/pull/963>.
The benefit is that the schema is flatter so we wouldn't need two metadata 
tables (entries and files). The main drawback is that we aren't going to stop 
using v1 tables, so we would effectively have two different manifest schemas 
instead of v2 as an evolution of v1. I'd love to hear more opinions on whether 
to do this. I'm leaning toward not merging the two.
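
For reference, here is the v1 nesting in miniature (fields trimmed and names
approximated as plain Java classes, not the Avro definitions):

class DataFileStruct {        // data_file: fields supplied through the API
  String filePath;
  long recordCount;
  long fileSizeInBytes;
}

class ManifestEntryStruct {   // manifest_entry: fields Iceberg tracks itself
  int status;                 // EXISTING / ADDED / DELETED
  long snapshotId;
  DataFileStruct dataFile;    // the wrapped data_file
}

Merging would fold status and snapshot_id into a single top-level data_file
struct.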

Another change is to start adding tracking fields for delete files and updating 
the APIs. The metadata for this is fairly simple: an enum that stores whether 
the file is data, position deletes, or equality deletes. The main decision 
point is whether to allow mixing data files and delete files together in 
manifests. I don't think that we should allow manifests with both delete files 
and data files. The reason is job planning: we want to start emitting splits 
immediately so that we can stream them, instead of holding them all in memory. 
That means we need some way to guarantee that we know all of the delete files 
to apply to a data file before we encounter the data file.
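
A minimal sketch of that tracking field (the enum and method names are mine,
for illustration only):

enum FileContent {
  DATA,
  POSITION_DELETES,
  EQUALITY_DELETES
}

class TrackedFile {
  // files written by v1 never set this, so readers default it to DATA
  FileContent content = FileContent.DATA;

  boolean isDelete() {
    return content != FileContent.DATA;
  }
}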

OpenInx suggested sorting by sequence number to see delete files before data 
files, but it still requires holding all splits in memory in the worst case due 
to overlapping sequence number ranges. I think Iceberg should plan a scan in 
two phases: one to find matching delete files (held in memory) and one to find 
matching data files. That solves the problem of having all deletes available so 
a split can be immediately emitted, and also allows parallelizing both phases 
without coordination across threads.
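
Here is a compact sketch of the two-phase idea (the types and the
applicability rule below are simplified assumptions, not a final design):

import java.util.List;
import java.util.stream.Collectors;

class TwoPhasePlanner {
  static class FileRef {
    final String path;
    final long sequenceNumber;

    FileRef(String path, long sequenceNumber) {
      this.path = path;
      this.sequenceNumber = sequenceNumber;
    }
  }

  static class Split {
    final FileRef dataFile;
    final List<FileRef> deletes;

    Split(FileRef dataFile, List<FileRef> deletes) {
      this.dataFile = dataFile;
      this.deletes = deletes;
    }
  }

  // phase 1 result: every matching delete file, held in memory
  private final List<FileRef> deleteFiles;

  TwoPhasePlanner(List<FileRef> deleteFiles) {
    this.deleteFiles = deleteFiles;
  }

  // phase 2: called for each matching data file as manifests are scanned,
  // so the split is emitted immediately instead of being buffered
  Split emit(FileRef dataFile) {
    List<FileRef> applicable = deleteFiles.stream()
        // assumed rule: a delete applies to data committed at or before it
        .filter(d -> d.sequenceNumber >= dataFile.sequenceNumber)
        .collect(Collectors.toList());
    return new Split(dataFile, applicable);
  }
}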

For the two-phase approach, mixing delete files and data files in a manifest 
would require reading that manifest twice, once in each phase. I think it makes 
the most sense to keep delete files and data files in separate manifests. But 
the trade-off is that Iceberg will need to track the content of a manifest 
(deletes or data) and perform actions on separate manifest groups.
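
One way to model that content tracking (names illustrative): a content field
on each manifest's metadata lets the planner select delete manifests in phase
one and data manifests in phase two, so no manifest is read twice.

enum ManifestContent { DATA, DELETES }

class ManifestRef {
  final String path;
  final ManifestContent content;

  ManifestRef(String path, ManifestContent content) {
    this.path = path;
    this.content = content;
  }
}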

Also, because with separate delete and data manifests we _could_ use separate 
manifest schemas, I went through and wrote out a schema for a delete file 
manifest. That schema was so similar to the current data file schema that I 
think it's simpler to use the same one for both.

In summary, here are the things that we need to decide and what I think we 
should do:

* Merge manifest_entry and data_file? I think we should not, to avoid 
additional complexity.
* How should planning with delete files work? The two-phase approach is the 
only one I think is viable.
* Mix delete files and data files in manifests? I think we should not, to 
support the two-phase planning approach.
* If delete files and data files are separate, should manifests use the same 
schema? Yes, because it is simpler.

Let's plan on talking about these questions in tomorrow's sync. And if you have 
other topics, please send them to me!

rb

--
Ryan Blue
Software Engineer
Netflix
