Hi Ryan,

“Tables must be manually upgraded to version 2 in order to use any of the metadata changes we are making”

If I understand correctly, for an existing Iceberg table in v1, we would have to run some CLI/script to rewrite the metadata.
“Next, we've added sequence numbers and the proposed inheritance scheme to v2, along with tests to ensure that v1 is written without sequence numbers and that when reading v1 metadata, the sequence numbers are all 0.”

To me, this means a v2 reader should be able to read v1 table metadata. Therefore, the step above would not be required; we would only need to use a v2 reader on the v1 table. However, if a table has been written in v1 and we want to save it as v2, I expect that only the metadata will be rewritten into v2, and that the v1 metadata will be vacuumed once the v2 write succeeds. Is my understanding correct?

Thanks!
Miao

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, "rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, May 5, 2020 at 5:03 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: [DISCUSS] Changes for row-level deletes

Hi, everyone,

I know several people that are planning to attend the sync tomorrow are interested in the row-level delete work, so I wanted to share some of the progress and my current thinking ahead of time.

The codebase now supports a new version number, 2. Tables must be manually upgraded to version 2 in order to use any of the metadata changes we are making; v1 readers cannot read v2 tables. When a write takes place, the version number is now passed to the manifest writer, manifest list writer, etc., and the right schema for the table's current version is used. We've also frozen the v1 schemas and added wrappers to ensure that even as the internal classes, like DataFile, evolve, the exact same data is written to v1.

Next, we've added sequence numbers and the proposed inheritance scheme to v2, along with tests to ensure that v1 is written without sequence numbers and that, when reading v1 metadata, the sequence numbers are all 0. This gives us the ability to track "when" a row-level delete occurred in a v2 table.

The next steps are to start making larger changes to metadata files.

One change that I've been considering is getting rid of manifest_entry. In v1, a manifest stored a manifest_entry that wrapped a data_file. The intent was to separate data that API users needed to supply -- fields in data_file -- from data that was tracked internally by Iceberg -- the snapshot_id and status fields of manifest_entry. If we want to combine these so that a manifest stores one top-level data_file struct, then now is the time to make that change. I've prototyped this in #963 <https://github.com/apache/incubator-iceberg/pull/963>. The benefit is that the schema is flatter, so we wouldn't need two metadata tables (entries and files). The main drawback is that we aren't going to stop using v1 tables, so we would effectively have two different manifest schemas instead of v2 as an evolution of v1. I'd love to hear more opinions on whether to do this. I'm leaning toward not merging the two.

Another change is to start adding tracking fields for delete files and updating the APIs. The metadata for this is fairly simple: an enum that stores whether a file contains data, position deletes, or equality deletes. The main decision point is whether to allow mixing data files and delete files together in manifests. I don't think that we should allow manifests with both delete files and data files.
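As a rough illustration, the tracking enum could look something like the sketch below. The names and ids are placeholders, not a committed design:

    // Sketch only: a per-file content marker stored in manifest metadata.
    // Names and id values here are illustrative, not a final API.
    public enum FileContent {
      DATA(0),              // regular data file
      POSITION_DELETES(1),  // deletes identified by file path and row position
      EQUALITY_DELETES(2);  // deletes identified by matching column values

      private final int id;

      FileContent(int id) {
        this.id = id;
      }

      public int id() {
        return id;
      }
    }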
The reason is job planning: we want to start emitting splits immediately so that we can stream them, instead of holding them all in memory. That means we need some way to guarantee that we know all of the delete files to apply to a data file before we encounter the data file. OpenInx suggested sorting by sequence number to see delete files before data files, but that still requires holding all splits in memory in the worst case, due to overlapping sequence number ranges.

I think Iceberg should plan a scan in two phases: one to find matching delete files (held in memory) and one to find matching data files (a rough sketch is in the P.S. below). That solves the problem of having all deletes available so a split can be emitted immediately, and it also allows parallelizing both phases without coordination across threads. For the two-phase approach, mixing delete files and data files in a manifest would require reading that manifest twice, once in each phase. I think it makes the most sense to keep delete files and data files in separate manifests. The trade-off is that Iceberg will need to track the content of a manifest (deletes or data) and perform actions on separate manifest groups.

Also, because with separate delete and data manifests we _could_ use separate manifest schemas, I went through and wrote out a schema for a delete file manifest. That schema was so similar to the current data file schema that I think it's simpler to use the same one for both.

In summary, here are the things that we need to decide and what I think we should do:

* Merge manifest_entry and data_file? I think we should not, to avoid additional complexity.
* How should planning with delete files work? The two-phase approach is the only one I think is viable.
* Mix delete files and data files in manifests? I think we should not, to support the two-phase planning approach.
* If delete files and data files are separate, should manifests use the same schema? Yes, because it is simpler.

Let's plan on talking about these questions in tomorrow's sync. And if you have other topics, please send them to me!

rb

--
Ryan Blue
Software Engineer
Netflix
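P.S. To make the two-phase planning idea concrete, here is a minimal sketch. The types (DeleteFile, DataFile, Split) and the applicability check are hypothetical stand-ins for illustration, not the real planner API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Hypothetical stand-ins for the real metadata classes.
    record DeleteFile(long sequenceNumber, String partition) {}
    record DataFile(long sequenceNumber, String partition, String path) {}
    record Split(DataFile file, List<DeleteFile> deletes) {}

    class TwoPhasePlanner {
      // Phase 1: scan only the delete manifests and hold the matching
      // delete files in memory; this set is expected to be small.
      List<DeleteFile> planDeletes(Iterable<DeleteFile> deleteManifestEntries) {
        List<DeleteFile> deletes = new ArrayList<>();
        deleteManifestEntries.forEach(deletes::add);
        return deletes;
      }

      // Phase 2: stream the data files. Because every relevant delete is
      // already known, each split is emitted as soon as its data file is
      // read, instead of buffering all splits until planning finishes.
      void planSplits(Iterable<DataFile> dataFiles, List<DeleteFile> deletes,
                      Consumer<Split> emit) {
        for (DataFile file : dataFiles) {
          List<DeleteFile> applicable = new ArrayList<>();
          for (DeleteFile delete : deletes) {
            // A delete can only affect rows written before it, so only
            // deletes with a higher sequence number apply; partition and
            // column-stats pruning would narrow this set further.
            if (delete.sequenceNumber() > file.sequenceNumber()
                && delete.partition().equals(file.partition())) {
              applicable.add(delete);
            }
          }
          emit.accept(new Split(file, applicable));
        }
      }
    }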