Quick update: I just opened PR 10948 <https://github.com/apache/iceberg/pull/10948> with some prep work for v3. The main change is that it makes the support requirements for unknown transforms clear:

* Writers are not allowed to commit data using a partition spec that contains a field with an unknown transform.
* Readers should ignore partition fields that use unknown transforms and must ignore them starting in v3.

Specifically, in scan planning: "The inclusive projection for an unknown partition transform is _true_ because the partition field is ignored and not used in filtering."
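
To make the two rules concrete, here is a rough Python sketch. It is not the code in the PR and not the Iceberg API: `PartitionField`, `validate_for_write`, and `file_may_match` are hypothetical names, the known-transform set simply lists the transform names in the current spec, and predicates are simplified to callables over already-transformed partition values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Transform names defined by the table spec; anything else counts as unknown here.
KNOWN_TRANSFORMS = {"identity", "bucket", "truncate", "year", "month", "day", "hour", "void"}


@dataclass(frozen=True)
class PartitionField:  # hypothetical stand-in, not an Iceberg class
    name: str
    transform: str  # e.g. "identity", "bucket[16]", or something unrecognized


def is_known(transform: str) -> bool:
    # "bucket[16]" and "truncate[4]" carry a parameter; compare the base name only.
    return transform.split("[", 1)[0] in KNOWN_TRANSFORMS


def validate_for_write(spec: List[PartitionField]) -> None:
    """Writer rule: refuse to commit with a spec that has an unknown transform."""
    unknown = [f.name for f in spec if not is_known(f.transform)]
    if unknown:
        raise ValueError(f"cannot write: unknown transform on fields {unknown}")


def file_may_match(
    spec: List[PartitionField],
    partition: Dict[str, Any],
    predicates: Dict[str, Callable[[Any], bool]],
) -> bool:
    """Reader rule: a field with an unknown transform never filters a file out."""
    for field in spec:
        if not is_known(field.transform):
            continue  # inclusive projection is "true": the field is ignored
        predicate = predicates.get(field.name)
        if predicate is not None and not predicate(partition.get(field.name)):
            return False
    return True
```

With a spec containing a made-up transform such as `future_transform[12]`, `validate_for_write` raises, while `file_may_match` still plans the scan using only the fields it understands.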

The PR also cleans up some of the boilerplate so that it works for both v2 and v3.

I also have an update on type promotion, but that's a longer issue so I'll start a new thread.

On Wed, Aug 7, 2024 at 2:29 PM Ryan Blue <b...@databricks.com> wrote:

> we can still discuss the remaining items in the Iceberg geometry proposal: expression, partition transform
>
> For these two, I think that we can make them backward-compatible. If we update the spec to allow adding transforms between major releases, then we don’t need to have the transform done by v3. Similarly, the expressions are entirely in the Iceberg library so adding them after v3 is backward compatible. We don’t think we need v3 to depend on either one!
>
> if the parquet geometry is accepted, we can quickly update the Iceberg proposal and start a vote?
>
> It seems reasonable to me. If the Parquet type is done, we should be able to get the geometry type in.
>
> Ryan
>
> On Tue, Aug 6, 2024 at 6:30 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Ryan, Szehon, and other folks in the thread,
>>
>> Thanks for summarizing this. I'd love to have the geometry type in V3 spec but I'd like to also understand the expected timeline of V3. Is there a rough cutoff time?
>>
>> Currently, we are actively working on the Parquet Geometry type proposal. We have resolved almost all concerns in the spec and the implementation is in progress. I think this will be accepted in a month (of course, pending the vote in the Parquet community).
>>
>> On the other hand, we can still discuss the remaining items in the Iceberg geometry proposal: expression, partition transform. But I think we have addressed the concerns on those items? So if the parquet geometry is accepted, we can quickly update the Iceberg proposal and start a vote? Of course, I'd love to have green lights from a few more Iceberg PMC members so we will be more confident on the timeline of this proposal.
>>
>> Thanks,
>> Jia
>>
>> On Tue, Aug 6, 2024 at 4:50 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >
>> > It makes sense to me, thanks for summarizing it, it's an exciting list of new features.
>> >
>> > For Geo, I will let Wherobots engineers (Jia Yu and others) working there to comment, but geo type could take more time, if we wait for Parquet-Format change, followed by Parquet implementation release.
>> >
>> > +1 about multi-value transform, I think it will be great and do-able to get those in, the spec allows their existence but it's just waiting on implementation/review.
>> >
>> > Thanks
>> > Szehon
>> >
>> > On Tue, Aug 6, 2024 at 4:42 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>> >>
>> >> I’ve been going through the list I’ve accumulated for v3 changes and I think we do have a fairly clear set of things that people are working on. There are two main areas. The first is centered around types and extending existing metadata:
>> >>
>> >> Add new types: timestamp(ns), variant, blob, and null
>> >> Add new type promotion: long to timestamp, boolean/int/long/date/time/timestamp/uuid to string, null to anything, most types to variant
>> >> Add default value support via initial-default, write-default
>> >> Add multi-arg transforms (multi-column bucket, zorder)
>> >>
>> >> Then there are a few bigger items that have people actively working:
>> >>
>> >> Row-level tracking metadata
>> >> Improvements for position delete performance
>> >> Encryption metadata
>> >> Geo support: geometry type, xz transform, and geo predicates
>> >>
>> >> I propose that we target the first set of things since that’s a group of similar changes. It makes sense (at least to me) to add new types in a group, and it also makes sense to extend type capabilities (defaults and promotions) at the same time.
>> >> (Also, we can choose to exclude blob if it is a large amount of work)
>> >>
>> >> I’d include multi-arg transforms in that group since the design is well written and nearly done. And we can make sure that there is a backward-compatible way to add new transforms between major releases. The Java library can currently handle new transforms and if we get those details into the spec then we don’t need to get the specifics of multi-arg bucketing as part of the v3 release.
>> >>
>> >> For the second group of projects, I suggest that we continue to actively work on them and try to get at least 2 of them in. Encryption metadata is quite close and just needs a few table-level additions to the metadata file. The changes for row-level tracking and position delete performance should be reasonably sized.
>> >>
>> >> I’d also love to see the geo support in v3, but that’s also a well-scoped feature that could be a v4 if it isn’t going to make it in time. My main concern here is the size of the changes where I don’t have much context.
>> >>
>> >> In summary, I’d say we should aim to include the new types, promotion, default values, and multi-arg transforms. Then include any of the larger items that are ready in time. Does that sound reasonable?
>> >>
>> >> Ryan
>> >>
>> >> On Mon, Aug 5, 2024 at 3:20 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>
>> >>>> I suggest keeping those things separate. Micah, would you mind starting a separate thread so this one can focus on v3?
>> >>>
>> >>> Yes I'll start another thread on this post V3, to allow for focus on closing off V3 with the current process (and see if there is interest in trying something new for v4).
>> >>>
>> >>> Thanks,
>> >>> Micah
>> >>>
>> >>> On Mon, Aug 5, 2024 at 12:17 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>> >>>>
>> >>>> At least for discussion purposes, I think the REST spec (and any spec that involves code that will ultimately be consumed) is probably a harder conversation.
>> >>>>
>> >>>> I agree that it’s a very different conversation and probably out of scope for the table v3 spec.
>> >>>>
>> >>>> I’m undecided if minor releases are necessary for non-code specs, this seems like it might be too much overhead and might not provide a ton of value (maybe you could elaborate on the value you see in it?).
>> >>>>
>> >>>> Thanks for bringing up the point about minor versions. It’s critical to keep in mind that we’re talking about two different types of changes. For the v3 discussion, I think the question is what changes we want to add in v3, which is an opportunity to group together forward-incompatible changes that require new behavior to read tables correctly.
>> >>>>
>> >>>> It’s great to discuss whether we want to change how we version the spec and see if we want to release breaking changes more often. I think that was Micah’s original intent for bringing up a regular release cadence for the spec. But we should also be aware that this is a separate discussion. Most of the points that Micah raised are covered by our existing process for new major versions:
>> >>>>
>> >>>> Add changes to the spec such that they are clearly attached to a future version
>> >>>> Implement the changes in at least one implementation, probably the reference implementation
>> >>>> When we have accumulated enough breaking changes, vote to adopt the new version
>> >>>>
>> >>>> There are differences that we may choose to change, like adding the changes to the spec rather than keeping them in PRs. And we may want to introduce a regular cadence to make the last step more predictable.
>> >>>> Those are great discussions to have, but right now we know that we have changes we want to get into a v3 in the next few months. I suggest keeping those things separate. Micah, would you mind starting a separate thread so this one can focus on v3?
>> >>>>
>> >>>> I also see that if we were to go with Micah’s suggestion, it has an impact on the decisions that we need to make for the v3 release. But I think that even if we were to have a regular release cadence, it would still make sense to group features like new types together because it makes the versions easier to understand and limits the overall impact in the implementations.
>> >>>>
>> >>>> Ryan
>> >>>>
>> >>>> On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> I have been a big advocate for releasing all the Iceberg specs regularly, and just follow a normal product release cycle with major and minor releases. I touched a bit of the reasoning in the thread for fixing stats fields in REST spec [1]. This helps a lot with engines that do not use any Iceberg open source library and just look at a spec and implement it. With a regular release, they can have a stable version to look into, rather than a spec that is changing all the time within the same version.
>> >>>>>
>> >>>>> At least for discussion purposes, I think the REST spec (and any spec that involves code that will ultimately be consumed) is probably a harder conversation. I'm undecided if minor releases are necessary for non-code specs, this seems like it might be too much overhead and might not provide a ton of value (maybe you could elaborate on the value you see in it?).
>> >>>>>
>> >>>>>>
>> >>>>>> I think Fokko brought up a point that "this will introduce a process that will slow the evolution down", which is true because you need to spend additional effort and release it.
>> >>>>>> And without a reference implementation, it is hard to say if the spec is mature enough to be released, which again makes it potentially tied to the release cycle of at least the Java library.
>> >>>>>
>> >>>>> Sorry I think I missed Fokko's argument on the linked thread. In my mind, the order of operations on non-code spec changes would be:
>> >>>>>
>> >>>>> 1. Spec change is proposed/reviewed and agreed upon but not merged.
>> >>>>> 2. Reference implementation happens (possibly with revisions if implementation challenges arise).
>> >>>>> 3. Reference implementation is merged.
>> >>>>> 4. Spec change is merged.
>> >>>>> 5. Spec is officially "released" at some normal cadence (or in theory it could be done immediately).
>> >>>>>
>> >>>>> Steps 3 and 4 could happen simultaneously, or 4 could potentially have some lag to it to allow for further feedback (i.e. letting reference implementation be released) and revision.
>> >>>>>
>> >>>>> If step 5 is done immediately after step 4, I don't think this would slow down evolution (but comes at the cost of more versions). Part of step five would necessitate changing code for any incomplete implementations to only be turned on in the next revision (or larger features could be worked on in a separate branch to avoid this complication).
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Micah
>> >>>>>
>> >>>>> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> > An alternative view: Would it make sense to start releasing the table specification on a regular cadence (e.g. quarterly, every 6 months or yearly)?
>> >>>>>>
>> >>>>>> I have been a big advocate for releasing all the Iceberg specs regularly, and just follow a normal product release cycle with major and minor releases. I touched a bit of the reasoning in the thread for fixing stats fields in REST spec [1].
>> >>>>>> This helps a lot with engines that do not use any Iceberg open source library and just look at a spec and implement it. With a regular release, they can have a stable version to look into, rather than a spec that is changing all the time within the same version.
>> >>>>>>
>> >>>>>> It is important to note that minor spec versions will not be leveraged in implementations like how we have logic right now for switching behaviors depending on major versions. It is purely for the purpose of making more incremental progress on the spec, and providing stable spec versions for other reference implementations. Otherwise, the branches in the codebase to handle different versions easily get out of control.
>> >>>>>>
>> >>>>>> I think Fokko brought up a point that "this will introduce a process that will slow the evolution down", which is true because you need to spend additional effort and release it. And without a reference implementation, it is hard to say if the spec is mature enough to be released, which again makes it potentially tied to the release cycle of at least the Java library.
>> >>>>>>
>> >>>>>> Curious what people think.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Jack Ye
>> >>>>>>
>> >>>>>> [1] https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>> >>>>>>
>> >>>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> It sounds like most of the opinions so far are waiting for the scope of work to finish before finalizing the specification.
>> >>>>>>>
>> >>>>>>> An alternative view: Would it make sense to start releasing the table specification on a regular cadence (e.g. quarterly, every 6 months or yearly)? I think the problem with waiting for features to get in is that priorities change and things take longer than expected, thus leaving the actual finalization of the specification in limbo and probably adding to project management overhead.
>> >>>>>>> If the specification is released regularly then it means features can always be included in the next release without too much delay hopefully. The main downside I can think of in this approach is having to have more branches in code to handle different versions.
>> >>>>>>>
>> >>>>>>> One corollary to this approach is spec changes shouldn't be merged before their implementations are ready.
>> >>>>>>>
>> >>>>>>>> - At least one complete reference implementation should exist.
>> >>>>>>>
>> >>>>>>> For more complicated features I think at some point soon it might be worth considering two implementations (or at least 1 full implementation and 1 read only implementation) to make sure there aren't compatibility issues/misunderstandings in the specification (e.g. I think Variant and Geography fall into this category).
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>> Micah
>> >>>>>>>
>> >>>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> I think this all sounds good, the real question is whether or not we have someone to actively work on the proposals. I think for things like Default Values and Geo Types we have folks actively working on them so it's not a big deal.
>> >>>>>>>>
>> >>>>>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Sorry I missed the sync this morning (sick), I'd like to push for geo too.
>> >>>>>>>>>
>> >>>>>>>>> I think on this front as per the last sync, Ryan recommended to wait for Parquet support to land, to avoid having two versions on Iceberg side (Iceberg-native vs Parquet-native). Parquet support is being actively worked on iiuc: https://github.com/apache/parquet-format/pull/240 . But it would bind V3 to the parquet-format release timeline, unless we start with iceberg-native support first and move later (as we originally proposed).
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Szehon
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Another feature that was planned for V3 is support for default values. Spec doc update was already merged a while ago [1]. Implementation is ongoing in this PR [2].
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://iceberg.apache.org/spec/#default-values
>> >>>>>>>>>> [2] https://github.com/apache/iceberg/pull/9502
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Walaa.
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> > Thanks for bringing this up, I would say that from my perspective I have time to really push through hopefully two things
>> >>>>>>>>>> >
>> >>>>>>>>>> > Variant Type and
>> >>>>>>>>>> > Row Lineage (which I will have a proposal for on the mailing list next week)
>> >>>>>>>>>> >
>> >>>>>>>>>> > I'm using the Project to try to track logistics and minutiae required for the new spec version but I would like to bring other work in there as well so we can get a clear picture of what is actually being actively worked on.
>> >>>>>>>>>> >
>> >>>>>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <jacobmar...@influxdata.com> wrote:
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Good morning,
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> To continue the community sync today when format version 3 was discussed.
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Questions answered by consensus:
>> >>>>>>>>>> >> - Format version releases should _not_ be tied to Iceberg version releases.
>> >>>>>>>>>> >> - Several planned features will require format version releases; the process shouldn't be onerous.
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Unanswered questions:
>> >>>>>>>>>> >> - What will be included in format version 3?
>> >>>>>>>>>> >> - What is a reasonable target date?
>> >>>>>>>>>> >> - How to track progress? Today, there are two public lists:
>> >>>>>>>>>> >>   - GH milestone: https://github.com/apache/iceberg/milestone/42
>> >>>>>>>>>> >>   - GH project: https://github.com/orgs/apache/projects/377
>> >>>>>>>>>> >> - What is required of a feature in order to be included in any adopted format version?
>> >>>>>>>>>> >>   - At least one complete reference implementation should exist.
>> >>>>>>>>>> >>   - Java is the reference implementation by convention; that's OK, but not perfect. Should Java be the reference implementation by mandate?
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Have I missed anything?
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> --
>> >>>>>>>>>> >> Jacob Marble
>> >>>>
>> >>>> --
>> >>>> Ryan Blue
>> >>>> Databricks
>> >>
>> >> --
>> >> Ryan Blue
>> >> Databricks
>>
>
> --
> Ryan Blue
> Databricks
>

--
Ryan Blue
Databricks