Re: [DISCUSS] adoption of format version 3

Ryan Blue Thu, 15 Aug 2024 16:52:27 -0700

Quick update: I just opened PR 10948
<https://github.com/apache/iceberg/pull/10948> with some prep work for v3.
The main change is that it makes the support requirements for unknown
transforms clear:
* Writers are not allowed to commit data using a partition spec that
contains a field with an unknown transform.
* Readers should ignore partition fields that use unknown transforms and
must ignore them starting in v3.
Specifically, in scan planning: "The inclusive projection for an unknown
partition transform is _true_ because the partition field is ignored and
not used in filtering."
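
As a rough illustration of the two rules above (a minimal sketch, not the
Iceberg library's API): the names KNOWN_TRANSFORMS, PartitionField,
inclusive_projection and can_commit are hypothetical, and the bucket function
is a toy stand-in for the real hash-based transform. It only shows an unknown
transform projecting to "always true" for reads and blocking commits for
writes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry of the transforms this reader/writer understands.
# The bucket entry is a toy stand-in, not Iceberg's hash-based bucket transform.
KNOWN_TRANSFORMS: Dict[str, Callable[[int], int]] = {
    "identity": lambda v: v,
    "bucket[4]": lambda v: v % 4,
}


@dataclass(frozen=True)
class PartitionField:
    source_column: str
    transform: str  # transform name as it would appear in table metadata


def inclusive_projection(
    field: PartitionField, column: str, value: int
) -> Callable[[int], bool]:
    """Project the row filter `column == value` onto one partition field.

    The returned predicate takes a partition value and must return True for
    every partition that could contain matching rows.
    """
    if field.source_column != column:
        # The filter does not touch this field, so it cannot prune anything.
        return lambda partition_value: True
    transform = KNOWN_TRANSFORMS.get(field.transform)
    if transform is None:
        # Unknown transform: ignore the field, i.e. the projection is "true".
        return lambda partition_value: True
    expected = transform(value)
    return lambda partition_value: partition_value == expected


def can_commit(spec: List[PartitionField]) -> bool:
    """Writers may only commit with specs whose transforms are all known."""
    return all(f.transform in KNOWN_TRANSFORMS for f in spec)


spec = [PartitionField("id", "bucket[4]"), PartitionField("id", "future_transform")]
known, unknown = (inclusive_projection(f, "id", 10) for f in spec)
print(known(2), known(3))      # 10 % 4 == 2, so this prints: True False
print(unknown(2), unknown(3))  # unknown transform never prunes: True True
print(can_commit(spec))        # spec has an unknown transform: False
```

The point of the sketch is that the read path degrades gracefully (no pruning
on the unknown field, but still correct results), while the write path refuses
to produce data it cannot partition correctly.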


That also cleans up some of the boilerplate to work for v2 and v3.

I also have an update on type promotion, but that's a longer issue so I'll
start a new thread.

On Wed, Aug 7, 2024 at 2:29 PM Ryan Blue <b...@databricks.com> wrote:

> we can still discuss the remaining items in the Iceberg geometry proposal:
> expression, partition transform
>
> For these two, I think that we can make them backward-compatible. If we
> update the spec to allow adding transforms between major releases, then we
> don’t need to have the transform done by v3. Similarly, the expressions are
> entirely in the Iceberg library so adding them after v3 is backward
> compatible. We don’t think we need v3 to depend on either one!
>
> if the parquet geometry is accepted, we can quickly update the Iceberg
> proposal and start a vote?
>
> It seems reasonable to me. If the Parquet type is done, we should be able
> to get the geometry type in.
>
> Ryan
>
> On Tue, Aug 6, 2024 at 6:30 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Ryan, Szehon, and other folks in the thread,
>>
>> Thanks for summarizing this. I'd love to have the geometry type in V3
>> spec but I'd like to also understand the expected timeline of V3. Is
>> there a rough cutoff time?
>>
>> Currently, we are actively working on the Parquet Geometry type
>> proposal. We have resolved almost all concerns in the spec and the
>> implementation is in progress. I think this will be accepted in a
>> month (of course, pending the vote in the Parquet community).
>>
>> On the other hand, we can still discuss the remaining items in the
>> Iceberg geometry proposal: expression, partition transform. But I
>> think we have addressed the concerns on those items? So if the parquet
>> geometry is accepted, we can quickly update the Iceberg proposal and
>> start a vote? Of course, I'd love to have green lights from a few more
>> Iceberg PMC members so we will be more confident on the timeline of
>> this proposal.
>>
>> Thanks,
>> Jia
>>
>> On Tue, Aug 6, 2024 at 4:50 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >
>> > It makes sense to me, thanks for summarizing it, it's an exciting list
>> of new features.
>> >
>> > For Geo, I will let Wherobots engineers (Jia Yu and others) working
>> there to comment, but geo type could take more time, if we wait for
>> Parquet-Format change, followed by Parquet implementation release.
>> >
>> > +1 about multi-value transform, I think it will be great and do-able to
>> get those in; the spec allows their existence but it's just waiting on
>> implementation/review.
>> >
>> > Thanks
>> > Szehon
>> >
>> > On Tue, Aug 6, 2024 at 4:42 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>> >>
>> >> I’ve been going through the list I’ve accumulated for v3 changes and I
>> think we do have a fairly clear set of things that people are working on.
>> There are two main areas. The first is centered around types and extending
>> existing metadata:
>> >>
>> >> Add new types: timestamp(ns), variant, blob, and null
>> >> Add new type promotion: long to timestamp,
>> boolean/int/long/date/time/timestamp/uuid to string, null to anything, most
>> types to variant
>> >> Add default value support via initial-default, write-default
>> >> Add multi-arg transforms (multi-column bucket, zorder)
>> >>
>> >> Then there are a few bigger items that have people actively working:
>> >>
>> >> Row-level tracking metadata
>> >> Improvements for position delete performance
>> >> Encryption metadata
>> >> Geo support: geometry type, xz transform, and geo predicates
>> >>
>> >> I propose that we target the first set of things since that’s a group
>> of similar changes. It makes sense (at least to me) to add new types in a
>> group, and it also makes sense to extend type capabilities (defaults and
>> promotions) at the same time. (Also, we can choose to exclude blob if it is
>> a large amount of work)
>> >>
>> >> I’d include multi-arg transforms in that group since the design is
>> well written and nearly done. And we can make sure that there is a
>> backward-compatible way to add new transforms between major releases. The
>> Java library can currently handle new transforms and if we get those
>> details into the spec then we don’t need to get the specifics of multi-arg
>> bucketing as part of the v3 release.
>> >>
>> >> For the second group of projects, I suggest that we continue to
>> actively work on them and try to get at least 2 of them in. Encryption
>> metadata is quite close and just needs a few table-level additions to the
>> metadata file. The changes for row-level tracking and position delete
>> performance should be reasonably sized.
>> >>
>> >> I’d also love to see the geo support in v3, but that’s also a
>> well-scoped feature that could be a v4 if it isn’t going to make it in
>> time. My main concern here is the size of the changes where I don’t have
>> much context.
>> >>
>> >> In summary, I’d say we should aim to include the new types, promotion,
>> default values, and multi-arg transforms. Then include any of the larger
>> items that are ready in time. Does that sound reasonable?
>> >>
>> >> Ryan
>> >>
>> >>
>> >> On Mon, Aug 5, 2024 at 3:20 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >>>>
>> >>>> I suggest keeping those things separate — Micah, would you mind
>> starting a separate thread so this one can focus on v3?
>> >>>
>> >>>
>> >>> Yes I'll start another thread on this post V3, to allow for focus on
>> closing off V3 with the current process (and see if there is interest in
>> trying something new for v4).
>> >>>
>> >>> Thanks,
>> >>> Micah
>> >>>
>> >>> On Mon, Aug 5, 2024 at 12:17 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>> >>>>
>> >>>> At least for discussion purposes, I think the REST spec (and any
>> spec that involves code that will ultimately be consumed) is probably a
>> harder conversation.
>> >>>>
>> >>>> I agree that it’s a very different conversation and probably out of
>> scope for the table v3 spec.
>> >>>>
>> >>>> I’m undecided if minor releases are necessary for non-code specs,
>> this seems like it might be too much overhead and might not provide a ton
>> of value (maybe you could elaborate on the value you see in it?).
>> >>>>
>> >>>> Thanks for bringing up the point about minor versions. It’s critical
>> to keep in mind that we’re talking about two different types of changes.
>> For the v3 discussion, I think the question is what changes we want to add
>> in v3, which is an opportunity to group together forward-incompatible
>> changes that require new behavior to read tables correctly.
>> >>>>
>> >>>> It’s great to discuss whether we want to change how we version the
>> spec and see if we want to release breaking changes more often. I think
>> that was Micah’s original intent for bringing up a regular release cadence
>> for the spec. But we should also be aware that this is a separate
>> discussion. Most of the points that Micah raised are covered by our
>> existing process for new major versions:
>> >>>>
>> >>>> Add changes to the spec such that they are clearly attached to a
>> future version
>> >>>> Implement the changes in at least one implementation, probably the
>> reference implementation
>> >>>> When we have accumulated enough breaking changes, vote to adopt the
>> new version
>> >>>>
>> >>>> There are differences that we may choose to change, like adding the
>> changes to the spec rather than keeping them in PRs. And we may want to
>> introduce a regular cadence to make the last step more predictable. Those
>> are great discussions to have, but right now we know that we have changes
>> we want to get into a v3 in the next few months. I suggest keeping those
>> things separate — Micah, would you mind starting a separate thread so this
>> one can focus on v3?
>> >>>>
>> >>>> I also see that if we were to go with Micah’s suggestion, it has an
>> impact on the decisions that we need to make for the v3 release. But I
>> think that even if we were to have a regular release cadence, it would
>> still make sense to group features like new types together because it makes
>> the versions easier to understand and limits the overall impact in the
>> implementations.
>> >>>>
>> >>>> Ryan
>> >>>>
>> >>>>
>> >>>> On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <
>> emkornfi...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> I have been a big advocate for releasing all the Iceberg specs
>> regularly, and just follow a normal product release cycle with major and
>> minor releases. I touched a bit of the reasoning in the thread for fixing
>> stats fields in REST spec [1]. This helps a lot with engines that do not
>> use any Iceberg open source library and just look at a spec and implement
>> it. With a regular release, they can have a stable version to look into,
>> rather than a spec that is changing all the time within the same version.
>> >>>>>
>> >>>>>
>> >>>>> At least for discussion purposes, I think the REST spec (and any
>> spec that involves code that will ultimately be consumed) is probably a
>> harder conversation.  I'm undecided if minor releases are necessary for
>> non-code specs, this seems like it might be too much overhead and might not
>> provide a ton of value (maybe you could elaborate on the value you see in
>> it?).
>> >>>>>
>> >>>>>>
>> >>>>>> I think Fokko brought up a point that "this will introduce a
>> process that will slow the evolution down", which is true because you need
>> to spend additional effort and release it. And without a reference
>> implementation, it is hard to say if the spec is mature enough to be
>> released, which again makes it potentially tied to the release cycle of at
>> least the Java library.
>> >>>>>
>> >>>>>
>> >>>>> Sorry I think I missed Fokko's argument on the linked thread.  In
>> my mind, the order of operations on non-code spec changes would be:
>> >>>>>
>> >>>>> 1.  Spec change is proposed/reviewed and agreed upon but not merged.
>> >>>>> 2.  Reference implementation happens (possibly with revisions if
>> implementation challenges arise).
>> >>>>> 3.  Reference implementation is merged
>> >>>>> 4.  Spec change is merged.
>> >>>>> 5.  Spec is officially  "released" at some normal cadence (or in
>> theory it could be done immediately).
>> >>>>>
>> >>>>> Steps 3 and 4 could happen simultaneously, or 4 could potentially
>> have some lag to it to allow for further feedback (i.e. letting reference
>> implementation be released) and revision.
>> >>>>>
>> >>>>> If step 5 is done immediately after step 4, I don't think this
>> would slow down evolution (but comes at the cost of more versions).  Part
>> of step five would necessitate changing code for any incomplete
>> implementations to only be turned on in the next revision (or larger
>> features could be worked on in a separate branch to avoid this
>> complication).
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Micah
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> > An alternative view: Would it make sense to start releasing the
>> table specification on a regular cadence (e.g. quarterly, every 6 months or
>> yearly)?
>> >>>>>>
>> >>>>>> I have been a big advocate for releasing all the Iceberg specs
>> regularly, and just follow a normal product release cycle with major and
>> minor releases. I touched a bit of the reasoning in the thread for fixing
>> stats fields in REST spec [1]. This helps a lot with engines that do not
>> use any Iceberg open source library and just look at a spec and implement
>> it. With a regular release, they can have a stable version to look into,
>> rather than a spec that is changing all the time within the same version.
>> >>>>>>
>> >>>>>> It is important to note that minor spec versions will not be
>> leveraged in implementations the way we have logic right now for
>> switching behaviors depending on major versions. It is purely for the
>> purpose of making more incremental progress on the spec, and providing
>> stable spec versions for other reference implementations. Otherwise, the
>> branches in the codebase to handle different versions easily get out of
>> control.
>> >>>>>>
>> >>>>>> I think Fokko brought up a point that "this will introduce a
>> process that will slow the evolution down", which is true because you need
>> to spend additional effort and release it. And without a reference
>> implementation, it is hard to say if the spec is mature enough to be
>> released, which again makes it potentially tied to the release cycle of at
>> least the Java library.
>> >>>>>>
>> >>>>>> Curious what people think.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Jack Ye
>> >>>>>>
>> >>>>>> [1]
>> https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>> >>>>>>
>> >>>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <
>> emkornfi...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> It sounds like most of the opinions so far are waiting for the
>> scope of work to finish before finalizing the specification.
>> >>>>>>>
>> >>>>>>> An alternative view: Would it make sense to start releasing the
>> table specification on a regular cadence (e.g. quarterly, every 6 months or
>> yearly)?  I think the problem with waiting for features to get in is that
>> priorities change and things take longer than expected, thus leaving the
>> actual finalization of the specification in limbo and probably adds to
>> project management overhead.   If the specification is released regularly
>> then it means features can always be included in the next release without
>> too much delay hopefully.  The main downside I can think of in this
>> approach is having to have more branches in code to handle different
>> versions.
>> >>>>>>>
>> >>>>>>> One corollary to this approach is spec changes shouldn't be
>> merged before their implementations are ready.
>> >>>>>>>
>> >>>>>>>>   - At least one complete reference implementation should exist.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> For more complicated features I think at some point soon it might
>> be worth considering two implementations (or at least 1 full implementation
>> and 1 read only implementation) to make sure there aren't compatibility
>> issues/misunderstandings in the specification (e.g. I think Variant and
>> Geography fall into this category).
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>> Micah
>> >>>>>>>
>> >>>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> I think this all sounds good, the real question is whether or
>> not we have someone to actively work on the proposals. I think for things
>> like Default Values and Geo Types we have folks actively working on them so
>> it's not a big deal.
>> >>>>>>>>
>> >>>>>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <
>> szehon.apa...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Sorry I missed the sync this morning (sick), I'd like to push
>> for geo too.
>> >>>>>>>>>
>> >>>>>>>>> I think on this front as per the last sync, Ryan recommended to
>> wait for Parquet support to land, to avoid having two versions on Iceberg
>> side (Iceberg-native vs Parquet-native).  Parquet support is being actively
>> worked on iiuc: https://github.com/apache/parquet-format/pull/240 .  But
>> it would bind V3 to the parquet-format release timeline, unless we start
>> with iceberg-native support first and move later (as we originally
>> proposed).
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Szehon
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Another feature that was planned for V3 is support for default
>> values.
>> >>>>>>>>>> Spec doc update was already merged a while ago [1].
>> Implementation is
>> >>>>>>>>>> ongoing in this PR [2].
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://iceberg.apache.org/spec/#default-values
>> >>>>>>>>>> [2] https://github.com/apache/iceberg/pull/9502
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Walaa.
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>> >>>>>>>>>> <russell.spit...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> > Thanks for bringing this up, I would say that from my
>> perspective I have time to really push through hopefully two things
>> >>>>>>>>>> >
>> >>>>>>>>>> > Variant Type and
>> >>>>>>>>>> > Row Lineage (which I will have a proposal for on the mailing
>> list next week)
>> >>>>>>>>>> >
>> >>>>>>>>>> > I'm using the Project to try to track logistics and minutia
>> required for the new spec version but I would like to bring other work in
>> there as well so we can get a clear picture of what is actually being
>> actively worked on.
>> >>>>>>>>>> >
>> >>>>>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
>> jacobmar...@influxdata.com> wrote:
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Good morning,
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> To continue the community sync today when format version 3
>> was discussed.
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Questions answered by consensus:
>> >>>>>>>>>> >> - Format version releases should _not_ be tied to Iceberg
>> version releases.
>> >>>>>>>>>> >> - Several planned features will require format version
>> releases; the process shouldn't be onerous.
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Unanswered questions:
>> >>>>>>>>>> >> - What will be included in format version 3?
>> >>>>>>>>>> >>   - What is a reasonable target date?
>> >>>>>>>>>> >>   - How to track progress? Today, there are two public
>> lists:
>> >>>>>>>>>> >>     - GH milestone:
>> https://github.com/apache/iceberg/milestone/42
>> >>>>>>>>>> >>     - GH project:
>> https://github.com/orgs/apache/projects/377
>> >>>>>>>>>> >> - What is required of a feature in order to be included in
>> any adopted format version?
>> >>>>>>>>>> >>   - At least one complete reference implementation should
>> exist.
>> >>>>>>>>>> >>     - Java is the reference implementation by convention;
>> that's OK, but not perfect. Should Java be the reference implementation by
>> mandate?
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Have I missed anything?
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> --
>> >>>>>>>>>> >> Jacob Marble
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Ryan Blue
>> >>>> Databricks
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Databricks
>>
>
>
> --
> Ryan Blue
> Databricks
>


-- 
Ryan Blue
Databricks
