Re: [DISCUSS] adoption of format version 3

Ryan Blue Tue, 06 Aug 2024 16:41:33 -0700

I’ve been going through the list I’ve accumulated for v3 changes and I
think we do have a fairly clear set of things that people are working on.
There are two main areas. The first is centered around types and extending
existing metadata:


   - Add new types: timestamp(ns), variant, blob, and null
   - Add new type promotion: long to timestamp,
   boolean/int/long/date/time/timestamp/uuid to string, null to anything, most
   types to variant
   - Add default value support via initial-default, write-default
   - Add multi-arg transforms (multi-column bucket, zorder)
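
To make the promotion list concrete, here is a rough sketch of the proposed additions as a simple check. It only illustrates the list above and is not code from any Iceberg library: `can_promote`, `TO_STRING_SOURCES`, and the `variant_sources` parameter are hypothetical names, type names are plain strings, and existing v2 promotions are left out. Because the list only says "most types" can promote to variant, the caller has to supply that set.

```python
# Illustrative only: hypothetical names, not an Iceberg API.
# Source types the proposal allows to promote to string.
TO_STRING_SOURCES = frozenset(
    {"boolean", "int", "long", "date", "time", "timestamp", "uuid"}
)


def can_promote(source, target, variant_sources=frozenset()):
    """Return True if the proposed v3 list allows promoting source to target.

    variant_sources is an assumption: the proposal says "most types" promote
    to variant without naming them, so the caller must pass the set.
    """
    if source == target:
        return True  # same type, no promotion needed
    if source == "null":
        return True  # null to anything
    if source == "long" and target == "timestamp":
        return True  # long to timestamp
    if target == "string" and source in TO_STRING_SOURCES:
        return True  # listed primitive types to string
    if target == "variant" and source in variant_sources:
        return True  # only for types the caller has declared eligible
    return False
```

For example, `can_promote("long", "timestamp")` and `can_promote("null", "blob")` are true, while `can_promote("string", "int")` is false.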

Then there are a few bigger items that have people actively working:

   - Row-level tracking metadata
   - Improvements for position delete performance
   - Encryption metadata
   - Geo support: geometry type, xz transform, and geo predicates

I propose that we target the first set of things since that’s a group of
similar changes. It makes sense (at least to me) to add new types in a
group, and it also makes sense to extend type capabilities (defaults and
promotions) at the same time. (Also, we can choose to exclude blob if it is
a large amount of work.)

I’d include multi-arg transforms in that group since the design is well
written and nearly done. And we can make sure that there is a
backward-compatible way to add new transforms between major releases. The
Java library can currently handle new transforms and if we get those
details into the spec then we don’t need to get the specifics of multi-arg
bucketing as part of the v3 release.

For the second group of projects, I suggest that we continue to actively
work on them and try to get at least 2 of them in. Encryption metadata is
quite close and just needs a few table-level additions to the metadata
file. The changes for row-level tracking and position delete performance
should be reasonably sized.

I’d also love to see the geo support in v3, but that’s also a well-scoped
feature that could be a v4 if it isn’t going to make it in time. My main
concern here is the size of the changes where I don’t have much context.

In summary, I’d say we should aim to include the new types, promotion,
default values, and multi-arg transforms. Then include any of the larger
items that are ready in time. Does that sound reasonable?

Ryan

On Mon, Aug 5, 2024 at 3:20 PM Micah Kornfield <[email protected]>
wrote:

>> I suggest keeping those things separate — Micah, would you mind starting a
>> separate thread so this one can focus on v3?
>
>
> Yes, I'll start another thread on this post-v3, to allow for focus on
> closing off v3 with the current process (and see if there is interest in
> trying something new for v4).
>
> Thanks,
> Micah
>
> On Mon, Aug 5, 2024 at 12:17 PM Ryan Blue <[email protected]>
> wrote:
>
>>> At least for discussion purposes, I think the REST spec (and any spec
>>> that involves code that will ultimately be consumed) is probably a harder
>>> conversation.
>>
>> I agree that it’s a very different conversation and probably out of scope
>> for the table v3 spec.
>>
>>> I’m undecided if minor releases are necessary for non-code specs, this
>>> seems like it might be too much overhead and might not provide a ton of
>>> value (maybe you could elaborate on the value you see in it?).
>>
>> Thanks for bringing up the point about minor versions. It’s critical to
>> keep in mind that we’re talking about two different types of changes. For
>> the v3 discussion, I think the question is what changes we want to add in
>> v3, which is an opportunity to group together forward-incompatible changes
>> that require new behavior to read tables correctly.
>>
>> It’s great to discuss whether we want to change how we version the spec
>> and see if we want to release breaking changes more often. I *think*
>> that was Micah’s original intent for bringing up a regular release cadence
>> for the spec. But we should also be aware that this is a separate
>> discussion. Most of the points that Micah raised are covered by our
>> existing process for new *major* versions:
>>
>>    1. Add changes to the spec such that they are clearly attached to a
>>    future version
>>    2. Implement the changes in at least one implementation, probably the
>>    reference implementation
>>    3. When we have accumulated enough breaking changes, vote to adopt
>>    the new version
>>
>> There are differences that we may choose to change, like adding the
>> changes to the spec rather than keeping them in PRs. And we may want to
>> introduce a regular cadence to make the last step more predictable. Those
>> are great discussions to have, but right now we know that we have changes
>> we want to get into a v3 in the next few months. I suggest keeping those
>> things separate — Micah, would you mind starting a separate thread so this
>> one can focus on v3?
>>
>> I also see that if we were to go with Micah’s suggestion, it has an
>> impact on the decisions that we need to make for the v3 release. But I
>> think that even if we were to have a regular release cadence, it would
>> still make sense to group features like new types together because it makes
>> the versions easier to understand and limits the overall impact in the
>> implementations.
>>
>> Ryan
>>
>> On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>>> I have been a big advocate for releasing all the Iceberg specs
>>>> regularly, and just follow a normal product release cycle with major and
>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>> use any Iceberg open source library and just look at a spec and implement
>>>> it. With a regular release, they can have a stable version to look into,
>>>> rather than a spec that is changing all the time within the same version.
>>>
>>>
>>> At least for discussion purposes, I think the REST spec (and any spec
>>> that involves code that will ultimately be consumed) is probably a harder
>>> conversation.  I'm undecided if minor releases are necessary for non-code
>>> specs, this seems like it might be too much overhead and might not provide
>>> a ton of value (maybe you could elaborate on the value you see in it?).
>>>
>>>
>>>> I think Fokko brought up a point that "this will introduce a process
>>>> that will slow the evolution down", which is true because you need to spend
>>>> additional effort and release it. And without a reference implementation,
>>>> it is hard to say if the spec is mature enough to be released, which again
>>>> makes it potentially tied to the release cycle of at least the Java library.
>>>
>>>
>>> Sorry I think I missed Fokko's argument on the linked thread.  In my
>>> mind, the order of operations on non-code spec changes would be:
>>>
>>> 1.  Spec change is proposed/reviewed and agreed upon but not merged.
>>> 2.  Reference implementation happens (possibly with revisions if
>>> implementation challenges arise).
>>> 3.  Reference implementation is merged
>>> 4.  Spec change is merged.
>>> 5.  Spec is officially  "released" at some normal cadence (or in theory
>>> it could be done immediately).
>>>
>>> Steps 3 and 4 could happen simultaneously, or 4 could potentially have
>>> some lag to it to allow for further feedback (i.e. letting reference
>>> implementation be released) and revision.
>>>
>>> If step 5 is done immediately after step 4, I don't think this would
>>> slow down evolution (but comes at the cost of more versions).  Part of step
>>> five would necessitate changing code for any incomplete implementations to
>>> only be turned on in the next revision (or larger features could be worked
>>> on in a separate branch to avoid this complication).
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>>
>>>
>>> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <[email protected]> wrote:
>>>
>>>> > An alternative view: Would it make sense to start releasing the table
>>>> specification on a regular cadence (e.g. quarterly, every 6 months or
>>>> yearly)?
>>>>
>>>> I have been a big advocate for releasing all the Iceberg specs
>>>> regularly, and just follow a normal product release cycle with major and
>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>> use any Iceberg open source library and just look at a spec and implement
>>>> it. With a regular release, they can have a stable version to look into,
>>>> rather than a spec that is changing all the time within the same version.
>>>>
>>>> It is important to note that minor spec versions will not be leveraged
>>>> in implementations the way major versions are today, where we have logic
>>>> for switching behavior by version. It is purely for the purpose of
>>>> making more incremental progress on the spec, and providing stable spec
>>>> versions for other reference implementations. Otherwise, the branches in
>>>> the codebase to handle different versions easily get out of control.
>>>>
>>>> I think Fokko brought up a point that "this will introduce a process
>>>> that will slow the evolution down", which is true because you need to spend
>>>> additional effort and release it. And without a reference implementation,
>>>> it is hard to say if the spec is mature enough to be released, which again
>>>> makes it potentially tied to the release cycle of at least the Java library.
>>>>
>>>> Curious what people think.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> [1] https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>>>>
>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> It sounds like most of the opinions so far are waiting for the scope
>>>>> of work to finish before finalizing the specification.
>>>>>
>>>>> An alternative view: Would it make sense to start releasing the table
>>>>> specification on a regular cadence (e.g. quarterly, every 6 months or
>>>>> yearly)?  I think the problem with waiting for features to get in is that
>>>>> priorities change and things take longer than expected, thus leaving the
>>>>> actual finalization of the specification in limbo and probably adding to
>>>>> project management overhead.   If the specification is released regularly
>>>>> then it means features can always be included in the next release without
>>>>> too much delay hopefully.  The main downside I can think of in this
>>>>> approach is having to have more branches in code to handle different
>>>>> versions.
>>>>>
>>>>> One corollary to this approach is spec changes shouldn't be merged
>>>>> before their implementations are ready.
>>>>>
>>>>>   - At least one complete reference implementation should exist.
>>>>>
>>>>>
>>>>> For more complicated features I think at some point soon it might be
>>>>> worth considering two implementations (or at least 1 full implementation
>>>>> and 1 read only implementation) to make sure there aren't compatibility
>>>>> issues/misunderstandings in the specification (e.g. I think Variant and
>>>>> Geography fall into this category).
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think this all sounds good, the real question is whether or not we
>>>>>> have someone to actively work on the proposals. I think for things like
>>>>>> Default Values and Geo Types we have folks actively working on them so
>>>>>> it's not a big deal.
>>>>>>
>>>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry I missed the sync this morning (sick), I'd like to push for
>>>>>>> geo too.
>>>>>>>
>>>>>>> I think on this front as per the last sync, Ryan recommended to wait
>>>>>>> for Parquet support to land, to avoid having two versions on the Iceberg
>>>>>>> side (Iceberg-native vs Parquet-native).  Parquet support is being
>>>>>>> actively worked on iiuc
>>>>>>> (https://github.com/apache/parquet-format/pull/240).  But it would bind
>>>>>>> V3 to the parquet-format release timeline, unless we start with
>>>>>>> Iceberg-native support first and move later (as we originally proposed).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Another feature that was planned for V3 is support for default
>>>>>>>> values.
>>>>>>>> Spec doc update was already merged a while ago [1]. Implementation is
>>>>>>>> ongoing in this PR [2].
>>>>>>>>
>>>>>>>> [1] https://iceberg.apache.org/spec/#default-values
>>>>>>>> [2] https://github.com/apache/iceberg/pull/9502
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>>>>>>>> <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Thanks for bringing this up, I would say that from my perspective
>>>>>>>> > I have time to really push through hopefully two things
>>>>>>>> >
>>>>>>>> > Variant Type and
>>>>>>>> > Row Lineage (which I will have a proposal for on the mailing list
>>>>>>>> > next week)
>>>>>>>> >
>>>>>>>> > I'm using the Project to try to track logistics and minutiae
>>>>>>>> > required for the new spec version but I would like to bring other work
>>>>>>>> > in there as well so we can get a clear picture of what is actually being
>>>>>>>> > actively worked on.
>>>>>>>> >
>>>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> Good morning,
>>>>>>>> >>
>>>>>>>> >> Following up on today's community sync, where format version 3 was
>>>>>>>> >> discussed.
>>>>>>>> >>
>>>>>>>> >> Questions answered by consensus:
>>>>>>>> >> - Format version releases should _not_ be tied to Iceberg version
>>>>>>>> >>   releases.
>>>>>>>> >> - Several planned features will require format version releases; the
>>>>>>>> >>   process shouldn't be onerous.
>>>>>>>> >>
>>>>>>>> >> Unanswered questions:
>>>>>>>> >> - What will be included in format version 3?
>>>>>>>> >>   - What is a reasonable target date?
>>>>>>>> >>   - How to track progress? Today, there are two public lists:
>>>>>>>> >>     - GH milestone: https://github.com/apache/iceberg/milestone/42
>>>>>>>> >>     - GH project: https://github.com/orgs/apache/projects/377
>>>>>>>> >> - What is required of a feature in order to be included in any adopted
>>>>>>>> >>   format version?
>>>>>>>> >>   - At least one complete reference implementation should exist.
>>>>>>>> >>     - Java is the reference implementation by convention; that's OK, but
>>>>>>>> >>       not perfect. Should Java be the reference implementation by mandate?
>>>>>>>> >>
>>>>>>>> >> Have I missed anything?
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Jacob Marble
>>>>>>>>
>>>>>>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>

-- 
Ryan Blue
Databricks
