Re: [DISCUSS] Future of Parquet Versioning

Ryan Blue Thu, 11 Jun 2026 11:00:57 -0700

This is quite a large thread, so hopefully I am not missing any big points
that have already been settled.

If I understand correctly, it looks like there are some good things that we
agree on. The most consequential is that we want to bundle features
together.

From Antoine’s response to my email about presets, I came away with the
clarification that a preset acts as a bundle of features much like a
version. This is a big step forward because it would be such a challenge
for users to reason about every feature individually and to check support
for features across implementations. The worst outcome, in my opinion,
would be leaving users or administrators to deal with a wild west of
feature flags, so I’m glad to see we’re making progress!

This just leaves the details of how we want to manage those feature
bundles. The preset option is a mechanistic approach to inclusion, while
the version option relies on building consensus for features to include. I
think this is a good question to focus on right now because we have yet to
reach the shared understanding of both options that we need in order to
discuss the trade-offs between them.

With the rest of this email, I’ll try to address some confusion I saw in
the thread about how versions and/or presets would work. One good thing to
note is that we’re primarily talking about forward-incompatible changes:
changes that would cause older readers to fail and/or read data incorrectly.

> Antoine: you also don’t know what’s in a version until the version gets
> decided upon

One of the strengths is that you *do* get to know what’s in a version ahead
of time. The process is deliberate and predictable.

When we add a new (forward-incompatible) feature it automatically goes into
the next format version. I think that we would do this using a vote so that
everyone here gets to take part in the decision. The features that we have
agreed on are documented in the format for that version and when we want to
close the version we have another vote to adopt it. Then new changes go
into the next version.

This procedure ensures that we agree, via community consensus, on what goes
into a version and we accumulate a list that is predictable for
implementations to target.
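
As a rough illustration of the procedure described above, here is a minimal Python sketch. The class and method names (`FormatSpec`, `add_feature`, `close_version`) and the feature names are hypothetical and exist only for this example; votes are reduced to a boolean.

```python
class FormatSpec:
    """Toy model of a format spec with one open version at a time."""

    def __init__(self):
        self.adopted = {}      # version number -> frozen tuple of features
        self.next_version = 1  # the version currently accepting features
        self.pending = []      # features already voted into the open version

    def add_feature(self, name, vote_passed):
        # A forward-incompatible feature joins the open version only by vote.
        if not vote_passed:
            return False
        self.pending.append(name)
        return True

    def close_version(self, vote_passed):
        # A second vote adopts the version and freezes its feature list.
        if not vote_passed:
            return None
        version = self.next_version
        self.adopted[version] = tuple(self.pending)
        self.pending = []
        self.next_version += 1
        return version


spec = FormatSpec()
spec.add_feature("feature_a", vote_passed=True)
spec.add_feature("feature_b", vote_passed=True)
# The contents of the open version are visible before it is adopted.
preview = list(spec.pending)
adopted = spec.close_version(vote_passed=True)
# Later changes land in the next version.
spec.add_feature("feature_c", vote_passed=True)
```

In this sketch, `preview` already lists the features that version 1 will contain, which is the sense in which the list is predictable for implementations to target.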

> Antoine: IIRC the basis for this discussion was to inform Parquet
> *writers* about which features can safely be enabled.

I think this understates the problem and I prefer Andrew’s summary, which
is that we need a way for readers and writers to coordinate about this
problem. That’s why writer flags alone are untenable: the number of things
to coordinate is too large.

There’s also a lot more to it than “can produce bits for X”. This gets to
the hypothetical posed about Parquet 2.34. There are forward-incompatible
features that appear readable by older clients but cause them to produce
incorrect data. For instance, I could add a field to a page header that
gives an offset that should be added to all values in a page, so that we
can pack values in smaller bit widths. Older readers would skip the offset
and produce bad values. That’s one reason why older readers should fail for
newer feature bundles.
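
The page-offset example can be sketched in a few lines of Python. This is only an illustration: the `Page` type and the `value_offset` field are hypothetical and are not part of the Parquet format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Values stored after subtracting a per-page offset, so they fit in
    # fewer bits. Both fields are hypothetical, for illustration only.
    packed_values: List[int]
    value_offset: int = 0  # the new, forward-incompatible header field


def read_page_old(page):
    # An older reader does not know about value_offset and skips it.
    return list(page.packed_values)


def read_page_new(page):
    # A newer reader adds the offset back to every value.
    return [v + page.value_offset for v in page.packed_values]


page = Page(packed_values=[0, 3, 7], value_offset=1000)
old_result = read_page_old(page)  # [0, 3, 7]: wrong, and no error raised
new_result = read_page_new(page)  # [1000, 1003, 1007]
```

The older reader returns values without raising anything, so nothing at the encoding level tells the user that the data is wrong.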

Clearly documenting reader responsibilities is a big part of this work as
well. If we assume that we will have readers that will fail (which we
*could* do), we have to design features that force them to fail. And then
we have to deal with bad error messages to users. So thinking through how
our system of feature bundles works is really important, not just for
collecting sets of writer flags.

> There is no easy-to-read list of changes unless I am missing something.

Yes, part of the problem is that the table of features was removed, which
is a big part of what caused the current confusion about what v2 is.

But this is a problem for both ways to manage version bundles, right?

> 2) Let’s say Parquet 2.34 introduces features A and B. Let’s also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?

I mentioned this above, but I think this affects both strategies for
bundling features and doesn’t really distinguish between them.

I will note, though, that we have a clear rule about this in the Iceberg
community that works well for versions. Readers must fail if they don’t
recognize a version. If a reader knows about a version and has a feature
gap, it can look for the missing feature and fail with a good error message
but otherwise proceed. The main thing is that the feature bundle (version
in Iceberg) is understood and is the primary way the group is coordinated.
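
A minimal Python sketch of that reader rule follows. The version numbers, feature names, and the way a reader learns which features a file uses are all assumptions made for illustration; this is not Iceberg’s or Parquet’s actual API.

```python
class UnknownVersionError(Exception):
    """The file declares a version this reader does not recognize."""


class MissingFeatureError(Exception):
    """The version is known, but the file uses a feature the reader lacks."""


# Hypothetical reader configuration, assumed for this example.
KNOWN_VERSIONS = {1, 2, 3}
UNIMPLEMENTED = {3: {"feature_b"}}  # known feature gaps, per version


def check_readable(file_version, features_in_file):
    # Rule 1: a reader must fail on a version it does not recognize.
    if file_version not in KNOWN_VERSIONS:
        raise UnknownVersionError(
            f"unknown format version {file_version}; "
            f"this reader knows {sorted(KNOWN_VERSIONS)}"
        )
    # Rule 2: for a known version, look for the features this reader is
    # missing and fail with a specific message only if the file uses one.
    gaps = UNIMPLEMENTED.get(file_version, set()) & set(features_in_file)
    if gaps:
        raise MissingFeatureError(
            f"version {file_version} file uses unsupported "
            f"feature(s): {sorted(gaps)}"
        )
    return True
```

With this configuration, `check_readable(3, {"feature_a"})` proceeds, `check_readable(3, {"feature_b"})` fails with a message naming the feature, and `check_readable(4, set())` fails because the version itself is unknown.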

> Historically it’s been quite common to have this kind of jagged feature
> adoption where implementations do not necessarily implement features in the
> chronological order of their appearance in parquet-format.

This is something that we are addressing with this discussion! The goal is
a way to coordinate between readers and writers, right?

> Readers *already* error out when they encounter an unknown encoding in a
> column they are asked to read. What do we gain by having them *also* check
> a version number?

You cannot guarantee compatibility with reader failures alone, and even for
missing features you often want better handling than whatever failure
happens to occur. I think Dan’s doc has a good section on this.

Ryan

On Wed, Jun 10, 2026 at 3:25 PM Ryan Blue <[email protected]> wrote:

>
> On Wed, Jun 10, 2026 at 5:57 AM Antoine Pitrou <[email protected]> wrote:
>
>>
>> Le 10/06/2026 à 13:14, Andrew Lamb a écrit :
>> >   > The problem I think we're trying to solve is to make it easier and
>> safer
>> >> for users to enable modern features that produce more optimized and
>> more
>> >> efficient Parquet files.
>> >
>> > I agree
>> >
>> >> I agree with the "non-trivial consensus" problem, and that's the point
>> >> of calendar-based presets: they eschew the need for "non-trivial
>> >> consensus" as they are based on actual adoption. :-)
>> >
>> > To be clear I am not opposed to presets (or some other schemes to make
>> > adoption clearer)
>> >
>> > In fact, as perhaps you are hinting at, the implementation status
>> page[1]
>> > already has a table with yearly adoption ("Minimum Version for Read
>> Support
>> > by Year"). Perhaps that is enough
>>
>> I don't understand what this table means or how it's supposed to be
>> utilized. What is a "2025 Feature"? Why does Arrow C++ not support "2025
>> Features"?
>>
>> In any case, this does not seem to be solving the problem of "as a user,
>> how do I enable features safely".
>>
>> Regards
>>
>> Antoine.
>>
>>
>>
