Re: [DISCUSS] Future of Parquet Versioning

Ryan Blue Thu, 11 Jun 2026 11:27:56 -0700

Sorry for the duplicate, but Micah said the quote blocks didn't go through
so I'm re-sending with `>` so that this is more readable.

This is quite a large thread, so hopefully I am not missing any big points
that have already been settled.

If I understand correctly, it looks like there are some good things that we
agree on. The most consequential is that we want to bundle features
together.

From Antoine’s response to my email about presets, I came away with the
clarification that a preset acts as a bundle of features much like a
version. This is a big step forward because it would be such a challenge
for users to reason about every feature individually and to check support
for features across implementations. The worst outcome, in my opinion,
would be leaving users or administrators to deal with a wild west of
feature flags, so I’m glad to see we’re making progress!

This just leaves the details of how we want to manage those feature
bundles. The preset option is a mechanistic approach to inclusion, while
the version option relies on building consensus for features to include. I
think this is a good question to focus on right now because we have yet to
reach the shared understanding of both options that we need in order to
discuss the trade-offs between them.

With the rest of this email, I’ll try to address some confusion I saw in
the thread about how versions and/or presets would work. One good thing to
note is that we’re primarily talking about forward-incompatible changes:
changes that would cause older readers to fail and/or read data incorrectly.

> Antoine: you also don’t know what’s in a version until the version gets
> decided upon

One of the strengths of versions is that you do get to know what’s in a
version ahead of time. The process is deliberate and predictable.

When we add a new (forward-incompatible) feature it automatically goes into
the next format version. I think that we would do this using a vote so that
everyone here gets to take part in the decision. The features that we have
agreed on are documented in the format for that version and when we want to
close the version we have another vote to adopt it. Then new changes go
into the next version.

This procedure ensures that we agree, via community consensus, on what goes
into a version and we accumulate a list that is predictable for
implementations to target.
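
As a rough sketch of that procedure (all names here are hypothetical, and
the starting version number 3 is only an assumption for illustration), the
bookkeeping could look something like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FormatSpec:
    """Hypothetical model of the process; not part of any Parquet tooling."""

    # Versions that have been closed by an adoption vote, with their features.
    adopted: Dict[int, List[str]] = field(default_factory=dict)
    # The version that is currently accumulating features (assumed number).
    open_version: int = 3
    open_features: List[str] = field(default_factory=list)

    def accept_feature(self, name: str) -> None:
        """Record a forward-incompatible feature after a successful vote."""
        self.open_features.append(name)

    def adopt_open_version(self) -> int:
        """Close the open version after the adoption vote.

        Features accepted after this point go into the next version.
        """
        closed = self.open_version
        self.adopted[closed] = list(self.open_features)
        self.open_version = closed + 1
        self.open_features = []
        return closed


spec = FormatSpec()
spec.accept_feature("feature-a")
spec.accept_feature("feature-b")
spec.adopt_open_version()
spec.accept_feature("feature-c")
print(spec.adopted)                            # {3: ['feature-a', 'feature-b']}
print(spec.open_version, spec.open_features)   # 4 ['feature-c']
```

The point of the sketch is only that the list for the open version is
visible while it accumulates, and frozen once the version is adopted.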

> Antoine: IIRC the basis for this discussion was to inform Parquet writers
> about which features can safely be enabled.

I think this understates the problem and I prefer Andrew’s summary, which
is that we need a way for readers and writers to coordinate. Writer flags
alone are untenable because the number of things to coordinate is so
large.

There’s also a lot more to it than “can produce bits for X”. This gets to
the hypothetical posed about Parquet 2.34. There are forward-incompatible
features that appear readable by older clients but cause them to produce
incorrect data. For instance, I could add a field to a page header that
gives an offset that should be added to all values in a page, so that we
can pack values in smaller bit widths. Older readers would skip the offset
and produce bad values. That’s one reason why older readers should fail for
newer feature bundles.
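
To make the offset example concrete, here is a minimal sketch. The `Page`
class, the `value_offset` field, and the reader and writer functions are
all hypothetical; none of them correspond to real Parquet structures or
APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical page; not the real Parquet page layout."""

    packed_values: List[int]  # values stored relative to value_offset
    value_offset: int = 0     # hypothetical new header field


def write_page(values: List[int]) -> Page:
    # Subtract the minimum so the stored values fit in fewer bits.
    base = min(values)
    return Page([v - base for v in values], value_offset=base)


def read_page_new(page: Page) -> List[int]:
    # A reader that knows about the field adds the offset back.
    return [v + page.value_offset for v in page.packed_values]


def read_page_old(page: Page) -> List[int]:
    # An older reader does not know the field and silently skips it.
    return list(page.packed_values)


values = [1000, 1003, 1001]
page = write_page(values)
print(read_page_new(page))  # [1000, 1003, 1001]
print(read_page_old(page))  # [0, 3, 1]
```

The older reader raises no error at all; it just returns wrong values,
which is the case that failing on an unknown feature bundle is meant to
prevent.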

Clearly documenting reader responsibilities is a big part of this work as
well. If we assume that some readers will fail (which we could do), we have
to design features that force them to fail, and then we have to deal with
bad error messages to users. So thinking through how our system for working
with feature bundles will behave is really important, not just collecting
sets of writer flags.

> There is no easy-to-read list of changes unless I am missing something.

Yes, part of the problem is that the table of features was removed, which
is a big part of what caused the current confusion about what v2 is.

But this is a problem for both ways to manage feature bundles, right?

> 2) Let’s say Parquet 2.34 introduces features A and B. Let’s also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?

I mentioned this above, but I think this affects both strategies for
bundling features and doesn’t really distinguish between them.

I will note, though, that we have a clear rule about this in the Iceberg
community that works well for versions. Readers must fail if they don’t
recognize a version. If a reader knows about a version and has a feature
gap, it can look for the missing feature and fail with a good error message
but otherwise proceed. The main thing is that the feature bundle (version
in Iceberg) is understood and is the primary way the group is coordinated.
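
A minimal sketch of that rule follows. The version numbers, feature names,
and exception classes are hypothetical and are not taken from the Iceberg
or Parquet specs; the sketch also assumes the reader can tell which
features a file actually uses.

```python
from typing import Dict, Set


class UnsupportedVersionError(Exception):
    pass


class UnsupportedFeatureError(Exception):
    pass


# Hypothetical: the features each known version bundles together.
VERSION_FEATURES: Dict[int, Set[str]] = {
    1: set(),
    2: {"A", "B"},
}
# Hypothetical: what this particular reader has implemented.
IMPLEMENTED_FEATURES: Set[str] = {"A"}


def check_readable(file_version: int, features_used: Set[str]) -> None:
    # Rule 1: a reader must fail if it does not recognize the version.
    if file_version not in VERSION_FEATURES:
        raise UnsupportedVersionError(
            f"file uses format version {file_version}, which this reader "
            "does not recognize"
        )
    # Rule 2: for a known version, look for features from the known gap.
    gap = VERSION_FEATURES[file_version] - IMPLEMENTED_FEATURES
    missing = sorted(gap & features_used)
    if missing:
        raise UnsupportedFeatureError(
            f"file uses version {file_version} feature(s) "
            f"{', '.join(missing)}, which this reader does not implement"
        )
    # Otherwise the reader proceeds.


check_readable(2, {"A"})  # known version, only implemented features: OK
```

With these assumptions, `check_readable(2, {"A", "B"})` fails with a
message naming feature B, and `check_readable(3, set())` fails because
version 3 is unknown to the reader.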

> Historically it’s been quite common to have this kind of jagged feature
> adoption where implementations do not necessarily implement features in the
> chronological order of their appearance in parquet-format.

This is something that we are addressing with this discussion! The goal is
a way to coordinate between readers and writers, right?

> Readers already error out when they encounter an unknown encoding in a
> column they are asked to read. What do we gain by having them also check
> a version number?

You cannot guarantee compatibility with reader failures alone, and even for
missing features you often want better handling than whatever failure
happens to occur. I think Dan’s doc has a good section on this.

Ryan

On Thu, Jun 11, 2026 at 1:52 AM Antoine Pitrou wrote:

> Le 10/06/2026 à 16:40, Micah Kornfield a écrit :
> >
> > In any case, this does not seem to be solving the problem of "as a user,
> >> how do I enable features safely".
> >
> > Can you elaborate?  Every feature listed after 2023 has the year it was
> > introduced in parentheses next to it.
> > I think this, in addition to the table showing the version that
> > everything was supported in, can give a user a pretty good idea of what
> > might be safe
>
> Ok, so concretely, what is a user supposed to do with these tables?
>
> I'm sorry for being so stubborn and insistent, but Parquet files are
> produced routinely by data scientists and other people with no expert
> knowledge of Parquet internals.
>
> If "how to produce an optimized Parquet file" takes an entire paragraph
> to explain and requires diving into tables of features, then we haven't
> solved the problem.
>
>
> (also, even I don't know what to do with the information of "Arrow C++
> does not support 2025 features": what does it bring to the reader?)
>
> Regards
>
> Antoine.
>
>
>
