Re: [DISCUSS] Future of Parquet Versioning

Alkis Evlogimenos via dev Thu, 11 Jun 2026 17:42:36 -0700

Great discussion. It looks like we are converging on the important part:
features need to be bundled. In the threads we used version/preset/epoch; I
will call it a bundle from here on. A bundle is a frozen set of features. A
bundle draws a clear line where we have:


1. the ability to co-land features coherently
2. a clear path to deprecate features
3. a clean UX for writes (target=bundle-2026/v3/whatever)

To make this work we need to ratify that writers declare the bundle when a
file uses its features, and readers fail the read if they don't support it.
This is the hybrid design Delta and Iceberg are converging on from opposite
directions btw: Delta moved from monolithic protocol versions to feature
flags plus named bundles, Iceberg is discussing decoupling features from
its voted versions.

Encodings already work like feature flags: a reader that can't decode an
encoding fails with a clear, local error. They have a safe path forward
today, so they can stay out of band. The tradeoff is that "supports bundle
X" won't cover encodings.

The part we need to solve is structural changes: deprecating
path_in_schema, non-contiguous pages, a new footer. These have no clean
failure mode in deployed readers. The version in the footer is
not salvageable as a signal. But it is a required field and we can use it
to poison old readers! I propose:

1. Add a bundle field to FileMetaData (optional in thrift for compatibility
with existing files, mandatory to write when the file uses bundle features).
2. Mark FileMetaData.version optional in thrift. A writer that sets the
bundle field omits version. A file carries exactly one of the two.
3. Readers that see the bundle field must support the declared bundle or
fail with an error naming it.
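
The three rules above can be sketched as a reader-side check. This is a minimal sketch, not an existing Parquet API: a plain dict stands in for the decoded FileMetaData, bundles are assumed to be named by strings, and `check_footer`, `SUPPORTED_BUNDLES` and both exception classes are hypothetical names.

```python
class MalformedFooterError(Exception):
    """Footer violates the 'exactly one of version/bundle' rule."""


class UnsupportedBundleError(Exception):
    """File declares a bundle this reader does not implement."""


# Hypothetical: the bundles this particular reader implements.
SUPPORTED_BUNDLES = frozenset({"bundle-2026"})


def check_footer(footer, supported=SUPPORTED_BUNDLES):
    """Return "legacy" or the declared bundle name, or raise.

    `footer` is a dict standing in for decoded FileMetaData. It may carry
    "version" (existing files) or "bundle" (bundle files), never both.
    """
    has_version = footer.get("version") is not None
    bundle = footer.get("bundle")

    if has_version and bundle is not None:
        raise MalformedFooterError("footer carries both version and bundle")
    if not has_version and bundle is None:
        raise MalformedFooterError("footer carries neither version nor bundle")
    if bundle is None:
        # Rule 1: the bundle field is optional, so existing files still read.
        return "legacy"
    if bundle not in supported:
        # Rule 3: fail with an error that names the bundle.
        raise UnsupportedBundleError(
            f"file requires bundle {bundle!r}, which this reader does not support"
        )
    return bundle
```

A file written with `{"bundle": "bundle-2027"}` would raise `UnsupportedBundleError` here, while a footer that still carries `version` passes through as a legacy file.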

The trick is (2): the deployed readers I checked hard-fail at footer parse
when FileMetaData.version is missing: parquet-java, arrow-cpp, parquet-rs
and DuckDB. They all enforce its presence even though the spec says to
ignore its value. Old readers fail immediately on open instead of tripping
on obscure errors later, or worse, reading bad data.
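
To illustrate why omitting version acts as a tripwire, here is a toy model of a pre-bundle reader. It is not the code of any of the readers named above: `legacy_parse_footer` and `KNOWN_FIELDS` are hypothetical, and a dict stands in for the thrift struct.

```python
# Hypothetical subset of fields a pre-bundle reader knows about.
KNOWN_FIELDS = {"version", "schema", "num_rows"}


def legacy_parse_footer(footer):
    """Toy model of a deployed reader that enforces `version` presence."""
    if "version" not in footer:
        # Fails at open time, before any column data is touched.
        raise ValueError("required field FileMetaData.version is unset")
    # Unknown fields (such as a future `bundle`) are silently skipped,
    # which is why the bundle field alone cannot stop an old reader.
    return {k: v for k, v in footer.items() if k in KNOWN_FIELDS}
```

In this model a footer that sets `bundle` and omits `version` is rejected on open, whereas a footer that carried both would be read as if the bundle were not there.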

This is a one-time cost and we should pay it as early as possible. Ratify
the first bundle's contents: path_in_schema deprecation plus other small
cleanups. From that point on, bundle-aware readers fail cleanly on future
bundles.


On Thu, Jun 11, 2026 at 8:28 PM Ryan Blue <[email protected]> wrote:

> Sorry for the duplicate, but Micah said the quote blocks didn't go through
> so I'm re-sending with `>` so that this is more readable.
>
> This is quite a large thread, so hopefully I am not missing any big points
> that have already been settled.
>
> If I understand correctly, it looks like there are some good things that we
> agree on. The most consequential is that we want to bundle features
> together.
>
> From Antoine’s response to my email about presets, I came away with the
> clarification that a preset acts as a bundle of features much like a
> version. This is a big step forward because it would be such a challenge
> for users to reason about every feature individually and to check support
> for features across implementations. The worst outcome, in my opinion,
> would be leaving users or administrators to deal with a wild west of
> feature flags, so I’m glad to see we’re making progress!
>
> This just leaves the details of how we want to manage those feature
> bundles. The preset option is a mechanistic approach to inclusion, while
> the version option relies on building consensus for features to include. I
> think this is a good question to focus on right now because I think we have
> yet to come to a shared understanding of both options in order to discuss
> the trade-offs between them.
>
> With the rest of this email, I’ll try to address some confusion I saw in
> the thread about how versions and/or presets would work. One good thing to
> note is that we’re primarily talking about forward-incompatible changes:
> changes that would cause older readers to fail and/or read data
> incorrectly.
>
> > Antoine: you also don’t know what’s in a version until the version gets
> > decided upon
>
> One of the strengths is that you do get to know what’s in a version ahead
> of time. The process is deliberate and predictable.
>
> When we add a new (forward-incompatible) feature it automatically goes into
> the next format version. I think that we would do this using a vote so that
> everyone here gets to take part in the decision. The features that we have
> agreed on are documented in the format for that version and when we want to
> close the version we have another vote to adopt it. Then new changes go
> into the next version.
>
> This procedure ensures that we agree, via community consensus, on what goes
> into a version and we accumulate a list that is predictable for
> implementations to target.
>
> > Antoine: IIRC the basis for this discussion was to inform Parquet writers
> > about which features can safely be enabled.
>
> I think this understates the problem and I prefer Andrew’s summary, which
> is that we need a way for readers and writers to coordinate about this
> problem. Writer flags alone are untenable because the number of things to
> coordinate is so large.
>
> There’s also a lot more to it than “can produce bits for X”. This gets to
> the hypothetical posed about Parquet 2.34. There are forward-incompatible
> features that appear readable by older clients but cause them to produce
> incorrect data. For instance, I could add a field to a page header that
> gives an offset that should be added to all values in a page, so that we
> can pack values in smaller bit widths. Older readers would skip the offset
> and produce bad values. That’s one reason why older readers should fail for
> newer feature bundles.
>
> Clearly documenting reader responsibilities is a big part of this work as
> well. If we assume that we will have readers that will fail (which we could
> do), we have to design features that force them to fail. And then we have
> to deal with bad error messages shown to users. So thinking through how our
> system of feature bundles works is really important; it is not just about
> collecting sets of writer flags.
>
> > There is no easy-to-read list of changes unless I am missing something.
>
> Yes, part of the problem is that the table of features was removed, which
> is a big part of what caused the current confusion about what v2 is.
>
> But this is a problem for both ways to manage version bundles, right?
>
> > 2) Let’s say Parquet 2.34 introduces features A and B. Let’s also say a
> > Parquet reader implements feature A but not feature B. What should this
> > reader do if you give it a file that has version 2.34 recorded in the
> > metadata? Should it error out (but perhaps the file only uses feature
> > A)? Or should it not error out (but perhaps the file uses feature B)?
>
> I mentioned this above, but I think this affects both strategies for
> bundling features and doesn’t really distinguish between them.
>
> I will note, though, that we have a clear rule about this in the Iceberg
> community that works well for versions. Readers must fail if they don’t
> recognize a version. If a reader knows about a version and has a feature
> gap, it can look for the missing feature and fail with a good error message
> but otherwise proceed. The main thing is that the feature bundle (version
> in Iceberg) is understood and is the primary way the group is coordinated.
>
> > Historically it’s been quite common to have this kind of jagged feature
> > adoption where implementations do not necessarily implement features in the
> > chronological order of their appearance in parquet-format.
>
> This is something that we are addressing with this discussion! The goal is
> a way to coordinate between readers and writers, right?
>
> > Readers already error out when they encounter an unknown encoding in a
> > column they are asked to read. What do we gain by having them also check
> > a version number?
>
> You cannot guarantee compatibility with reader failures alone, and you
> often want better handling of missing features than whatever failure
> happens to occur. I think Dan’s doc has a good section on this.
>
> Ryan
>
> On Thu, Jun 11, 2026 at 1:52 AM Antoine Pitrou <[email protected]> wrote:
>
> > On 10/06/2026 at 16:40, Micah Kornfield wrote:
> > >
> > >> In any case, this does not seem to be solving the problem of "as a user,
> > >> how do I enable features safely".
> > >
> > > Can you elaborate? Every feature listed after 2023 has the year it was
> > > introduced in parentheses next to it. I think this, in addition to the
> > > table showing the version that everything was supported in, can give a
> > > user a pretty good idea of what might be safe
> >
> > Ok, so concretely, what is a user supposed to do with these tables?
> >
> > I'm sorry for being so stubborn and insistent, but Parquet files are
> > produced routinely by data scientists and other people with no expert
> > knowledge of Parquet internals.
> >
> > If "how to produce an optimized Parquet file" takes an entire paragraph
> > to explain and requires diving into tables of features, then we haven't
> > solved the problem.
> >
> >
> > (also, even I don't know what to do with the information of "Arrow C++
> > does not support 2025 features": what does it bring to the reader?)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
