Re: [DISCUSS] Future of Parquet Versioning

Micah Kornfield Thu, 11 Jun 2026 23:01:03 -0700

Hi Alkis and Ryan

> That’s why writer flags alone are untenable because the number of
> things to coordinate is so large.




> 1. the ability to co-land features coherently
> 2. a clear path to deprecate features
> 3. a clean UX for writes (target=bundle-2026/v3/whatever)


I think we are conflating a few issues that don't need to be.  There is the
UX component which allows users to understand what bundle of features they
are going to write, and then there is the reader/writer compatibility
component.  Making a "bundle" marker load bearing in the actual file is
something I think we should avoid because there can be orthogonal features
that we don't want to impose a linear relationship on.  Computers are quite
capable of handling low 10s to even 100s of bits of information, so at the
physical layer I don't think having fine grained features for indicating
compatibility should be an issue.
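
To make the fine-grained option concrete, here is a minimal sketch of a
reader-side check over per-feature bits.  The feature names and bit positions
are hypothetical and only for illustration; nothing here is defined by the
Parquet specification.

```python
# Hypothetical feature bits, for illustration only (not from the spec).
FEATURE_BITS = {
    "new_footer": 1 << 0,
    "non_contiguous_pages": 1 << 1,
    "no_path_in_schema": 1 << 2,
}


def unsupported_features(file_bits, reader_bits):
    """List the features a file requires that this reader does not support."""
    missing = file_bits & ~reader_bits
    names = [name for name, bit in FEATURE_BITS.items() if missing & bit]
    # Bits this reader has never heard of are reported too, so it fails
    # instead of silently ignoring them.
    known = 0
    for bit in FEATURE_BITS.values():
        known |= bit
    unknown = missing & ~known
    if unknown:
        names.append(f"unknown bits {unknown:#x}")
    return names
```

Because each feature has its own bit, orthogonal features need no ordering
between them; a reader only rejects a file when a bit it lacks is actually set.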

In particular for 1, releasing features frequently is a key requirement for
me.  I don't think proposals that make it difficult to release orthogonal
changes, or that add ambiguity or friction around what constitutes a released
feature, benefit the project in its current state.

For settling the UX issues I think there are two related issues:

a.  Versioning of the parquet specification releases. I think there are two
paths forward here:
          1.  We can try to follow SemVer where every forward incompatible
feature bumps the major version.  Given the desire to release frequently,
this might mean that for at least a few years the major version is bumped
multiple times.
          2.  We can keep the existing version as a functional counter
(either by keeping the minor version bump going forward or moving to a
major bump on every version).

b.  Determining feature "bundle" naming for end users.  Options:
      - Version number unrelated to the specification:
             - Pros:
                 - Better marketing?
                 - One of the strengths is that you do get to know what’s
in a version ahead of time. The process is deliberate and predictable.
                     -  @Ryan and @Russell, IIUC this is what you are
advocating for, I'm not sure I fully understand how this is or would be
different from the parquet specification release versions, or why the
deliberateness is better for the project.  Could you expand on your
reasoning?
             - Cons:
                  - Extra discussion on what and when a new version has
accumulated enough features to be released. Otherwise no benefit over
specification version

      - Specification version -
               - Pros:
                  - Single source of truth, no extra process to determine.
               - Cons:
                  - The version does not help a user understand potential
risk of using a feature without actually consulting documentation as to
what is in a particular version.

      - Features bundled by date (e.g. year, quarter or month) of release
of the parquet specification:
              - Pros:
                 - Unambiguous decision how to bundle features
                 - Provides users some sense of how long the feature set
has been in the ecosystem.
              - Cons:
                  - Might not reflect adoption across major implementations.

      - Features bundled by implementation adoption in a time frame (as I
understand it this is the presets proposal)
            - Pros:
                - Unambiguous version on how to bundle features once we
actually determine the criteria for inclusion.
                - Gives a better empirical sense of adoption of the feature (I
would argue possibly marginally so, given that there is a very large
proliferation of implementations already)
             - Cons:
                - More maintenance, if a particular implementation in the
rubric doesn't have enough contributors to push it forward there needs to
be another decision to not include new features in the preset or drop the
reader.
               -  We still need to discuss the set of implementations.


Based on the pros and cons, I think I would still lean towards features
bundled by the date of the parquet specification release, but I think I
might be weighting some requirements over others, so I'd still like to
suss out what the actual requirements are.



> The trick is (2): the deployed readers I checked hard-fail at footer parse
> when FileMetaData.version is missing: parquet-java, arrow-cpp, parquet-rs
> and DuckDB. They all enforce its presence even though the spec says to
> ignore its value. Old readers fail immediately on open instead of tripping
> on obscure errors later, or worse, reading bad data.


Deployed readers still get an obscure thrift error with this.  If we don't
care about that, I'm not sure I see the reason for holding up the
path_in_schema field change, as it produces the same experience.

> The part we need to solve is structural changes: deprecating
> path_in_schema, non-contiguous pages, a new footer. These have no clean
> failure mode in deployed readers. The version in the footer is
> not salvageable as a signal. But it is a required field and we can use it
> to poison old readers! I propose:


> There’s also a lot more to it than “can produce bits for X”. This gets to
> the hypothetical posed about Parquet 2.34. There are forward-incompatible
> features that appear readable by older clients but cause them to produce
> incorrect data. For instance, I could add a field to a page header that
> gives an offset that should be added to all values in a page, so that we
> can pack values in smaller bit widths. Older readers would skip the offset
> and produce bad values. That’s one reason why older readers should fail for
> newer feature bundles.


Here is an alternative that I think covers both of these in a
more elegant way.  For any new forward incompatible structural change we
require it be written to parquet with a new magic number.  Specifically we
do a one time bump with a new magic number (e.g. "PARX" for Parquet
extended).  The structure of a PARX file is as follows:

PARX <Parquet File Content> <Breaking Structural Feature Tags or Bitmap> PARX

- Parquet File Content - Content as it exists today between the PAR1
header/footer magic numbers (or as it mutates based on additions to the
bitmap/tags defined after it; e.g. there would be a marker if there is a
new field added to a page header to parse).
- Breaking Structural Feature Tags or Bitmap - Tracks breaking structural
changes or any change that is not already detectable by existing metadata
(e.g. added encodings would not be registered here).

Each breaking feature would receive its own tag or bit in the bitmap; I think
the exact details can be deferred until we have alignment. In older readers
this would produce a slightly better error message: "unrecognized parquet
file" or "not a parquet file".
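
As a rough illustration of the layout above, the following sketch wraps
content in PARX magic and reads the trailer back.  The exact encoding is
deliberately left open in this proposal, so the fixed 8-byte little-endian
bitmap placed right before the trailing magic is purely an assumption, as are
the function names.

```python
import struct

LEGACY_MAGIC = b"PAR1"
EXTENDED_MAGIC = b"PARX"  # name taken from the proposal above
BITMAP_SIZE = 8  # assumption: fixed-width little-endian bitmap


def wrap_extended(content, feature_bits):
    """Build PARX <content> <bitmap> PARX (hypothetical encoding)."""
    trailer = struct.pack("<Q", feature_bits)
    return EXTENDED_MAGIC + content + trailer + EXTENDED_MAGIC


def read_breaking_features(data, supported_bits):
    """Return the file's breaking-feature bitmap or raise ValueError."""
    if len(data) < 8:
        raise ValueError("not a parquet file")
    head, tail = data[:4], data[-4:]
    if head == LEGACY_MAGIC and tail == LEGACY_MAGIC:
        return 0  # PAR1 files carry no breaking structural features
    if (head != EXTENDED_MAGIC or tail != EXTENDED_MAGIC
            or len(data) < 8 + BITMAP_SIZE):
        raise ValueError("not a parquet file")
    (bits,) = struct.unpack("<Q", data[-4 - BITMAP_SIZE:-4])
    missing = bits & ~supported_bits
    if missing:
        raise ValueError(f"unsupported breaking features: {missing:#b}")
    return bits
```

A reader deployed today only checks for PAR1, so it rejects a PARX file at
the magic-number check before it ever parses the footer.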

It also means that for features that are forward compatible we can continue
to use existing PAR1 without breaking a lot of older readers. Further, it
means older readers have a higher chance of not breaking if a user requests
a file written with a "higher" feature bundle than what they know about but
the writer did not end up using any forward-incompatible features.  This is
important for two reasons:
-  As anecdata, I've personally been bitten by overly stringent validation
on a version that did not effectively serve a purpose other than to check
for numerical equality of the feature number
- If a user bumps up to a new bundle accidentally (not all of the parquet
readers the files are shared with support it), minimizing the blast radius
can reduce the time to rewrite files back to an acceptable set of features.
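
A small sketch of this decoupling, under assumed names: the bundle is only a
writer-side limit on which features may be used, while the file records just
the features that were actually used.  The bundle names and
feature_a/feature_b are hypothetical.

```python
# Hypothetical bundles: user-facing names mapped to features a writer MAY use.
BUNDLES = {
    "2025": frozenset({"feature_a"}),
    "2026": frozenset({"feature_a", "feature_b"}),
}


def features_to_record(bundle, used):
    """Validate the used features against the bundle; record only those used."""
    extra = set(used) - BUNDLES[bundle]
    if extra:
        raise ValueError(
            f"features not in bundle {bundle}: {sorted(extra)}")
    return frozenset(used)


def can_read(recorded, reader_supported):
    """A reader needs only the recorded features, not the whole bundle."""
    return set(recorded) <= set(reader_supported)
```

With this shape, a file written with target "2026" that only used feature_a
stays readable by a reader that has never heard of the "2026" bundle.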


> Delta moved from monolithic protocol versions to feature
> flags plus named bundles,


I think the Delta model of fine-grained features [1] is better suited for
embedding in the file.  I'm not actually sure there are feature bundles
except maybe as implementation details of specific releases, but I could be
mistaken.


[1] https://github.com/delta-io/delta/blob/master/PROTOCOL.md






On Thu, Jun 11, 2026 at 5:43 PM Alkis Evlogimenos via dev <
[email protected]> wrote:

> Great discussion. It looks like we are converging on the important part:
> features need to be bundled. In the threads we used version/preset/epoch; I
> will call it a bundle from here on. A bundle is a frozen set of features. A
> bundle draws a clear line where we have:
>
> 1. the ability to co-land features coherently
> 2. a clear path to deprecate features
> 3. a clean UX for writes (target=bundle-2026/v3/whatever)
>
> To make this work we need to ratify that writers declare the bundle when a
> file uses its features, and readers fail the read if they don't support it.
> This is the hybrid design Delta and Iceberg are converging on from opposite
> directions btw: Delta moved from monolithic protocol versions to feature
> flags plus named bundles, Iceberg is discussing decoupling features from
> its voted versions.
>
> Encodings already work like feature flags: a reader that can't decode an
> encoding fails with a clear, local error. They have a safe path forward
> today, so they can stay out of band. The tradeoff is that "supports bundle
> X" won't cover encodings.
>
> The part we need to solve is structural changes: deprecating
> path_in_schema, non-contiguous pages, a new footer. These have no clean
> failure mode in deployed readers. The version in the footer is
> not salvageable as a signal. But it is a required field and we can use it
> to poison old readers! I propose:
>
> 1. Add a bundle field to FileMetaData (optional in thrift for compatibility
> with existing files, mandatory to write when the file uses bundle
> features).
> 2. Mark FileMetaData.version optional in thrift. A writer that sets the
> bundle field omits version. A file carries exactly one of the two.
> 3. Readers that see the bundle field must support the declared bundle or
> fail with an error naming it.
>
> The trick is (2): the deployed readers I checked hard-fail at footer parse
> when FileMetaData.version is missing: parquet-java, arrow-cpp, parquet-rs
> and DuckDB. They all enforce its presence even though the spec says to
> ignore its value. Old readers fail immediately on open instead of tripping
> on obscure errors later, or worse, reading bad data.
>
> This is a one time cost and we should pay it as early as possible. Ratify
> the first bundle's contents: path_in_schema deprecation plus other small
> cleanups. From that point on bundle aware readers fail cleanly on future
> bundles.
>
>
> On Thu, Jun 11, 2026 at 8:28 PM Ryan Blue <[email protected]> wrote:
>
> > Sorry for the duplicate, but Micah said the quote blocks didn't go through
> > so I'm re-sending with `>` so that this is more readable.
> >
> > This is quite a large thread, so hopefully I am not missing any big points
> > that have already been settled.
> >
> > If I understand correctly, it looks like there are some good things that we
> > agree on. The most consequential is that we want to bundle features
> > together.
> >
> > From Antoine’s response to my email about presets, I came away with the
> > clarification that a preset acts as a bundle of features much like a
> > version. This is a big step forward because it would be such a challenge
> > for users to reason about every feature individually and to check support
> > for features across implementations. The worst outcome, in my opinion,
> > would be leaving users or administrators to deal with a wild west of
> > feature flags, so I’m glad to see we’re making progress!
> >
> > This just leaves the details of how we want to manage those feature
> > bundles. The preset option is a mechanistic approach to inclusion, while
> > the version option relies on building consensus for features to include. I
> > think this is a good question to focus on right now because I think we have
> > yet to come to a shared understanding of both options in order to discuss
> > the trade-offs between them.
> >
> > With the rest of this email, I’ll try to address some confusion I saw in
> > the thread about how versions and/or presets would work. One good thing to
> > note is that we’re primarily talking about forward-incompatible changes:
> > changes that would cause older readers to fail and/or read data
> > incorrectly.
> >
> > > Antoine: you also don’t know what’s in a version until the version gets
> > > decided upon
> >
> > One of the strengths is that you do get to know what’s in a version ahead
> > of time. The process is deliberate and predictable.
> >
> > When we add a new (forward-incompatible) feature it automatically goes into
> > the next format version. I think that we would do this using a vote so that
> > everyone here gets to take part in the decision. The features that we have
> > agreed on are documented in the format for that version and when we want to
> > close the version we have another vote to adopt it. Then new changes go
> > into the next version.
> >
> > This procedure ensures that we agree, via community consensus, on what goes
> > into a version and we accumulate a list that is predictable for
> > implementations to target.
> >
> > > Antoine: IIRC the basis for this discussion was to inform Parquet writers
> > > about which features can safely be enabled.
> >
> > I think this understates the problem and I prefer Andrew’s summary, which
> > is that we need a way for readers and writers to coordinate about this
> > problem. That’s why writer flags alone are untenable because the number of
> > things to coordinate is so large.
> >
> > There’s also a lot more to it than “can produce bits for X”. This gets to
> > the hypothetical posed about Parquet 2.34. There are forward-incompatible
> > features that appear readable by older clients but cause them to produce
> > incorrect data. For instance, I could add a field to a page header that
> > gives an offset that should be added to all values in a page, so that we
> > can pack values in smaller bit widths. Older readers would skip the offset
> > and produce bad values. That’s one reason why older readers should fail for
> > newer feature bundles.
> >
> > Clearly documenting reader responsibilities is a big part of this work as
> > well. If we assume that we will have readers that will fail (which we could
> > do), we have to design features that force them to fail. And then we have
> > to deal with bad error messages to users. So thinking through how our
> > system of working with feature bundles is really important, not just for
> > collecting sets of writer flags.
> >
> > > There is no easy-to-read list of changes unless I am missing something.
> >
> > Yes, part of the problem is that the table of features was removed, which
> > is a big part of what caused the current confusion about what v2 is.
> >
> > But this is a problem for both ways to manage version bundles, right?
> >
> > > 2) Let’s say Parquet 2.34 introduces features A and B. Let’s also say a
> > > Parquet reader implements feature A but not feature B. What should this
> > > reader do if you give it a file that has version 2.34 recorded in the
> > > metadata? Should it error out (but perhaps the file only uses feature
> > > A)? Or should it not error out (but perhaps the file uses feature B)?
> >
> > I mentioned this above, but I think this affects both strategies for
> > bundling features and doesn’t really distinguish between them.
> >
> > I will note, though, that we have a clear rule about this in the Iceberg
> > community that works well for versions. Readers must fail if they don’t
> > recognize a version. If a reader knows about a version and has a feature
> > gap, it can look for the missing feature and fail with a good error message
> > but otherwise proceed. The main thing is that the feature bundle (version
> > in Iceberg) is understood and is the primary way the group is coordinated.
> >
> > > Historically it’s been quite common to have this kind of jagged feature
> > > adoption where implementations do not necessarily implement features in
> > > the chronological order of their appearance in parquet-format.
> >
> > This is something that we are addressing with this discussion! The goal is
> > a way to coordinate between readers and writers, right?
> >
> > > Readers already error out when they encounter an unknown encoding in a
> > > column they are asked to read. What do we gain by having them also check
> > > a version number?
> >
> > You cannot guarantee compatibility with reader failures alone, and you
> > often want better support for even missing features than you get with
> > whatever failure occurs. I think Dan’s doc has a good section on this.
> >
> > Ryan
> >
> > On Thu, Jun 11, 2026 at 1:52 AM Antoine Pitrou <[email protected]> wrote:
> >
> > > Le 10/06/2026 à 16:40, Micah Kornfield a écrit :
> > > >
> > > >> In any case, this does not seem to be solving the problem of "as a
> > > >> user, how do I enable features safely".
> > > >
> > > > Can you elaborate?  Every feature listed after 2023 has the year it was
> > > > introduced in parentheses next to it. I think this, in addition to
> > > > the table showing the version that everything was supported in, can
> > > > give a user a pretty good idea of what might be safe
> > >
> > > Ok, so concretely, what is a user supposed to do with these tables?
> > >
> > > I'm sorry for being so stubborn and insistent, but Parquet files are
> > > produced routinely by data scientists and other people with no expert
> > > knowledge of Parquet internals.
> > >
> > > If "how to produce an optimized Parquet file" takes an entire paragraph
> > > to explain and requires diving into tables of features, then we haven't
> > > solved the problem.
> > >
> > >
> > > (also, even I don't know what to do with the information of "Arrow C++
> > > does not support 2025 features": what does it bring to the reader?)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
