Re: [DISCUSS] Future of Parquet Versioning

Ryan Blue Fri, 05 Jun 2026 12:57:34 -0700

   - backwards *compatible*: old readers can still read files (e.g.
   PageIndex, new logical types)
   - backwards *incompatible*: old readers can not still read the files
   (e.g. new encodings, proposed path_in_schema removal, …)


I agree with the categories, but I want to be careful about terminology. I
would call these *forward* compatible or *forward* incompatible. The reason
is that *backward* compatible usually means that newer versions can
interact with older data, rather than older versions interacting with newer
data.

For example, backward compatibility would mean that although a version
writes DataPageV2, it can still read DataPageV1. On the other hand, forward
compatibility is when we design features so that older readers can safely
ignore them if they don’t know about them, like additional Thrift fields
that are not necessary for correctly reading the data, but may allow
clients to find specific data more quickly.
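
To make the distinction concrete, here is a rough Python sketch of how an
older reader might treat the two kinds of change. Everything in it is a
simplifying assumption for illustration: the dict-based metadata, the field
name `page_index_offset`, the encoding name `FANCY_NEW`, and the
`old_reader` function are all hypothetical and are not the real
parquet-format Thrift definitions or any library's API.

```python
# Hypothetical, simplified model of an older reader. The field and encoding
# names are illustrative only and are not taken from the Parquet spec.
KNOWN_FIELDS = {"num_rows", "encoding"}
KNOWN_ENCODINGS = {"PLAIN", "RLE"}


class UnsupportedFeature(Exception):
    """Raised when the reader meets something it cannot safely ignore."""


def old_reader(metadata: dict) -> dict:
    """Interpret file metadata the way an older reader would."""
    # Unknown optional fields are dropped, much as a Thrift decoder skips
    # field ids it does not recognize.
    understood = {k: v for k, v in metadata.items() if k in KNOWN_FIELDS}

    # An unknown encoding cannot be skipped: the data itself is unreadable.
    if understood["encoding"] not in KNOWN_ENCODINGS:
        raise UnsupportedFeature(understood["encoding"])
    return understood


# Forward compatible: a new optional field that old readers can ignore.
file_with_index = {"num_rows": 3, "encoding": "PLAIN", "page_index_offset": 128}

# Forward incompatible: a new encoding that old readers cannot decode.
file_with_new_encoding = {"num_rows": 3, "encoding": "FANCY_NEW"}
```

In this sketch, `old_reader(file_with_index)` succeeds and simply loses the
extra hint, while `old_reader(file_with_new_encoding)` raises
`UnsupportedFeature`. Failing loudly is the better outcome for the second
case; silently producing wrong results would be worse.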

I tend to use “forward-incompatible” for breaking changes that would cause
any existing reader to fail or produce incorrect results.

Ryan

On Fri, Jun 5, 2026 at 7:14 AM Andrew Lamb <[email protected]> wrote:

> Ryan and Dan made a great point on the call the other day that there are
> two categories of new features:
> - backwards **compatible**: old readers can still read files (e.g.
> PageIndex, new logical types)
> - backwards **incompatible**: old readers can not still read the files
> (e.g. new encodings, proposed path_in_schema removal, ...)
>
> The new features / changes we have recently approved and added to the spec
> are mostly **backwards compatible** (e.g. Variant) and thus didn't need
> ecosystem-wide coordination.
>
> I think there is more friction on new incompatible changes (older readers
> will not be able to read files written with these features)
>
> I agree with Dan, Ryan and others that unless we define some signal in the
> file itself (e.g. version 3 😬) it will be close to impossible for users to
> understand which features are compatible with other systems
>
> To help this process along, I made a PR [1] that documents more clearly
> which features are in version 1 and which are in version 2. I also drafted
> an example of what "V3" could look like [2].
>
> Andrew
>
> [1]: https://github.com/apache/parquet-site/pull/186
> [2]: https://github.com/alamb/parquet-site/pull/1
>
> On Fri, Jun 5, 2026 at 8:39 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > The purpose of the presets proposal is not to inform readers but to help
> > users make a decision about which features to enable when writing a
> > Parquet file.
> >
> > For example, a user of PyArrow could, instead of passing an elaborate
> > set of flags, call `pq.write_table(tab, 'file.pq', preset='2024-01')`.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 05/06/2026 à 00:01, Andrew Bell a écrit :
> > > How can a reader know that it has the tooling to read a file with this
> > > approach? What is the hesitation to change version numbers?
> > >
> > > --
> > >
> > > Andrew Bell
> > > [email protected]
> > >
> > > On Thu, Jun 4, 2026, 4:37 PM Ed Seidl <[email protected]> wrote:
> > >
> > >> On 2026/06/04 20:17:45 Ryan Blue wrote:
> > >>> What's a preset? Could you describe the idea in this discussion so we
> > >>> can keep it in one place?
> > >>>
> > >>
> > >> The concept was introduced earlier in this thread by Antoine.
> > >> https://lists.apache.org/thread/gvw48wrkhgl83jljhd1hzb668ys9zvqx
> > >>
> > >> Cheers,
> > >> Ed
> > >>
> > >
> >
> >
> >
>
