Re: [DISCUSS] Future of Parquet Versioning

Antoine Pitrou Wed, 10 Jun 2026 00:00:30 -0700


On 09/06/2026 at 19:19, Andrew Lamb wrote:

I don't understand how it's useful:
1) At this point it's too late, the Parquet file was written already, so
this is not solving the user's problem of "how do I choose a safe
feature set".


In my mind, the format version is exactly a shared vocabulary for readers
and writers to agree on a safe feature set.

For example, if a writer wants to ensure Spark 4.0 can read their files (I
am making up version numbers), they look up and find that Spark supports
features in parquet-format 2.11 and restrict themselves to just those
features.

What if Spark supports some features from 2.12, but doesn't support all the features from 2.11 (or even 2.6), for example?

Historically it's been quite common to have this kind of jagged feature adoption where implementations do not necessarily implement features in the chronological order of their appearance in parquet-format. Just because something is in parquet-format doesn't mean it will get wide adoption.

(perhaps some Parquet readers still don't implement modular encryption, for example? and let's not talk about INT96 timestamps or LZO compression...)
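
To illustrate the jagged-adoption problem, here is a minimal sketch. The feature names and version numbers are made up, and `highest_safe_version` and `hidden_features` are hypothetical helpers, not part of any Parquet library. It assumes that advertising version N means supporting every feature introduced up to N.

```python
# Hypothetical data: feature names and version numbers are invented
# purely for illustration.
INTRODUCED_IN = {
    "feature_x": (2, 6),
    "feature_y": (2, 11),
    "feature_z": (2, 12),
}


def highest_safe_version(supported):
    """Highest version such that every feature up to it is supported.

    Returns None if even the earliest listed feature is missing.
    """
    safe = None
    for version in sorted(set(INTRODUCED_IN.values())):
        needed = {f for f, v in INTRODUCED_IN.items() if v <= version}
        if not needed <= supported:
            break
        safe = version
    return safe


def hidden_features(supported):
    """Supported features that a single version number cannot advertise."""
    safe = highest_safe_version(supported)
    return {
        f for f in supported
        if f in INTRODUCED_IN and (safe is None or INTRODUCED_IN[f] > safe)
    }


# A reader that skipped feature_y but implements the newer feature_z.
jagged_reader = {"feature_x", "feature_z"}
```

Under these assumptions, `jagged_reader` can only advertise 2.6, so a writer relying on the version number alone would never learn that `feature_z` is safe to use.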

2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
Parquet reader implements feature A but not feature B. What should this
reader do if you give it a file that has version 2.34 recorded in the
metadata? Should it error out (but perhaps the file only uses feature
A)? Or should it not error out (but perhaps the file uses feature B)?


I would suggest:
1. Basic readers: error out (simplest to code, and easiest to explain the
behavior, even though some readable files may be rejected), with a user
defined "ignore version" field
2. Advanced readers: try to check the file for features in 2.34 that they
don't support (e.g. the use of the new ALP encoding) and error if present

Readers *already* error out when they encounter an unknown encoding in a column they are asked to read. What do we gain by having them *also* check a version number?
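
As a rough sketch of that point, assume a hypothetical reader that knows the PLAIN encoding and feature A but not feature B. All types, names and version numbers below are invented for illustration and do not correspond to an existing Parquet API.

```python
from dataclasses import dataclass


class UnsupportedFile(Exception):
    """Raised when the reader refuses to read a file."""


@dataclass
class Column:
    name: str
    encoding: str


@dataclass
class File:
    version: tuple
    columns: list


# Hypothetical reader capabilities: FEATURE_A is implemented, FEATURE_B is not.
SUPPORTED_ENCODINGS = {"PLAIN", "FEATURE_A"}
# Assumed last version for which this reader implements every feature.
MAX_KNOWN_VERSION = (2, 33)


def encoding_checked_read(file, wanted):
    """Fail only if a requested column uses an unknown encoding."""
    selected = [c for c in file.columns if c.name in wanted]
    for col in selected:
        if col.encoding not in SUPPORTED_ENCODINGS:
            raise UnsupportedFile(f"unknown encoding {col.encoding!r} in {col.name!r}")
    return [c.name for c in selected]


def version_gated_read(file, wanted):
    """Reject on the version number first, then do the per-column check anyway."""
    if file.version > MAX_KNOWN_VERSION:
        raise UnsupportedFile(f"file version {file.version} is too new")
    return encoding_checked_read(file, wanted)


# A 2.34 file where only column "b" uses the unsupported feature.
example = File(
    version=(2, 34),
    columns=[Column("a", "FEATURE_A"), Column("b", "FEATURE_B")],
)
```

In this sketch, `encoding_checked_read(example, {"a"})` succeeds and fails only when column `b` is requested, while `version_gated_read` rejects the file outright even though column `a` is readable.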

For more advanced use cases and readers without complete support, the writer
could do more nuanced research about what extra flags / features to enable

This is the status quo, and it doesn't work well as users generally settle on the conservative defaults exposed by mainstream writers.

The problem I think we're trying to solve is to make it easier and safer for users to enable modern features that produce more optimized and more efficient Parquet files.

We can probably come up with other more precise ways to communicate
individual feature support (feature buckets, feature matrices, etc) but
they all seem complicated (and require non trivial consensus on what
constitutes "major features", for example)

I agree with the "non-trivial consensus" problem, and that's the point of calendar-based presets: they eschew the need for "non-trivial consensus" as they are based on actual adoption. :-)


Regards

Antoine.
