On 09/06/2026 at 19:19, Andrew Lamb wrote:
I don't understand how it's useful:
1) At this point it's too late, the Parquet file was written already, so
this is not solving the user's problem of "how do I choose a safe
feature set".
In my mind, the format version is exactly a shared vocabulary for readers
and writers to agree on a safe feature set.
For example, if a writer wants to ensure Spark 4.0 can read their files (I
am making up version numbers), they look up and find that Spark supports
features in parquet-format 2.11 and restrict themselves to just those
features.
What if Spark supports some features from 2.12, but doesn't support all
the features from 2.11 (or even 2.6), for example?
Historically it's been quite common to have this kind of jagged feature
adoption where implementations do not necessarily implement features in
the chronological order of their appearance in parquet-format. Just
because something is in parquet-format doesn't mean it will get wide
adoption.
(perhaps some Parquet readers still don't implement modular encryption,
for example? and let's not talk about INT96 timestamps or LZO
compression...)
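
To make the jagged-adoption point concrete, here is a minimal sketch. The
feature names, version numbers and the helper function are all made up for
illustration; none of this is a real Parquet API. It computes the highest
format version a reader could honestly advertise, and the features it
supports that such a single number cannot express.

```python
# Hypothetical data: feature name -> parquet-format version that introduced it.
# Names and version numbers are made up for illustration.
INTRODUCED_IN = {
    "feature_a": (2, 6),
    "feature_b": (2, 11),
    "feature_c": (2, 11),
    "feature_d": (2, 12),
}


def highest_fully_supported_version(supported, introduced_in):
    """Return the highest version whose features, and all earlier ones, are supported."""
    best = None
    for version in sorted(set(introduced_in.values())):
        required = {f for f, v in introduced_in.items() if v <= version}
        if not required <= supported:
            break
        best = version
    return best


# A reader with jagged adoption: it has a 2.12 feature but lacks one from 2.11.
reader_features = {"feature_a", "feature_b", "feature_d"}

advertised = highest_fully_supported_version(reader_features, INTRODUCED_IN)
covered = {
    f for f, v in INTRODUCED_IN.items()
    if advertised is not None and v <= advertised
}
# Features the reader handles that a single version number cannot express.
unadvertised = reader_features - covered

print(advertised)            # (2, 6)
print(sorted(unadvertised))  # ['feature_b', 'feature_d']
```

Under these assumptions the reader can only claim 2.6, even though it
handles features introduced in 2.11 and 2.12.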
2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
Parquet reader implements feature A but not feature B. What should this
reader do if you give it a file that has version 2.34 recorded in the
metadata? Should it error out (but perhaps the file only uses feature
A)? Or should it not error out (but perhaps the file uses feature B)?
I would suggest:
1. Basic readers: error out (simplest to code, and easiest to explain the
behavior, even though some readable files may be rejected), with a
user-defined "ignore version" field
2. Advanced readers: try to check the file for features in 2.34 that they
don't support (e.g. the use of the new ALP encoding) and error if present
Readers *already* error out when they encounter an unknown encoding in a
column they are asked to read. What do we gain by having them *also*
check a version number?
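
The two reader behaviours discussed above can be compared with a small
sketch. `FileInfo` and both check functions are hypothetical names, not part
of any Parquet library, and a real reader would discover the features used
while decoding rather than from a ready-made set. The scenario is the one
from point 2: version 2.34 introduces features A and B, and the reader
implements only A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical summary of a file's metadata, for illustration only."""
    version: tuple            # format version recorded in the metadata
    features_used: frozenset  # features the file actually relies on


def version_gate_accepts(info, max_version):
    """Basic reader: accept only files at or below a fully supported version."""
    return info.version <= max_version


def feature_check_accepts(info, supported):
    """Feature-level check: accept when every feature used is supported."""
    return info.features_used <= supported


# The reader implements A but not B, so it cannot claim full 2.34 support.
reader_supported = frozenset({"A"})
reader_max_version = (2, 33)

only_a = FileInfo((2, 34), frozenset({"A"}))
uses_b = FileInfo((2, 34), frozenset({"A", "B"}))

for name, info in [("only_a", only_a), ("uses_b", uses_b)]:
    print(
        name,
        version_gate_accepts(info, reader_max_version),
        feature_check_accepts(info, reader_supported),
    )
# only_a False True
# uses_b False False
```

In this sketch the version gate never accepts a file that the feature check
rejects; its only extra effect is to reject the readable file `only_a`.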
For more advanced use cases and readers without complete support, the writer
could do more nuanced research about what extra flags / features to enable.
This is the status quo, and it doesn't work well, as users generally
settle on the conservative defaults exposed by mainstream writers.
The problem I think we're trying to solve is to make it easier and safer
for users to enable modern features that produce more optimized and more
efficient Parquet files.
We can probably come up with other more precise ways to communicate
individual feature support (feature buckets, feature matrices, etc) but
they all seem complicated (and require non-trivial consensus on what
constitutes "major features", for example).
I agree with the "non-trivial consensus" problem, and that's the point
of calendar-based presets: they eschew the need for "non-trivial
consensus" as they are based on actual adoption. :-)
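
One possible reading of "based on actual adoption" can be sketched as
follows. The assumption here, which is not spelled out in this message, is
that a preset for a given date contains the features that every tracked
implementation supported by that date. The implementation names, feature
names and dates are invented.

```python
from datetime import date

# Hypothetical adoption data: implementation -> {feature: date support shipped}.
ADOPTION = {
    "reader_x": {"feature_a": date(2020, 1, 1), "feature_b": date(2023, 6, 1)},
    "reader_y": {"feature_a": date(2021, 3, 1), "feature_b": date(2025, 2, 1)},
    "reader_z": {"feature_a": date(2019, 5, 1)},
}


def preset_for(cutoff, adoption):
    """Features that every tracked implementation supported by the cutoff date."""
    per_implementation = [
        {feature for feature, shipped in features.items() if shipped <= cutoff}
        for features in adoption.values()
    ]
    if not per_implementation:
        return set()
    return set.intersection(*per_implementation)


print(sorted(preset_for(date(2020, 6, 1), ADOPTION)))  # []
print(sorted(preset_for(date(2022, 1, 1), ADOPTION)))  # ['feature_a']
print(sorted(preset_for(date(2026, 1, 1), ADOPTION)))  # ['feature_a']
```

With this definition nobody has to decide which features are "major": a
feature enters a preset once the tracked readers have shipped it, and
`feature_b` stays out for as long as `reader_z` lacks it.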
Regards
Antoine.