On 09/06/2026 at 19:19, Andrew Lamb wrote:
I don't understand how it's useful:
1) At this point it's too late, the Parquet file was written already, so
this is not solving the user's problem of "how do I choose a safe
feature set".

In my mind, the format version is exactly a shared vocabulary for readers
and writers to agree on a safe feature set.

For example, if a writer wants to ensure Spark 4.0 can read their files (I
am making up version numbers), they look up and find that Spark supports
the features in parquet-format 2.11, and restrict themselves to just those
features.

What if Spark supports some features from 2.12, but doesn't support all the features from 2.11 (or even 2.6), for example?

Historically it's been quite common to have this kind of jagged feature adoption where implementations do not necessarily implement features in the chronological order of their appearance in parquet-format. Just because something is in parquet-format doesn't mean it will get wide adoption.
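To make the jagged-adoption point concrete, here is a minimal sketch. The feature names and the versions they are mapped to are made up for illustration (they are not real parquet-format assignments), and `highest_fully_supported_version` is a hypothetical helper. It shows that a reader with a jagged feature set can only advertise the highest version it supports completely, which hides the newer features it does implement.

```python
# Hypothetical mapping: feature -> parquet-format version that introduced it.
# Names and version numbers are invented for this illustration.
FEATURE_INTRODUCED_IN = {
    "feature_x": (2, 6),
    "feature_y": (2, 11),
    "feature_z": (2, 12),
}


def highest_fully_supported_version(supported):
    """Return the highest version V such that every feature introduced
    at or before V is in `supported`, or None if there is no such version."""
    best = None
    for version in sorted(set(FEATURE_INTRODUCED_IN.values())):
        needed = {f for f, v in FEATURE_INTRODUCED_IN.items() if v <= version}
        if needed <= supported:
            best = version
        else:
            break  # a gap: no later version can be fully supported
    return best


# Jagged adoption: the reader has a 2.12 feature but lacks a 2.11 one.
reader_supports = {"feature_x", "feature_z"}
advertised = highest_fully_supported_version(reader_supports)

# Features the reader implements that a single version number cannot express.
hidden = {
    f for f in reader_supports
    if advertised is None or FEATURE_INTRODUCED_IN[f] > advertised
}
```

In this made-up case the reader can only claim 2.6, and a writer relying on that number would never enable `feature_z` even though the reader handles it.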

(perhaps some Parquet readers still don't implement modular encryption, for example? and let's not talk about INT96 timestamps or LZO compression...)

2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
Parquet reader implements feature A but not feature B. What should this
reader do if you give it a file that has version 2.34 recorded in the
metadata? Should it error out (but perhaps the file only uses feature
A)? Or should it not error out (but perhaps the file uses feature B)?

I would suggest:
1. Basic readers: error out (simplest to code, and easiest to explain the
behavior, even though some readable files may be rejected), with a
user-defined "ignore version" field
2. Advanced readers: try to check the file for features in 2.34 that they
don't support (e.g. the use of the new ALP encoding) and error if present

Readers *already* error out when they encounter an unknown encoding in a column they are asked to read. What do we gain by having them *also* check a version number?
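The two behaviors being compared can be sketched as follows. This is only an illustration under stated assumptions: `FileInfo`, `READER_MAX_VERSION`, `READER_ENCODINGS` and the `FEATURE_A` / `FEATURE_B` encoding names are hypothetical, and the version numbers are the made-up ones from the example above. The reader implements feature A but not feature B, both introduced in 2.34.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    version: tuple        # version recorded in the file metadata
    encodings: frozenset  # encodings used by the columns being read


# Hypothetical reader: fully supports everything up to 2.33, plus feature A.
READER_MAX_VERSION = (2, 33)
READER_ENCODINGS = frozenset({"PLAIN", "FEATURE_A"})


def readable_by_version(info, ignore_version=False):
    """Version gate: accept only files whose recorded version is fully supported."""
    return ignore_version or info.version <= READER_MAX_VERSION


def readable_by_features(info):
    """Feature check: accept any file that uses only encodings the reader knows."""
    return info.encodings <= READER_ENCODINGS


uses_a = FileInfo(version=(2, 34), encodings=frozenset({"PLAIN", "FEATURE_A"}))
uses_b = FileInfo(version=(2, 34), encodings=frozenset({"PLAIN", "FEATURE_B"}))

results = {
    "uses_a": (readable_by_version(uses_a), readable_by_features(uses_a)),
    "uses_b": (readable_by_version(uses_b), readable_by_features(uses_b)),
}
```

Under these assumptions the version gate rejects both files, including the one the reader could actually decode, while the feature check alone already rejects exactly the file that uses feature B.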

For more advanced use cases and readers without complete support, the writer
could do more nuanced research about what extra flags / features to enable

This is the status quo, and it doesn't work well, as users generally settle on the conservative defaults exposed by mainstream writers.

The problem I think we're trying to solve is to make it easier and safer for users to enable modern features that produce more optimized and more efficient Parquet files.

We can probably come up with other, more precise ways to communicate
individual feature support (feature buckets, feature matrices, etc.) but
they all seem complicated (and require non-trivial consensus on what
constitutes "major features", for example)

I agree with the "non-trivial consensus" problem, and that's the point of calendar-based presets: they eschew the need for "non-trivial consensus" as they are based on actual adoption. :-)

Regards

Antoine.

