I wonder if we might take a step back and try to summarize what people see
as requirements before debating any particular approach. Hopefully we can
align on these and then discuss different methods of solving them and the
relative trade-offs. I'll try to summarize my take-aways on implicit
requirements from the thread so far, and some thoughts of my own (note I'm
using SHOULD for most things as a starting point, but I understand others
might want them to be MUST, MAY or non-requirements):
1. A parquet writer MUST never produce a file that any version of a parquet
reader could misinterpret.
Possible Implications:
- We can never just stop writing CONVERTED TYPE; there must be
something else that old readers recognize and that makes the file
unreadable to them.
2. A parquet reader SHOULD always be able to provide a detailed error
message that allows a user to understand if and to what version of a
parquet library they need to upgrade to.
Possible implications:
- Marking path_in_schema as optional and not populating it likely
cannot be done without something else that signals to older readers that
its absence is not an obscure Thrift-parsing error.
- Readers that partially implement the specification should be able
to give a detailed error message about the part missing in their
implementation that prevented them from reading the file.
Note: Most features (encodings, logical types, compression)
already provide such a mechanism via enums.
3. The parquet specification SHOULD provide a "standard high-level
vocabulary" for implementations to express which Parquet features they
support.
Motivation:
- When reading a parquet file fails, end users should have an easy
time understanding what they need to upgrade to.
4. Implementations SHOULD provide a mechanism that spares end users from
micromanaging which features they enable, by relying on the "standard
high-level vocabulary".
5. The parquet community should provide a clear picture to end-users and
implementers on the current state of implementation compatibility to let
consumers decide appropriate compatibility risk levels for turning on
features.
- Corollary: The parquet specification release process SHOULD allow
users who want the best features of Parquet to use them as quickly as
possible, with the understanding that other implementations will
eventually be compatible.
6. The parquet specification release process SHOULD try to reduce redundant
processes and subjectiveness.
Possible implications:
- Have clear decision criteria on when and what to release in the
specification
- Reduce multiple levels of releases.
7. The parquet specification SHOULD make very clear which changes
introduce compatibility risks.
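
To make requirements 2 and 3 a bit more concrete, here is a minimal sketch
in Python of how a reader might turn a feature mismatch into an actionable
error. Everything in it is hypothetical: the feature names, the
FEATURE_MIN_VERSION table, the Reader class and the version numbers are
invented for illustration and do not come from the Parquet specification or
any existing library.

```python
# Hypothetical sketch: feature names and version numbers are made up.
from dataclasses import dataclass

# A stand-in for the "standard high-level vocabulary": feature name mapped
# to the first format version (major, minor) that defines it.
FEATURE_MIN_VERSION = {
    "ENCODING_A": (2, 9),
    "LOGICAL_TYPE_C": (2, 11),
    "ENCODING_B": (2, 34),
}


class UnsupportedFeatureError(Exception):
    """Raised when a file needs features this reader does not implement."""


@dataclass(frozen=True)
class Reader:
    name: str
    supported: frozenset

    def check(self, file_features):
        """Raise a detailed error if the file uses unsupported features."""
        missing = sorted(set(file_features) - self.supported)
        if not missing:
            return
        # Features absent from the table are still named in the error,
        # but cannot contribute to the version hint.
        known = [FEATURE_MIN_VERSION[f] for f in missing
                 if f in FEATURE_MIN_VERSION]
        hint = ""
        if known:
            major, minor = max(known)
            hint = (f"; upgrade to a reader that implements format "
                    f"{major}.{minor} or later")
        raise UnsupportedFeatureError(
            f"{self.name} cannot read this file: unsupported features "
            f"{', '.join(missing)}{hint}"
        )


reader = Reader("example-reader", frozenset({"ENCODING_A", "LOGICAL_TYPE_C"}))
reader.check(["ENCODING_A"])  # every feature is supported, no error

try:
    reader.check(["ENCODING_A", "ENCODING_B"])
except UnsupportedFeatureError as exc:
    message = str(exc)
    print(message)
```

The point of the sketch is only that the error names the missing feature
and a version to look for, rather than surfacing as a low-level parsing
failure. How the set of features used by a file would actually be
discovered is left open here.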
Did I miss anything? We can see what feedback is like here, but
ultimately I think we should move this to a Google Doc that can cover
concrete proposals and how they map to the requirements.
Hope this helps.
-Micah
On Tue, Jun 9, 2026 at 10:20 AM Andrew Lamb <[email protected]> wrote:
> > I don't understand how it's useful:
> > 1) At this point it's too late, the Parquet file was written already, so
> > this is not solving the user's problem of "how do I choose a safe
> > feature set".
>
> In my mind, the format version is exactly a shared vocabulary for readers
> and writers to agree on a safe feature set.
>
> For example if a writer wants to ensure Spark 4.0 can read their files, (I
> am making up version numbers), they look up and find that spark supports
> features in parquet-format 2.11 and restrict themselves to just those
> features.
>
> > 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> > Parquet reader implements feature A but not feature B. What should this
> > reader do if you give it a file that has version 2.34 recorded in the
> > metadata? Should it error out (but perhaps the file only uses feature
> > A)? Or should it not error out (but perhaps the file uses feature B)?
>
> I would suggest:
> 1. Basic readers: error out (simplest to code, and easiest to explain the
> behavior, even though some readable files may be rejected), with a user
> defined "ignore version" field
> 2. Advanced readers: try and check the file for features in 2.34 that it
> doesn't support (e.g. the use of the new ALP encoding) and error if present
>
> In this way, if a reader advertises it supports version 2.8 of the spec,
> then writers can use any of those features, and there is no confusion about
> read compatibility. I agree this is a coarse system, and may mean the
> features in some readers may not be used.
>
> For more advanced use cases and readers without complete support, the writer
> could do more nuanced research about what extra flags / features to enable
>
> We can probably come up with other more precise ways to communicate
> individual feature support (feature buckets, feature matrices, etc) but
> they all seem complicated (and require non trivial consensus on what
> constitutes "major features", for example)
>
> Andrew
>
> On Tue, Jun 9, 2026 at 12:46 PM Antoine Pitrou <[email protected]> wrote:
>
> > Le 09/06/2026 à 18:28, Andrew Lamb a écrit :
> > >
> > >> Aren't we moving the goalposts here?
> > >> IIRC the basis for this discussion was to inform Parquet *writers* about
> > >> which features can safely be enabled. Recording the format version in a
> > >> Parquet file's metadata does not help achieve that.
> > >
> > > In my mind they are connected -- recording the format in the metadata
> > > would allow writers to explicitly communicate to downstream readers which
> > > features are required for reading,
> >
> > I don't understand how it's useful:
> >
> > 1) At this point it's too late, the Parquet file was written already, so
> > this is not solving the user's problem of "how do I choose a safe
> > feature set".
> >
> > 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> > Parquet reader implements feature A but not feature B. What should this
> > reader do if you give it a file that has version 2.34 recorded in the
> > metadata? Should it error out (but perhaps the file only uses feature
> > A)? Or should it not error out (but perhaps the file uses feature B)?
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>