Sorry for the duplicate, but Micah said the quote blocks didn't go through so I'm re-sending with `>` so that this is more readable.
This is quite a large thread, so hopefully I am not missing any big points that have already been settled.

If I understand correctly, it looks like there are some good things that we agree on. The most consequential is that we want to bundle features together. From Antoine’s response to my email about presets, I came away with the clarification that a preset acts as a bundle of features much like a version. This is a big step forward because it would be such a challenge for users to reason about every feature individually and to check support for features across implementations. The worst outcome, in my opinion, would be leaving users or administrators to deal with a wild west of feature flags, so I’m glad to see we’re making progress!

This just leaves the details of how we want to manage those feature bundles. The preset option is a mechanistic approach to inclusion, while the version option relies on building consensus for features to include. I think this is a good question to focus on right now because I think we have yet to come to a shared understanding of both options in order to discuss the trade-offs between them.

With the rest of this email, I’ll try to address some confusion I saw in the thread about how versions and/or presets would work.

One good thing to note is that we’re primarily talking about forward-incompatible changes: changes that would cause older readers to fail and/or read data incorrectly.

> Antoine: you also don’t know what’s in a version until the version gets decided upon

One of the strengths is that you do get to know what’s in a version ahead of time. The process is deliberate and predictable. When we add a new (forward-incompatible) feature it automatically goes into the next format version. I think that we would do this using a vote so that everyone here gets to take part in the decision.
The features that we have agreed on are documented in the format for that version and when we want to close the version we have another vote to adopt it. Then new changes go into the next version. This procedure ensures that we agree, via community consensus, on what goes into a version and we accumulate a list that is predictable for implementations to target.

> Antoine: IIRC the basis for this discussion was to inform Parquet writers about which features can safely be enabled.

I think this understates the problem and I prefer Andrew’s summary, which is that we need a way for readers and writers to coordinate about this problem. That’s why writer flags alone are untenable: the number of things to coordinate is so large.

There’s also a lot more to it than “can produce bits for X”. This gets to the hypothetical posed about Parquet 2.34. There are forward-incompatible features that appear readable by older clients but cause them to produce incorrect data. For instance, I could add a field to a page header that gives an offset that should be added to all values in a page, so that we can pack values in smaller bit widths. Older readers would skip the offset and produce bad values. That’s one reason why older readers should fail for newer feature bundles. Clearly documenting reader responsibilities is a big part of this work as well.

If we assume that we will have readers that will fail (which we could do), we have to design features that force them to fail. And then we have to deal with bad error messages to users. So thinking through how our system of feature bundles works is really important, not just for collecting sets of writer flags.

> There is no easy-to-read list of changes unless I am missing something.

Yes, part of the problem is that the table of features was removed, which is a big part of what caused the current confusion about what v2 is. But this is a problem for both ways to manage version bundles, right?
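
To make the page-offset example above concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Page` class, the `format_version` and `value_offset` fields, and the reader functions are assumptions of mine, not part of parquet-format or any implementation. It shows an older reader that ignores the new field silently returning wrong values, while a reader that checks the feature bundle first fails with a clear message.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical: the newest feature bundle (version) the "old" reader knows.
OLD_READER_MAX_VERSION = 1


@dataclass
class Page:
    """Toy stand-in for a data page; not the real Parquet page header."""
    format_version: int                 # hypothetical feature-bundle marker
    packed_values: List[int]            # values as stored in the page
    value_offset: Optional[int] = None  # hypothetical new header field


def write_page_with_offset(values: List[int]) -> Page:
    """Store values relative to their minimum so they need fewer bits."""
    base = min(values)
    return Page(format_version=2,
                packed_values=[v - base for v in values],
                value_offset=base)


def new_reader(page: Page) -> List[int]:
    """Knows about the offset field and adds it back to every value."""
    offset = page.value_offset or 0
    return [v + offset for v in page.packed_values]


def old_reader_lenient(page: Page) -> List[int]:
    """Predates the offset field, so it skips it and returns bad values."""
    return list(page.packed_values)


def old_reader_strict(page: Page) -> List[int]:
    """Same old reader, but it refuses bundles it does not recognize."""
    if page.format_version > OLD_READER_MAX_VERSION:
        raise ValueError(
            f"unsupported format version {page.format_version}; "
            f"this reader supports up to {OLD_READER_MAX_VERSION}")
    return list(page.packed_values)


if __name__ == "__main__":
    original = [1000, 1003, 1001]
    page = write_page_with_offset(original)
    print("new reader:        ", new_reader(page))
    print("old lenient reader:", old_reader_lenient(page))
    try:
        old_reader_strict(page)
    except ValueError as err:
        print("old strict reader: ", err)
```

The lenient reader never raises, which is the dangerous case: nothing tells the user the data is wrong. The strict reader trades that for a failure that names the version gap.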
> 2) Let’s say Parquet 2.34 introduces features A and B. Let’s also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?

I mentioned this above, but I think this affects both strategies for bundling features and doesn’t really distinguish between them. I will note, though, that we have a clear rule about this in the Iceberg community that works well for versions. Readers must fail if they don’t recognize a version. If a reader knows about a version and has a feature gap, it can look for the missing feature and fail with a good error message but otherwise proceed. The main thing is that the feature bundle (version in Iceberg) is understood and is the primary way the group is coordinated.

> Historically it’s been quite common to have this kind of jagged feature adoption where implementations do not necessarily implement features in the chronological order of their appearance in parquet-format.

This is something that we are addressing with this discussion! The goal is a way to coordinate between readers and writers, right?

> Readers already error out when they encounter an unknown encoding in a column they are asked to read. What do we gain by having them also check a version number?

You cannot guarantee compatibility with reader failures alone, and you often want better handling of missing features than whatever failure happens to occur. I think Dan’s doc has a good section on this.

Ryan

On Thu, Jun 11, 2026 at 1:52 AM Antoine Pitrou wrote:

> On 10/06/2026 at 16:40, Micah Kornfield wrote:
> >
> >> In any case, this does not seem to be solving the problem of "as a user,
> >> how do I enable features safely".
> >
> > Can you elaborate?
> > Every feature listed after 2023 has the year it was
> > introduced in parentheses next to it. I think this, in addition to
> > the table showing the version that everything was supported in, can give a
> > user a pretty good idea of what might be safe
>
> Ok, so concretely, what is a user supposed to do with these tables?
>
> I'm sorry for being so stubborn and insistent, but Parquet files are
> produced routinely by data scientists and other people with no expert
> knowledge of Parquet internals.
>
> If "how to produce an optimized Parquet file" takes an entire paragraph
> to explain and requires diving into tables of features, then we haven't
> solved the problem.
>
> (also, even I don't know what to do with the information of "Arrow C++
> does not support 2025 features": what does it bring to the reader?)
>
> Regards
>
> Antoine.
