Re: [DISCUSS] Future of Parquet Versioning

Fokko Driesprong Fri, 05 Jun 2026 05:07:15 -0700

Thanks, Dan, for the write-up; this is definitely something that we need to 
address. I'm able to see the doc again.

To echo what's being said: I very much agree with Andrew that we don't have the 
luxury of mandating adoption, which makes the implementation status matrix an 
important thing to have and to keep up to date.

I know an OSS engine that is still writing INT96, which was deprecated over 
8 years ago. I agree with Steve that we have to keep the readers if we want to 
maintain the promise that "we can read all your existing data".

One thing I dislike is putting a time-based constraint on features. The Parquet 
project has been around for a long time, with huge variation in activity, 
meaning some presets would contain a lot of new features, while others might 
contain very little. The same goes per feature: some features are massive, 
while others are relatively trivial.

We have a calendar-based release cadence on the website: 
https://parquet.apache.org/docs/contribution-guidelines/releasing/#release-cadence
But in practice a release is started when there is something to release.

My preference would be that, instead of thinking in terms of mythical/monolithic 
versions, we break things down into smaller chunks and release more often. Each 
OSS project tries to cram its latest feature into a release that has already 
started, but it would be better to get that release out of the door instead. Of 
course, depending on the feature, this requires enough AIs (authoritative 
implementations :-).

Kind regards,
Fokko

On 2026/06/05 01:37:46 Daniel Weeks wrote:
> Doc is back up, sorry for the interruption.
> 
> -Dan
> 
> On Thu, Jun 4, 2026 at 3:59 PM Daniel Weeks <[email protected]> wrote:
> 
> > Sorry everyone,
> >
> > I created the document using a new account, and Google flagged it
> > (probably because many external accounts accessed the Google Doc).
> >
> > I'm working to get it restored and if I can't, I'll post a new copy, but
> > it won't include the original comments.
> >
> > -Dan
> >
> > On Thu, Jun 4, 2026 at 3:50 PM Ed Seidl <[email protected]> wrote:
> >
> >> On 2026/06/04 22:01:32 Andrew Bell wrote:
> >> > How can a reader know that it has the tooling to read a file with this
> >> > approach?
> >>
> >> At present there isn't an in-use mechanism beyond parsing the
> >> "created_by" string.
> >>
> >> > What is the hesitation to change version numbers?
> >>
> >> Which version number? The version number in the FileMetaData would sort
> >> of work, except in the case of an incompatible change made to the
> >> metadata. We could change the file magic from PAR1 to something else,
> >> but that is not workable beyond PAR9, say. Also, the file magic really
> >> shouldn't change frequently as that breaks tools like the unix "file"
> >> command.
> >>
> >> One thought I had, that should not break any current readers, would be to
> >> expand the header from 4 to 8 bytes say. We could embed a version number
> >> in bytes 4-7. Writing a decimal 2026 perhaps (if we use calendar year
> >> only), or 202606. Or use SemVer, one byte each for major/minor/patch. Or
> >> make the header longer and embed a fixed-length, space or null padded
> >> string. This expanded header shouldn't break current readers since the
> >> offset for the first page should be obtained from the ColumnMetaData. If
> >> there are readers that rely on a page starting immediately after the
> >> 'PAR1', we could mandate that the first byte following PAR1 is 0. A
> >> thrift parser would see that as the end of the PageHeader struct and
> >> then likely fail on missing required fields.
> >>
> >> Ed
> >>
> >>
>
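
For illustration only, here is a rough Python sketch of one way to read Ed's 
expanded-header idea quoted above. It combines two of his options: a 0 guard 
byte right after PAR1, followed by SemVer with one byte each for 
major/minor/patch, so the header is 8 bytes in total. The function names and 
the exact byte layout are assumptions made for this sketch, not anything that 
has been proposed or agreed on, and treating a non-zero fifth byte as "legacy 
file" is likewise only a guess at how a reader might fall back.

```python
MAGIC = b"PAR1"


def build_header(major: int, minor: int, patch: int) -> bytes:
    """Build a hypothetical 8-byte header: PAR1, a 0 guard byte, then SemVer."""
    for part in (major, minor, patch):
        if not 0 <= part <= 255:
            raise ValueError("each version component must fit in one byte")
    # Byte 4 is the guard byte (always 0); bytes 5-7 carry major/minor/patch.
    return MAGIC + bytes([0, major, minor, patch])


def parse_header(buf: bytes):
    """Return (major, minor, patch), or None if no version header is present."""
    if buf[:4] != MAGIC:
        raise ValueError("not a Parquet file: bad magic")
    # Assumption: a file without the expanded header has a non-zero byte here
    # (the start of the first page), so a 0 marks the hypothetical new layout.
    if len(buf) < 8 or buf[4] != 0:
        return None
    return tuple(buf[5:8])
```

A reader along these lines would call parse_header on the first 8 bytes of the 
file and keep its current behaviour whenever the result is None.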
