Re: [DISCUSS] Future of Parquet Versioning

Daniel Weeks Thu, 04 Jun 2026 16:00:10 -0700

Sorry everyone,

I created the document using a new account, and Google flagged it (probably
because many external accounts accessed the Google Doc).


I'm working to get it restored and if I can't, I'll post a new copy, but it
won't include the original comments.

-Dan

On Thu, Jun 4, 2026 at 3:50 PM Ed Seidl <[email protected]> wrote:

> On 2026/06/04 22:01:32 Andrew Bell wrote:
> > How can a reader know that it has the tooling to read a file with this
> > approach?
>
> At present there isn't an in-use mechanism beyond parsing the "created_by"
> string.
>
> > What is the hesitation to change version numbers?
>
> Which version number? The version number in the FileMetaData would sort of
> work, except in the case of an incompatible change made to the metadata. We
> could change the file magic from PAR1 to something else, but that is not
> workable beyond PAR9, say. Also, the file magic really shouldn't change
> frequently as that breaks tools like the unix "file" command.
>
> One thought I had, that should not break any current readers, would be to
> expand the header from 4 to 8 bytes, say. We could embed a version number in
> bytes 4-7. Writing a decimal 2026 perhaps (if we use calendar year only), or
> 202606. Or use SemVer, one byte each for major/minor/patch. Or make the
> header longer and embed a fixed-length, space- or null-padded string. This
> expanded header shouldn't break current readers since the offset for the
> first page should be obtained from the ColumnMetaData. If there are readers
> that rely on a page starting immediately after the 'PAR1', we could mandate
> that the first byte following PAR1 is 0. A thrift parser would see that as
> the end of the PageHeader struct and then likely fail on missing required
> fields.
>
> Ed
>
>
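
For illustration only, here is a minimal sketch of one of the header layouts
floated in the quoted message: the 4-byte magic, then a zero byte, then one
byte each for major/minor/patch. Nothing here is part of the Parquet format;
the byte order, the function names build_header and parse_header, and the
rule for detecting a legacy file are all assumptions made for the example.

```python
import struct

MAGIC = b"PAR1"


def build_header(major: int, minor: int, patch: int) -> bytes:
    """Build a hypothetical 8-byte header: magic, zero guard byte, SemVer bytes."""
    # The zero byte right after the magic is the guard suggested in the thread
    # for readers that assume a page starts immediately after 'PAR1'.
    return MAGIC + struct.pack("4B", 0, major, minor, patch)


def parse_header(buf: bytes):
    """Return (major, minor, patch), or None if this looks like a legacy file."""
    if buf[:4] != MAGIC:
        raise ValueError("not a Parquet file: bad magic")
    # Assumption: in a legacy file, byte 4 is the start of page data and is
    # non-zero, so a zero byte there marks the extended header.
    if len(buf) < 8 or buf[4] != 0:
        return None
    return tuple(buf[5:8])


header = build_header(2, 13, 0)
version = parse_header(header)
```

A reader written this way would treat a None result as "no version
information" and fall back to whatever it does today, such as inspecting
"created_by".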
