Besides putting additional data directly in the Thrift metadata (either
as key=value pairs or as Thrift fields), another approach is to store the
information "inline" in the file's body and keep only an offset to it in
the key=value metadata. This is the approach explained in this blog
post [1] for indexes, but it can be used to store any arbitrary bytes.
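
To make the idea concrete, here is a rough sketch in Python. It does not
write real Parquet: it uses a toy container that only mirrors the general
layout (magic, body, footer, 4-byte little-endian footer length, magic),
with JSON standing in for the Thrift footer. The magic value, the key names
my_blob_offset / my_blob_length, and both function names are made up for
illustration.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic for this toy format, not Parquet's


def write_file(data_pages: bytes, blob: bytes) -> bytes:
    """Write body + inline blob, recording only the blob's location in the footer."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    buf.write(data_pages)          # stand-in for the regular column data
    blob_offset = buf.tell()       # remember where the custom bytes start
    buf.write(blob)                # the blob lives in the body, not the footer
    # The key=value metadata holds an offset and a length, not the bytes.
    key_value = {
        "my_blob_offset": str(blob_offset),
        "my_blob_length": str(len(blob)),
    }
    footer = json.dumps(key_value).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))
    buf.write(MAGIC)
    return buf.getvalue()


def read_blob(file_bytes: bytes) -> bytes:
    """Locate the footer from the end of the file, then seek to the blob."""
    if file_bytes[-4:] != MAGIC:
        raise ValueError("not a toy file")
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_start = len(file_bytes) - 8 - footer_len
    key_value = json.loads(file_bytes[footer_start:footer_start + footer_len])
    offset = int(key_value["my_blob_offset"])
    length = int(key_value["my_blob_length"])
    return file_bytes[offset:offset + length]


file_bytes = write_file(b"column-data", b"\x00\x01custom index bytes")
blob = read_blob(file_bytes)
```

A reader that does not know about the two keys never touches the blob, and
the footer stays small no matter how large the blob gets.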

Andrew

[1]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

On Fri, Nov 14, 2025 at 3:07 PM Andrew Bell <[email protected]>
wrote:

> On Sun, Nov 2, 2025 at 1:00 PM Jan Finis <[email protected]> wrote:
>
> > Note that you can already put such metadata into the footer by just
> > putting it into the regular key-value metadata. Put a JSON array as
> > value there with the same number of entries as the schema, then you
> > have an implicit 1-to-1 mapping per column. We already use this to
> > store per-column metadata and haven't encountered any problems with
> > it so far.
> >
>
> Of course you can put anything you want into a single metadata slot.
> The hope is to have something that's sensible and semantically clear. An
> advantage of the Thrift encoding is that adding structure entries doesn't
> impact existing readers as they ignore values that they don't recognize.
>
> I think this is a free lunch proposal -- there is benefit and no harm.
>
> Here is another possibility: how about allowing extension of the Parquet
> Thrift IDL in general by permitting all negative values in defined Structs
> to be owned by users? There could be some registry if desired, but
> something like this would allow users to add whatever data they like to the
> existing metadata layout without impacting those using the standard IDL.
> Although the Thrift IDL doc doesn't specify size for a Struct identifier,
> the generated .tcc code uses a signed 16 bit value. This should allow for
> plenty of additions to the accepted spec and user additions as well. Again,
> there would be no impact to existing readers or writers.
>
> --
> Andrew Bell
> [email protected]
>
