Besides putting extra data directly in the Thrift metadata (either as key=value pairs or as Thrift fields), another approach is to store the information "inline" in the file's body and store only an offset to it in the key=value metadata. This is the approach explained in this blog post [1] for indexes, but it can be used to store any arbitrary bytes.
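To make the layout concrete, here is a rough sketch of the idea in plain Python. It is not real Parquet: the footer is JSON standing in for the Thrift FileMetaData, and the key names `my_app.blob_offset` and `my_app.blob_length` are made up for illustration. Storing a length next to the offset is also my assumption; the blob could just as well be self-describing. Only the overall shape (magic, body, footer, 4-byte little-endian footer length, magic) mirrors the real format.

```python
import io
import json
import struct

MAGIC = b"PAR1"  # Parquet's magic bytes; everything else here is a toy layout


def write_file(column_data: bytes, blob: bytes) -> bytes:
    """Write a toy file whose body holds `blob` and whose footer only points at it."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    buf.write(column_data)  # stand-in for the row groups / column chunks

    blob_offset = buf.tell()
    buf.write(blob)  # arbitrary user bytes, stored inline in the file body

    # Only the location goes into the key=value metadata.
    # Values are strings, as in Parquet's key-value metadata; key names are hypothetical.
    key_value_metadata = {
        "my_app.blob_offset": str(blob_offset),
        "my_app.blob_length": str(len(blob)),
    }
    # JSON stands in for the Thrift-encoded footer in this sketch.
    footer = json.dumps({"key_value_metadata": key_value_metadata}).encode("utf-8")

    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # 4-byte little-endian footer length
    buf.write(MAGIC)
    return buf.getvalue()


def read_blob(data: bytes) -> bytes:
    """Locate the inline blob by reading the offset out of the footer metadata."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a toy file written by write_file")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer = json.loads(data[-8 - footer_len:-8])
    kv = footer["key_value_metadata"]
    offset = int(kv["my_app.blob_offset"])
    length = int(kv["my_app.blob_length"])
    return data[offset:offset + length]


if __name__ == "__main__":
    contents = write_file(b"\x00" * 16, b"my custom index bytes")
    print(read_blob(contents))
```

A reader that does not know about the two keys never follows the offset, so it simply skips over the extra bytes in the body.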
Andrew

[1]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

On Fri, Nov 14, 2025 at 3:07 PM Andrew Bell <[email protected]> wrote:

> On Sun, Nov 2, 2025 at 1:00 PM Jan Finis <[email protected]> wrote:
>
> > Note that you can already put such metadata into the footer by just putting
> > it into the regular key-value metadata. Put a JSON array as value there
> > with the same number of entries as the schema, then you have an implicit
> > 1-to-1 mapping per column. We already use this to store per-column metadata
> > and haven't encountered any problems with it so far.
> >
>
> Of course you can put anything you want into a single metadata slot.
> The hope is to have something that's sensible and semantically clear. An
> advantage of the Thrift encoding is that adding structure entries doesn't
> impact existing readers as they ignore values that they don't recognize.
>
> I think this is a free lunch proposal -- there is benefit and no harm.
>
> Here is another possibility: how about allowing extension of the Parquet
> Thrift IDL in general by permitting all negative values in defined Structs
> to be owned by users? There could be some registry if desired, but
> something like this would allow users to add whatever data they like to the
> existing metadata layout without impacting those using the standard IDL.
> Although the Thrift IDL doc doesn't specify size for a Struct identifier,
> the generated .tcc code uses a signed 16 bit value. This should allow for
> plenty of additions to the accepted spec and user additions as well. Again,
> there would be no impact to existing readers or writers.
>
> --
> Andrew Bell
> [email protected]
>
