> Sure, it's conceptually nicer to have it within the schema, but do you see
> any concrete disadvantage besides conceptual clarity of just putting it
> into the existing key value metadata section?

Conceptual clarity actually seems like it is pretty nice. Are there technical
downsides to adding this even though workarounds exist?

Another use-case that has an open issue against it is adding a "description"
field for each column. I think there are three options here:

1. Define a well-known Key-Value in the file metadata for this purpose.
2. Add a specialized field for it.
3. Add key-value to schema elements.

On Sat, Nov 15, 2025 at 3:55 AM Andrew Lamb <[email protected]> wrote:

> In addition to putting additional data directly in the thrift metadata
> (either as key=value pairs or thrift fields), another approach is to store
> the information "inline" in the file's body and store only an offset to the
> information in the key=value metadata (this is the approach explained in
> this blog[1] for indexes, but it can be used to store any arbitrary bytes)
>
> Andrew
>
> [1]:
> https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
>
> On Fri, Nov 14, 2025 at 3:07 PM Andrew Bell <[email protected]> wrote:
>
> > On Sun, Nov 2, 2025 at 1:00 PM Jan Finis <[email protected]> wrote:
> >
> > > Note that you can already put such metadata into the footer by just
> > > putting it into the regular key-value metadata. Put a JSON array as
> > > value there with the same number of entries as the schema, then you
> > > have an implicit 1-to-1 mapping per column. We already use this to
> > > store per-column metadata and haven't encountered any problems with
> > > it so far.
> >
> > Of course you can put anything you want into a single metadata slot.
> > The hope is to have something that's sensible and semantically clear. An
> > advantage of the Thrift encoding is that adding structure entries doesn't
> > impact existing readers as they ignore values that they don't recognize.
> >
> > I think this is a free lunch proposal -- there is benefit and no harm.
> >
> > Here is another possibility: how about allowing extension of the Parquet
> > Thrift IDL in general by permitting all negative values in defined Structs
> > to be owned by users? There could be some registry if desired, but
> > something like this would allow users to add whatever data they like to
> > the existing metadata layout without impacting those using the standard
> > IDL. Although the Thrift IDL doc doesn't specify size for a Struct
> > identifier, the generated .tcc code uses a signed 16 bit value. This
> > should allow for plenty of additions to the accepted spec and user
> > additions as well. Again, there would be no impact to existing readers
> > or writers.
> >
> > --
> > Andrew Bell
> > [email protected]
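
For concreteness, the JSON-array workaround Jan describes (one key-value entry whose value is an array aligned with the schema) might look roughly like the sketch below. This is only an illustration in plain Python, not any Parquet library's API: the key name `column_descriptions`, the sample column names, and both helper functions are hypothetical, and the dict stands in for the file's key-value metadata.

```python
import json

# Hypothetical leaf column names, in schema order.
SCHEMA_COLUMNS = ["id", "price", "ts"]

# Hypothetical well-known key; the thread does not name one.
DESCRIPTIONS_KEY = "column_descriptions"


def encode_descriptions(columns, descriptions):
    """Build the key-value entry: a JSON array aligned with the schema order."""
    # Columns without a description become JSON null so positions stay aligned.
    values = [descriptions.get(name) for name in columns]
    return {DESCRIPTIONS_KEY: json.dumps(values)}


def decode_descriptions(columns, key_value_metadata):
    """Recover a name -> description mapping, checking the 1-to-1 invariant."""
    values = json.loads(key_value_metadata[DESCRIPTIONS_KEY])
    if len(values) != len(columns):
        raise ValueError("metadata array length does not match schema")
    return dict(zip(columns, values))


kv = encode_descriptions(
    SCHEMA_COLUMNS, {"id": "primary key", "ts": "event time, UTC"}
)
roundtrip = decode_descriptions(SCHEMA_COLUMNS, kv)
```

The length check is the only thing tying the array to the schema. The mapping is purely positional, so a writer that adds, drops, or reorders columns has to rewrite the array as well. Options 2 and 3 above would avoid that, since the value would live on the schema element itself.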
