>
> Sure, it's conceptually nicer to have it
> within the schema, but do you see any concrete disadvantage besides
> conceptual clarity of just putting it into the existing key value metadata
> section?


Conceptual clarity actually seems pretty valuable to me. Are there
technical downsides to adding this, even though workarounds exist? Another
use case that has an open issue against it is adding a "description" field
for each column. I think there are three options here:
1.  Define a well-known Key-Value in the file metadata for this purpose.
2.  Add a specialized field for it.
3.  Add key-value to schema elements.
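
To make option 1 concrete, here is a rough sketch of the positional JSON
array approach Jan describes below, in plain Python. The key name
"column_descriptions" and both helper functions are my own placeholders,
not anything defined by the Parquet spec, and ordinary dicts stand in for
the footer's key-value metadata:

```python
import json

# Hypothetical well-known key; the Parquet spec does not define it.
COLUMN_DESCRIPTIONS_KEY = "column_descriptions"


def encode_descriptions(column_names, descriptions):
    """Build a key-value entry with one JSON array slot per column, in schema order."""
    values = [descriptions.get(name) for name in column_names]
    return {COLUMN_DESCRIPTIONS_KEY: json.dumps(values)}


def decode_descriptions(column_names, key_value_metadata):
    """Map the JSON array back onto the columns by position."""
    raw = key_value_metadata.get(COLUMN_DESCRIPTIONS_KEY)
    if raw is None:
        return {}
    values = json.loads(raw)
    # The mapping is only implicit, so the reader has to check the length itself.
    if len(values) != len(column_names):
        raise ValueError("description array does not match the schema length")
    return {name: v for name, v in zip(column_names, values) if v is not None}


columns = ["id", "price"]
kv = encode_descriptions(columns, {"price": "Unit price in USD"})
round_tripped = decode_descriptions(columns, kv)
```

The length check is the part that bothers me: nothing ties the array to the
schema except position, so a writer that reorders or drops a column has to
remember to rewrite the array too. Options 2 and 3 avoid that by
construction.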



On Sat, Nov 15, 2025 at 3:55 AM Andrew Lamb <[email protected]> wrote:

> In addition to putting additional data directly in the thrift metadata
> (either as key=value pairs or thrift fields), another approach is to store
> the information "inline" in the file's body and store only an offset to the
> information in the key=value metadata (this is the approach explained in
> this blog[1] for indexes, but it can be used to store any arbitrary bytes)
>
> Andrew
>
> [1]:
> https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
>
> On Fri, Nov 14, 2025 at 3:07 PM Andrew Bell <[email protected]>
> wrote:
>
> > On Sun, Nov 2, 2025 at 1:00 PM Jan Finis <[email protected]> wrote:
> >
> > > Note that you can already put such metadata into the footer by just
> > > putting it into the regular key-value metadata. Put a JSON array as
> > > value there with the same number of entries as the schema, then you
> > > have an implicit 1-to-1 mapping per column. We already use this to
> > > store per-column metadata and haven't encountered any problems with
> > > it so far.
> > >
> >
> > Of course you can put anything you want into a single metadata slot.
> > The hope is to have something that's sensible and semantically clear. An
> > advantage of the Thrift encoding is that adding structure entries doesn't
> > impact existing readers as they ignore values that they don't recognize.
> >
> > I think this is a free lunch proposal -- there is benefit and no harm.
> >
> > Here is another possibility: how about allowing extension of the Parquet
> > Thrift IDL in general by permitting all negative values in defined
> > Structs to be owned by users? There could be some registry if desired,
> > but something like this would allow users to add whatever data they like
> > to the existing metadata layout without impacting those using the
> > standard IDL. Although the Thrift IDL doc doesn't specify size for a
> > Struct identifier, the generated .tcc code uses a signed 16 bit value.
> > This should allow for plenty of additions to the accepted spec and user
> > additions as well. Again, there would be no impact to existing readers
> > or writers.
> >
> > --
> > Andrew Bell
> > [email protected]
> >
>
