I believe that anybody putting non-UTF8 strings in key/value metadata
might run into hard-to-track-down errors... even if it were ever
formally allowed, it would never really be a good idea for any degree
of interoperability. Base64 is always an option for any reasonable
amount of binary data one might want to put in field or schema
metadata. (Custom metadata for IPC messages might benefit from
skipping base64 encoding, but it's also easy to update the Flatbuffers
schema, as has already been proposed.)

I think the initial discussion about relaxing the requirement came
partially from me wanting to implement GeoArrow in C without
requiring a JSON parser, since something like yyjson is approximately
the size of the rest of the library combined. We went with JSON
anyway, I wrote a tiny JSON parser, and we moved on :)

Cheers,

-dewey

On Thu, May 7, 2026 at 11:50 AM Antoine Pitrou <[email protected]> wrote:
>
>
> Thanks for the pointer, Raphael. This in turn also refers to the
> following past discussion where Joris suggested relaxing the UTF8
> requirement:
> https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr
>
> However, two things have changed since then:
>
> 1) PyArrow has stopped putting non-UTF8 data in metadata when
> serializing extension types, because the use of `pickle` has been
> abandoned as hopelessly insecure.
>
> 2) All parametric canonical extension types use a (UTF8-encoded) JSON
> payload as serialization, making it less attractive to use a custom
> binary encoding for other (non-canonical) extension types.
>
> So I'm not sure there's still a need for arbitrary binary data in
> key-value pairs (though, of course, it might be a good idea if we were
> starting over and redesigning Arrow).
>
> Regards
>
> Antoine.
>
>
>
> Le 07/05/2026 à 18:37, Raphael Taylor-Davies a écrit :
> > Hi All,
> >
> > One thing to perhaps be aware of is that pyarrow at least used to
> > produce non-UTF8 data in the metadata [1].
> >
> > This was actually reported as a bug in arrow-rs, which validates this [2].
> >
> > Kind Regards,
> >
> > Raphael Taylor-Davies
> >
> > [1]: https://github.com/apache/arrow/issues/20107
> > [2]: https://github.com/apache/arrow-rs/issues/5547
> >
> > On 07/05/2026 16:33, Antoine Pitrou wrote:
> >>
> >> Yes, I think validation methods would have to be enhanced to cover this.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 07/05/2026 à 17:28, Rusty Conover a écrit :
> >>> Hi Antoine,
> >>>
> >>> Your idea seems very reasonable to me. Also, remember that record
> >>> batches themselves have custom_metadata, which is also subject to
> >>> this UTF-8-only restriction.
> >>>
> >>> I'd imagine we could suggest that the validation methods in various
> >>> implementations also check for this.
> >>>
> >>> Rusty
> >>>
> >>> On Thu, May 7, 2026, at 10:53 AM, Antoine Pitrou wrote:
> >>>> Hello,
> >>>>
> >>>> In https://github.com/apache/arrow/issues/49058 it was reported that
> >>>> PyArrow (and therefore Arrow C++) happily accepts non-UTF8 metadata on
> >>>> fields and schemas; however, that metadata is IPC-encoded as the
> >>>> `string` type in Flatbuffers which is theoretically restricted to UTF8
> >>>> (apparently, the Flatbuffers validator does not check for that).
> >>>>
> >>>> Several questions ensue:
> >>>>
> >>>> 1) Should the C++ IPC writer - and potentially other implementations -
> >>>> ensure that only valid UTF8 strings can be serialized as Flatbuffers
> >>>> `string`s (which would apply not only to key-value metadata strings,
> >>>> but
> >>>> also timezones and field names)?
> >>>>
> >>>> 2) Should Arrow C++ reject key-value metadata with non-UTF8 strings as
> >>>> invalid, even if they are never serialized over IPC?
> >>>>
> >>>> 3) Should the C Data Interface recommend that type metadata keys and
> >>>> values (*) be valid UTF8 as well?
> >>>>
> >>>> (*)
> >>>> https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
> >>>>
> >>>>
> >>>> Thanks
> >>>>
> >>>> Antoine.
> >>>
> >>
>
