Hi Antoine,

Your idea seems very reasonable to me. Also remember that record batches themselves have custom_metadata that is subject to the same UTF-8-only restriction.

I'd imagine we could suggest that the validation methods in the various implementations also check for this.

Rusty

On Thu, May 7, 2026, at 10:53 AM, Antoine Pitrou wrote:
> Hello,
>
> In https://github.com/apache/arrow/issues/49058 it was reported that
> PyArrow (and therefore Arrow C++) happily accepts non-UTF8 metadata on
> fields and schemas; however, that metadata is IPC-encoded as the
> `string` type in Flatbuffers which is theoretically restricted to UTF8
> (apparently, the Flatbuffers validator does not check for that).
>
> Several questions ensue:
>
> 1) Should the C++ IPC writer - and potentially other implementations -
> ensure that only valid UTF8 strings can be serialized as Flatbuffers
> `string`s (which would apply not only to key-value metadata strings, but
> also timezones and field names)?
>
> 2) Should Arrow C++ reject key-value metadata with non-UTF8 strings as
> invalid, even if they are never serialized over IPC?
>
> 3) Should the C Data Interface recommend that type metadata keys and
> values (*) be valid UTF8 as well?
>
> (*)
> https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
>
> Thanks
>
> Antoine.

-- 
Rusty Conover
Query.Farm Founder
https://query.farm
