Hi Antoine,

Your idea seems very reasonable to me. Also remember that record batches
themselves have custom_metadata, which is subject to the same UTF-8-only
restriction.

I'd imagine we could suggest that the validation methods in the various
implementations also check for this.
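
As a rough illustration of the kind of check I have in mind, here is a
minimal sketch in plain Python. It assumes the metadata is available as a
mapping of bytes keys to bytes values; the function name
find_non_utf8_metadata is hypothetical and is not part of any Arrow
implementation.

```python
from typing import Dict, List, Optional


def find_non_utf8_metadata(metadata: Optional[Dict[bytes, bytes]]) -> List[str]:
    """Return one message per metadata key or value that is not valid UTF-8.

    Hypothetical helper for illustration only; not an Arrow API.
    """
    problems: List[str] = []
    for key, value in (metadata or {}).items():
        for label, raw in (("key", key), ("value", value)):
            try:
                # Strict decoding raises UnicodeDecodeError on invalid bytes.
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append(
                    f"metadata {label} {raw!r} is not valid UTF-8: {exc.reason}"
                )
    return problems
```

A validation method could run a check like this over schema, field, and
record batch custom_metadata and report any messages it returns.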

Rusty

On Thu, May 7, 2026, at 10:53 AM, Antoine Pitrou wrote:
> Hello,
>
> In https://github.com/apache/arrow/issues/49058 it was reported that 
> PyArrow (and therefore Arrow C++) happily accepts non-UTF8 metadata on 
> fields and schemas; however, that metadata is IPC-encoded as the 
> `string` type in Flatbuffers which is theoretically restricted to UTF8 
> (apparently, the Flatbuffers validator does not check for that).
>
> Several questions ensue:
>
> 1) Should the C++ IPC writer - and potentially other implementations - 
> ensure that only valid UTF8 strings can be serialized as Flatbuffers 
> `string`s (which would apply not only to key-value metadata strings, but 
> also timezones and field names)?
>
> 2) Should Arrow C++ reject key-value metadata with non-UTF8 strings as 
> invalid, even if they are never serialized over IPC?
>
> 3) Should the C Data Interface recommend that type metadata keys and 
> values (*) be valid UTF8 as well?
>
> (*) 
> https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
>
> Thanks
>
> Antoine.

-- 
Rusty Conover
Query.Farm Founder
https://query.farm
