Yes, I think validation methods would have to be enhanced to cover this.
Regards
Antoine.
On 07/05/2026 at 17:28, Rusty Conover wrote:
Hi Antoine,
Your idea seems very reasonable to me. Remember that record batches themselves
have custom_metadata that is also subject to this UTF-8-only restriction.
I'd imagine we could suggest that the validation methods in the various
implementations also check for this.
Rusty
On Thu, May 7, 2026, at 10:53 AM, Antoine Pitrou wrote:
Hello,
In https://github.com/apache/arrow/issues/49058 it was reported that
PyArrow (and therefore Arrow C++) happily accepts non-UTF8 metadata on
fields and schemas; however, that metadata is IPC-encoded as the
`string` type in Flatbuffers which is theoretically restricted to UTF8
(apparently, the Flatbuffers validator does not check for that).
Several questions ensue:
1) Should the C++ IPC writer - and potentially other implementations -
ensure that only valid UTF8 strings can be serialized as Flatbuffers
`string`s (which would apply not only to key-value metadata strings, but
also timezones and field names)?
2) Should Arrow C++ reject key-value metadata with non-UTF8 strings as
invalid, even if they are never serialized over IPC?
3) Should the C Data Interface recommend that type metadata keys and
values (*) be valid UTF8 as well?
(*)
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
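
For illustration, the kind of check that questions 1) and 2) imply could look like the minimal Python sketch below. It is not the Arrow C++ or PyArrow implementation: the function name `invalid_utf8_entries` is hypothetical, and it assumes the key-value metadata is available as a dict mapping bytes keys to bytes values.

```python
def invalid_utf8_entries(metadata):
    """Return the (key, value) pairs in which the key or the value is not valid UTF-8.

    Hypothetical helper for illustration only; `metadata` is assumed to be a
    dict mapping bytes keys to bytes values.
    """
    bad = []
    for key, value in metadata.items():
        for raw in (key, value):
            try:
                # Strict decoding raises on any byte sequence that is not valid UTF-8.
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((key, value))
                break  # report each offending pair only once
    return bad


if __name__ == "__main__":
    sample = {b"unit": b"caf\xc3\xa9", b"blob": b"\xff\xfe"}
    for key, value in invalid_utf8_entries(sample):
        print(f"non-UTF8 metadata entry: {key!r} -> {value!r}")
```

A writer or validation method could either reject the schema when this list is non-empty or only report it; which of the two is appropriate is exactly what the questions above leave open.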
Thanks
Antoine.