Hello,

In https://github.com/apache/arrow/issues/49058 it was reported that PyArrow (and therefore Arrow C++) happily accepts non-UTF8 metadata on fields and schemas. However, that metadata is IPC-encoded using the Flatbuffers `string` type, which is theoretically restricted to UTF8 (apparently, the Flatbuffers validator does not check for that).

Several questions ensue:

1) Should the C++ IPC writer - and potentially other implementations - ensure that only valid UTF8 strings can be serialized as Flatbuffers `string`s (which would apply not only to key-value metadata strings, but also timezones and field names)?

2) Should Arrow C++ reject key-value metadata with non-UTF8 strings as invalid, even if they are never serialized over IPC?

3) Should the C Data Interface recommend that type metadata keys and values (*) be valid UTF8 as well?

(*) https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
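
For concreteness, the kind of check that questions 1) and 2) imply could look roughly like the Python sketch below. The helper name `non_utf8_entries` is hypothetical (it is not an Arrow API), and the sketch assumes key-value metadata is available as a mapping from bytes to bytes. It only illustrates strict UTF8 validation, not where such a check would live in the C++ writer.

```python
def non_utf8_entries(metadata):
    """Return the (key, value) pairs where either side is not valid UTF-8.

    `metadata` is assumed to be a mapping from bytes to bytes.
    """
    bad = []
    for key, value in metadata.items():
        for raw in (key, value):
            try:
                # Strict decoding raises on any invalid byte sequence.
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((key, value))
                break  # report each pair at most once
    return bad


# Illustrative data only: one valid entry, one with an invalid value.
metadata = {b"origin": b"sensor-1", b"blob": b"\xff\xfe\x00"}
print(non_utf8_entries(metadata))
```

An implementation could either reject such entries outright or only refuse to serialize them over IPC, which is exactly the distinction between questions 1) and 2).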

Thanks

Antoine.

