[Discuss] Field/schema/custom metadata restriction to UTF8?

Antoine Pitrou Thu, 07 May 2026 09:50:56 -0700

Thanks for the pointer, Raphael. This in turn also refers to the following past discussion where Joris suggested relaxing the UTF8 requirement:

https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr

However, two things have changed since then:

1) PyArrow has stopped putting non-UTF8 data in metadata when serializing extension types, because the use of `pickle` has been abandoned as hopelessly insecure.

2) All parametric canonical extension types use a (UTF8-encoded) JSON payload as serialization, making it less attractive to use a custom binary encoding for other (non-canonical) extension types.

So I'm not sure there's still a need for arbitrary binary data in key-value pairs (though, of course, it might be a good idea if we were starting over and redesigning Arrow).
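
As a rough illustration of the two points above, here is a minimal sketch using only the Python standard library. It is not PyArrow's actual serialization code: the `params` dict and the `is_valid_utf8` helper are hypothetical names chosen for the example. It shows that a pickled payload is not valid UTF8, while a JSON payload for the same parameters is.

```python
import json
import pickle


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical extension type parameters, for illustration only.
params = {"unit": "ms", "precision": 3}

# Binary pickle protocols start with the 0x80 opcode byte,
# which can never begin a valid UTF-8 sequence.
pickled = pickle.dumps(params, protocol=2)

# A JSON payload encoded as UTF-8 is valid UTF-8 by construction.
json_payload = json.dumps(params).encode("utf-8")

print(is_valid_utf8(pickled), is_valid_utf8(json_payload))
```

Under these assumptions, only the JSON payload would pass a UTF8 check on metadata values.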


Regards

Antoine.



Le 07/05/2026 à 18:37, Raphael Taylor-Davies a écrit :

Hi All,

One thing to perhaps be aware of is that pyarrow at least used to
produce non-UTF8 data in the metadata [1].

This was actually reported as a bug in arrow-rs, which validates this [2]

Kind Regards,

Raphael Taylor-Davies

[1]: https://github.com/apache/arrow/issues/20107
[2]: https://github.com/apache/arrow-rs/issues/5547

On 07/05/2026 16:33, Antoine Pitrou wrote:


Yes, I think validation methods would have to be enhanced to cover this.

Regards

Antoine.


Le 07/05/2026 à 17:28, Rusty Conover a écrit :

Hi Antoine,

Your idea seems very reasonable to me. Also remember record batches
themselves have custom_metadata that is also subject to this UTF-8
only restriction.

I'd imagine we could suggest that the validation methods in the various
implementations also check for this.
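
A minimal sketch of what such a check could look like, in plain Python and assuming metadata is held as a mapping of raw bytes keys to raw bytes values. The function name `find_non_utf8_entries` and the sample metadata are hypothetical; this is not the validation API of any Arrow implementation.

```python
from typing import Dict, List, Tuple


def find_non_utf8_entries(metadata: Dict[bytes, bytes]) -> List[Tuple[bytes, str]]:
    """Return (key, "key" | "value") for each part that is not valid UTF-8."""
    problems = []
    for key, value in metadata.items():
        for part, raw in (("key", key), ("value", value)):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                problems.append((key, part))
    return problems


# Hypothetical metadata: the second value holds bytes that are not UTF-8.
metadata = {b"origin": b"sensor-7", b"blob": b"\xff\xfe\x00"}
bad = find_non_utf8_entries(metadata)
print(bad)
```

The same loop could be applied to field, schema and record batch custom_metadata alike, since all of them are key-value pairs.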

Rusty

On Thu, May 7, 2026, at 10:53 AM, Antoine Pitrou wrote:

Hello,

In https://github.com/apache/arrow/issues/49058 it was reported that
PyArrow (and therefore Arrow C++) happily accepts non-UTF8 metadata on
fields and schemas; however, that metadata is IPC-encoded as the
`string` type in Flatbuffers which is theoretically restricted to UTF8
(apparently, the Flatbuffers validator does not check for that).

Several questions ensue:

1) Should the C++ IPC writer - and potentially other implementations -
ensure that only valid UTF8 strings can be serialized as Flatbuffers
`string`s (which would apply not only to key-value metadata strings,
but also timezones and field names)?

2) Should Arrow C++ reject key-value metadata with non-UTF8 strings as
invalid, even if they are never serialized over IPC?

3) Should the C Data Interface recommend that type metadata keys and
values (*) be valid UTF8 as well?

(*)
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata


Thanks

Antoine.
