I'm new to this discussion and Arrow generally and appreciate Joris bringing this up. While the forward- and backward-compatability of this is over my head, I think that it is important to provide a path for extension types to have serialized metadata as binary because the alternatives are (in my opinion) not great and might lead to extension developers inventing their own key/value encoding.
With binary on the table there are options like protobuf, flatbuffers, or memcpy of a struct that extension type implementations can use depending on the complexity of the extension metadata that needs to be communicated and whether or not the metadata is intended to be used in compiled code (i.e., are those values needed to do anything useful with the bytes in the buffers?). Without binary on the table, an extension type implementation has to vendor in a JSON reader/writer, vendor in a base64 encoder/decoder, or invent some custom serialization that's easier to parse than either of those. Cheers, -dewey On Fri, Feb 4, 2022 at 7:06 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Reviving this thread, I don't think anything happened in the meantime on > this topic? > > From rereading the thread, it seems David mentioned two possible ideas: > - A new [byte] binary_value field in the existing KeyValue type, next to > the existing string value field. And if you have valid string metadata, the > same value can be put in both fields by reusing the offset > - Relax the existing string value field to [byte], but also require > implementations to null-terminate binary values regardless to ensure > compatibility for old readers > - And in addition to those options, there is also the safest option of > adding a new BinaryKeyValue type and add that next to KeyValue everywhere > this is used. > > And we need to further assess which option is viable given the > forward/backward compatibility constraints? > > I want to point out though that in pyarrow we already _do_ use binary > metadata for extension types for several years. The built-in > PyExtensionType class uses a pickle dump as the serialized metadata (so > which is put in the ARROW:extension:metadata key in the key-value metadata > of the Field), and the bytes from a pickle dump are certainly not valid > UTF-8 I think. > I don't know how much this is used in the wild. In pandas, our arrow > extension types don't use this pickle-based PyExtensionType but define > their own serialization which currently returns a valid UTF-8 string. But > for example the Ray project seems to use PyExtensionType ( > > https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType > ). > > Joris > > On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > It still sounds like adding a new type might be the safest approach (and > > marking the old type as discouraged). > > > > On Mon, Aug 23, 2021 at 11:18 AM David Li <lidav...@apache.org> wrote: > > > > > I believe so. > > > > > > The encoding of a string in Flatbuffers is [byte] with a null > terminator > > > not included in the length, so old files should still be readable (they > > > would simply not see the terminator anymore). And conversely, > continuing > > to > > > write the null terminator means new files should still be readable by > > > existing implementations, so long as they don't have binary metadata > (but > > > we could increment the metadata version before allowing that). There > > still > > > remains the issue of non-UTF8 data inside the "string" value; on that > > > point, this is a non-answer (or I guess, it canonicalizes the current > > > unintentional behavior). > > > > > > It would complicate the Java API as that currently works with Strings, > > but > > > that will be an issue for any attempt to add binary metadata. > > > > > > -David > > > > > > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote: > > > > > > > > Le 23/08/2021 à 17:52, David Li a écrit : > > > > > Another way forward might be to relax the value type to [byte], but > > > also require implementations to null-terminate binary values > regardless. > > > The C++ Flatbuffers implementation does this already [1] (though not > the > > > Java one [2]). Old implementations validating UTF8-ness would still be > > > unable to read anything with binary metadata (which is the case they're > > > already in), but implementations just validating for null-termination > > would > > > continue to work. And again, I believe this is in-spec for Flatbuffers > > > since string lengths don't include the null terminator. > > > > > > > > > > [1]: > > > > > > https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670 > > > > > [2]: > > > > > > https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637 > > > > > Though it seems we could emulate C++ here by calling addByte(0) > > > ourselves before createByteVector. > > > > > > > > Would that allow for both forwards and backwards compatibility when > the > > > > metadata is valid UTF8? > > > > > > > > Regards > > > > > > > > Antoine. > > > > > > > > > > > > > >