I'll share a bit more about geospatial extension types that Joris mentioned. I'm new to the Arrow community and didn't know that there were any restrictions on metadata values (the C Data interface docs don't seem to indicate that there are restrictions, or if it's there I missed it!), so I used the same encoding for the ARROW:extension:metadata that's used to encode the parent metadata (int32 num_items, int32 name_len, char[name_len], int32 value_len, char[value_len], etc..). I did this because I needed two key/value pairs (geodesic = true/false; crs = some_coordinate_reference_system) and already had the code to iterate over the parent metadata. I'm not saying that it's any pinnacle of elegant code (still very much a prototype), but it only takes about 30 lines of C to do this [1].
I prototyped the extension types for geospatial using the C data interface, the idea being that a header-only helper file (geoarrow.hpp) could be distributed that would make it an attractive and easy alternative to well-known binary (WKB) to pass geometries around between libraries (e.g., GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also vendor a JSON parser [2] seems a bit anti-social and restricts where that extension type is useful, although I understand that it's not the use case that many might have. There are definitely reasonable ways to do what I'm trying to do without resorting to a binary encoding, and JSON could probably even work...I'm just trying to share the use-case since it seems like this kind of environment isn't how folks envisioned extension types being used. [1] https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542 [2] The commonly vendored JSON parser in geospatial libraries is this one: https://github.com/nlohmann/json On Tue, Feb 8, 2022 at 7:58 PM Weston Pace <weston.p...@gmail.com> wrote: > I think I'm +0 but lean slightly towards JSON. > > In favor of binary I would guess that most extension types are going > to have relatively simple parameterization (to the point that > protobuf/flatbuffers isn't really needed). For example, the substrate > consumer PR has five extension types at the moment (e.g. uuid, > varchar) and only two of them are parameterized and each of these by a > single int32_t. It might be interesting to see what kinds of > extension types the geospatial community uses. > > That being said, this sort of parsing isn't really on any kind of > critical path. It's very likely that users (not Arrow developers) > will be creating and working with extension types. These users are > likely going to default to JSON (or pickle or XML). If our "well > known types" use JSON then it will be more easily recognizable to > users what is going on. > > -Weston > > On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche > <jorisvandenboss...@gmail.com> wrote: > > > > On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> > > wrote: > > > > > ... > > > > > > Wrt to binary, imo the challenge is: > > > * we state that backward incompatible changes to the c data interface > > > require a new spec [1] > > > > > > > Note that this discussion wouldn't change anything about the C Data > > Interface spec itself. The discussion is only about the *value* that is > put > > in one of the key-value metadata fields. The C Data Interface spec > defines > > how the metadata needs to be stored, but doesn't specify anything about > the > > actual value of one of the key-value metadata fields. > > > > > > > * we state that the metadata is a binary string [2] > > > * a valid string is a subset of all valid byte arrays and thus > removing " > > > *string*" from the spec is backward incompatible > > > > > > If we write invalid utf8 to it and a reader assumes utf8 when reading > it, > > > we trigger undefined behavior. > > > > > > I was a bit surprised by ARROW-15613 - my understanding is that the c++ > > > implementation is not following the spec, and if we at arrow2 were not > be > > > checking for utf8, we would be exposing a vulnerability (at least > according > > > to Rust's standards). We just checked it out of luck (it is O(1), so > why > > > not). > > > > > > > Yes, the C++ implementation is indeed not following the spec. See the > > "[DISCUSS] Binary Values in Key value pairs" thread ( > > https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr). Let's > > maybe keep this part of the discussion there? >