I'll share a bit more about geospatial extension types that Joris
mentioned. I'm new to the Arrow community and didn't know that there were
any restrictions on metadata values (the C Data interface docs don't seem
to indicate that there are restrictions, or if it's there I missed it!), so
I used the same encoding for the ARROW:extension:metadata that's used to
encode the parent metadata (int32 num_items, int32 name_len,
char[name_len], int32 value_len, char[value_len],  etc..). I did this
because I needed two key/value pairs (geodesic = true/false; crs =
some_coordinate_reference_system) and already had the code to iterate over
the parent metadata. I'm not saying that it's any pinnacle of elegant code
(still very much a prototype), but it only takes about 30 lines of C to do
this [1].

I prototyped the extension types for geospatial using the C data interface,
the idea being that a header-only helper file (geoarrow.hpp) could be
distributed that would make it an attractive and easy alternative to
well-known binary (WKB) to pass geometries around between libraries (e.g.,
GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also
vendor a JSON parser [2] seems a bit anti-social and restricts where that
extension type is useful, although I understand that it's not the use case
that many might have.

There are definitely reasonable ways to do what I'm trying to do without
resorting to a binary encoding, and JSON could probably even work...I'm
just trying to share the use-case since it seems like this kind of
environment isn't how folks envisioned extension types being used.

[1]
https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542
[2] The commonly vendored JSON parser in geospatial libraries is this one:
https://github.com/nlohmann/json

On Tue, Feb 8, 2022 at 7:58 PM Weston Pace <weston.p...@gmail.com> wrote:

> I think I'm +0 but lean slightly towards JSON.
>
> In favor of binary I would guess that most extension types are going
> to have relatively simple parameterization (to the point that
> protobuf/flatbuffers isn't really needed).  For example, the substrate
> consumer PR has five extension types at the moment (e.g. uuid,
> varchar) and only two of them are parameterized and each of these by a
> single int32_t.  It might be interesting to see what kinds of
> extension types the geospatial community uses.
>
> That being said, this sort of parsing isn't really on any kind of
> critical path.  It's very likely that users (not Arrow developers)
> will be creating and working with extension types.  These users are
> likely going to default to JSON (or pickle or XML).  If our "well
> known types" use JSON then it will be more easily recognizable to
> users what is going on.
>
> -Weston
>
> On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com>
> > wrote:
> >
> > > ...
> > >
> > > Wrt to binary, imo the challenge is:
> > > * we state that backward incompatible changes to the c data interface
> > > require a new spec [1]
> > >
> >
> > Note that this discussion wouldn't change anything about the C Data
> > Interface spec itself. The discussion is only about the *value* that is
> put
> > in one of the key-value metadata fields. The C Data Interface spec
> defines
> > how the metadata needs to be stored, but doesn't specify anything about
> the
> > actual value of one of the key-value metadata fields.
> >
> >
> > > * we state that the metadata is a binary string [2]
> > > * a valid string is a subset of all valid byte arrays and thus
> removing "
> > > *string*" from the spec is backward incompatible
> > >
> > > If we write invalid utf8 to it and a reader assumes utf8 when reading
> it,
> > > we trigger undefined behavior.
> > >
> > > I was a bit surprised by ARROW-15613 - my understanding is that the c++
> > > implementation is not following the spec, and if we at arrow2 were not
> be
> > > checking for utf8, we would be exposing a vulnerability (at least
> according
> > > to Rust's standards). We just checked it out of luck (it is O(1), so
> why
> > > not).
> > >
> >
> > Yes, the C++ implementation is indeed not following the spec. See the
> > "[DISCUSS] Binary Values in Key value pairs" thread (
> > https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr). Let's
> > maybe keep this part of the discussion there?
>

Reply via email to