On 10/02/2022 at 14:09, Alessandro Molina wrote:
> Mentioned this already to Joris, but want to make sure we don't miss it.
>
> C-Data and thus ARROW:extension:metadata was mostly designed for shipping
> data to different processes within the same host.

ARROW:extension:metadata is unrelated to the C data interface. It can be transmitted over IPC.

> JSON would definitely get rid of the endianness problem at the cost of a
> greater size and a more complex parser. But there are super-minimal JSON
> parsers designed specifically for embedding like Jasmine (
> https://github.com/zserge/jsmn )

Interesting. Jsmn is more of a lexer than a complete parser (for example, you need to decode numbers/booleans/... yourself, you need to unescape strings yourself). That said, for our use case it might be sufficient.

Regards

Antoine.


On Wed, Feb 9, 2022 at 2:51 AM Dewey Dunnington <de...@voltrondata.com>
wrote:

I'll share a bit more about geospatial extension types that Joris
mentioned. I'm new to the Arrow community and didn't know that there were
any restrictions on metadata values (the C Data interface docs don't seem
to indicate that there are restrictions, or if it's there I missed it!), so
I used the same encoding for the ARROW:extension:metadata that's used to
encode the parent metadata (int32 num_items, int32 name_len,
char[name_len], int32 value_len, char[value_len], etc.). I did this
because I needed two key/value pairs (geodesic = true/false; crs =
some_coordinate_reference_system) and already had the code to iterate over
the parent metadata. I'm not saying that it's any pinnacle of elegant code
(still very much a prototype), but it only takes about 30 lines of C to do
this [1].
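
For readers who want to see the layout described above in concrete terms,
here is a minimal sketch in Python (not the geoarrow.hpp code linked in [1]).
It assumes the integers are 32-bit and stored in native byte order, and the
function names and the "EPSG:4326" value are made up for illustration only.

```python
import struct


def encode_metadata(pairs):
    """Pack {bytes: bytes} as: int32 num_items, then for each pair
    int32 name_len, name bytes, int32 value_len, value bytes."""
    out = [struct.pack("=i", len(pairs))]  # "=i": native byte order, 4 bytes
    for name, value in pairs.items():
        for item in (name, value):
            out.append(struct.pack("=i", len(item)))
            out.append(item)
    return b"".join(out)


def decode_metadata(buf):
    """Inverse of encode_metadata; raises ValueError on truncated input."""
    (num_items,) = struct.unpack_from("=i", buf, 0)
    offset = 4
    pairs = {}
    for _ in range(num_items):
        fields = []
        for _ in range(2):  # first the name, then the value
            (length,) = struct.unpack_from("=i", buf, offset)
            offset += 4
            if length < 0 or offset + length > len(buf):
                raise ValueError("truncated or corrupt metadata")
            fields.append(bytes(buf[offset:offset + length]))
            offset += length
        pairs[fields[0]] = fields[1]
    return pairs


# Hypothetical values for the two parameters mentioned above.
blob = encode_metadata({b"geodesic": b"true", b"crs": b"EPSG:4326"})
params = decode_metadata(blob)
```

Because the length prefixes are written in native byte order, the same bytes
would be read differently on a host with the opposite endianness, which is
the concern raised elsewhere in this thread.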

I prototyped the extension types for geospatial using the C data interface,
the idea being that a header-only helper file (geoarrow.hpp) could be
distributed that would make it an attractive and easy alternative to
well-known binary (WKB) to pass geometries around between libraries (e.g.,
GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also
vendor a JSON parser [2] seems a bit anti-social and restricts where that
extension type is useful, although I understand that it's not the use case
that many might have.

There are definitely reasonable ways to do what I'm trying to do without
resorting to a binary encoding, and JSON could probably even work...I'm
just trying to share the use-case since it seems like this kind of
environment isn't how folks envisioned extension types being used.

[1]

https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542
[2] The commonly vendored JSON parser in geospatial libraries is this one:
https://github.com/nlohmann/json

On Tue, Feb 8, 2022 at 7:58 PM Weston Pace <weston.p...@gmail.com> wrote:

I think I'm +0 but lean slightly towards JSON.

In favor of binary I would guess that most extension types are going
to have relatively simple parameterization (to the point that
protobuf/flatbuffers isn't really needed).  For example, the substrate
consumer PR has five extension types at the moment (e.g. uuid,
varchar) and only two of them are parameterized and each of these by a
single int32_t.  It might be interesting to see what kinds of
extension types the geospatial community uses.

That being said, this sort of parsing isn't really on any kind of
critical path.  It's very likely that users (not Arrow developers)
will be creating and working with extension types.  These users are
likely going to default to JSON (or pickle or XML).  If our "well
known types" use JSON then it will be more easily recognizable to
users what is going on.

-Weston

On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:

On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão <
jorgecarlei...@gmail.com>
wrote:

...

With respect to binary, IMO the challenge is:
* we state that backward incompatible changes to the c data interface
require a new spec [1]


Note that this discussion wouldn't change anything about the C Data
Interface spec itself. The discussion is only about the *value* that is put
in one of the key-value metadata fields. The C Data Interface spec defines
how the metadata needs to be stored, but doesn't specify anything about the
actual value of one of the key-value metadata fields.


* we state that the metadata is a binary string [2]
* a valid string is a subset of all valid byte arrays and thus removing
"*string*" from the spec is backward incompatible

If we write invalid utf8 to it and a reader assumes utf8 when reading it,
we trigger undefined behavior.

I was a bit surprised by ARROW-15613 - my understanding is that the C++
implementation is not following the spec, and if we at arrow2 were not
checking for utf8, we would be exposing a vulnerability (at least according
to Rust's standards). We just checked it out of luck (it is O(1), so why
not).
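
To make the UTF-8 concern concrete, here is a small sketch of the kind of
check a reader can apply before treating a metadata value as a string. It is
written in Python for brevity and is not the arrow2 code; the helper name is
made up for the example.

```python
def is_valid_utf8(value: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A plain text value passes the check.
text_ok = is_valid_utf8(b"crs")

# A little-endian int32 length prefix of 200 starts with byte 0xC8, which
# opens a two-byte UTF-8 sequence that the following 0x00 does not complete.
binary_ok = is_valid_utf8(b"\xc8\x00\x00\x00")
```

Binary-encoded values can therefore fail such a check depending on their
contents, which is why a reader that assumes UTF-8 without validating is at
risk.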


Yes, the C++ implementation is indeed not following the spec. See the
"[DISCUSS] Binary Values in Key value pairs" thread (
https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr).
Let's maybe keep this part of the discussion there?


