I’ve been collaborating with Steven Phillips (who’s been working on
the Java Arrow impl recently) to show a proof of concept ping-ponging
Arrow data back and forth between the Java and C++ implementations. We
aren’t 100% there yet, but today I got a C++ to C++ round trip through
a memory map working (for primitive types, e.g. integers):

https://github.com/apache/arrow/pull/28

We created a small metadata specification using the Flatbuffers IDL;
feedback would be very welcome here:

https://github.com/apache/arrow/pull/28/files#diff-520b20e87eb508faa3cc7aa9855030d7

This includes:

- Logical schemas
- Data headers: compact descriptions of row batches associated with a
particular schema

The idea is that two systems agree up front on “what is the schema” so
that only the data header (containing memory offsets and sizes and
some other important data-dependent metadata) needs to be exchanged
for each row batch. After working through this in some real code, I’m
feeling fairly good that it meets the needs of Arrow for the time
being, but there may be some unknown requirements that it would be
good to learn about sooner rather than later.
After some design review and iteration we’ll want to document the
metadata specification as part of the format in more gory detail.
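
To make the schema / data header split concrete, here is a rough
Python sketch of the idea. It is not the Flatbuffers metadata from the
pull request and not the C++ code: the SCHEMA list, the dict-based
header, and the function names are all made up for illustration. Both
sides share the schema, the writer records an (offset, size) pair per
column, and the reader rebuilds typed column views over a memory map
without copying.

```python
import mmap
import struct
import tempfile

# Hypothetical schema agreed on up front: (column name, struct type code).
SCHEMA = [("id", "i"), ("score", "d")]


def write_batch(f, columns, num_rows):
    """Append each column's values to f and return a data header
    holding one (offset, size) pair per column."""
    buffers = []
    for (_, code), values in zip(SCHEMA, columns):
        start = f.tell()
        # "=" means native byte order with standard sizes (int32, float64).
        f.write(struct.pack(f"={num_rows}{code}", *values))
        buffers.append((start, f.tell() - start))
    return {"num_rows": num_rows, "buffers": buffers}


def read_batch(mm, header):
    """Rebuild typed column views over the mapped memory (no data copied)."""
    view = memoryview(mm)
    columns = {}
    for (name, code), (offset, size) in zip(SCHEMA, header["buffers"]):
        columns[name] = view[offset:offset + size].cast(code)
    return columns


def demo_round_trip():
    with tempfile.TemporaryFile() as f:
        header = write_batch(f, [[1, 2, 3], [0.5, 1.5, 2.5]], num_rows=3)
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            columns = read_batch(mm, header)
            # Copy out to plain lists only so the result outlives the map.
            result = {name: col.tolist() for name, col in columns.items()}
            # Views must be released before the map can be closed.
            for col in columns.values():
                col.release()
    return header, result
```

Only the small header dict changes from batch to batch here; the
schema never travels with the data.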

(Note: We are using Flatbuffers for convenience, performance, and
development simplicity — one feature that is especially nice is its
union support, but it can be done in other serialization tools, too)

It'd be great to get some benchmark code written so that we are also
able to make technical decisions on the basis of measurable
performance implications. For example, while the read-path of the
above code does not copy any data, it would be useful to know how fast
reassembling the row batch data structure is and how this scales with
the number of columns.
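
As a starting point, a benchmark along these lines might look like the
Python sketch below. It is a toy stand-in, not a measurement of the
C++ implementation: make_batch, reassemble, and time_reassembly are
hypothetical names, the columns are all int32, and the "header" is
just a list of (offset, size) pairs. It times only the step that
rebuilds column views from the header, for a growing number of columns.

```python
import struct
import time

INT_SIZE = struct.calcsize("i")  # size of a native int, the view item type


def make_batch(num_columns, num_rows):
    """Build one contiguous zero-filled buffer of int columns plus a
    header of (offset, size) pairs, one per column."""
    column_size = INT_SIZE * num_rows
    data = bytes(column_size * num_columns)
    header = [(i * column_size, column_size) for i in range(num_columns)]
    return data, header


def reassemble(data, header):
    """Rebuild typed column views from the header without copying data."""
    view = memoryview(data)
    return [view[offset:offset + size].cast("i") for offset, size in header]


def time_reassembly(num_columns, num_rows=1024, repeats=200):
    """Return the best observed time in seconds for one reassembly."""
    data, header = make_batch(num_columns, num_rows)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        columns = reassemble(data, header)
        best = min(best, time.perf_counter() - start)
    assert len(columns) == num_columns
    return best


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} columns: {time_reassembly(n) * 1e6:.1f} us")
```

Taking the best of several repeats is one simple way to reduce timer
noise; a real benchmark would want to report variance as well.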

best regards,
Wes
