Re: [DISCUSS] C-level in-process array protocol

Wes McKinney Mon, 30 Sep 2019 15:40:46 -0700

A couple things:

* I think a C protocol / FFI for Arrow arrays/vectors would be better
off having the same "shape" as an assembled array. Note that the C
structs here have very nearly the same "shape" as the data structure
representing a C++ Array object [1]. The disassembly and reassembly
here is substantially simpler than in the IPC protocol. A recursive
structure in Flatbuffers would make RecordBatch messages much larger,
so the flattened / disassembled representation we use for serialized
record batches is the correct one for IPC.

* The "formal" C protocol having the "assembled" shape means that many
minimal Arrow users won't have to implement any separate data
structures. They can just use the C struct directly or a slightly
wrapped version thereof with some convenience functions.

* I think that requiring building a Flatbuffer for minimal use cases
(e.g. communicating simple record batches with primitive types) pushes
implementation burden onto minimal users.
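
To make the "assembled" shape concrete, here is a minimal sketch in C. It
assumes a hypothetical node type, DemoNestedArray, that is not taken from
the draft proposal: each node carries its own buffers and points at its
children, so a consumer can walk a nested array with plain recursion and
never has to reassemble anything from a flattened list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "assembled" node (illustration only, not the proposed ABI):
 * every node holds its own buffers and pointers to its child arrays. */
struct DemoNestedArray {
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  const void** buffers;               /* n_buffers entries, may hold NULL */
  int64_t n_children;
  struct DemoNestedArray** children;  /* n_children entries */
};

/* Count the buffers of a node and of all its descendants. A flattened
 * layout would list these same buffers in one depth-first sequence. */
int64_t demo_count_buffers(const struct DemoNestedArray* array) {
  int64_t total = array->n_buffers;
  for (int64_t i = 0; i < array->n_children; ++i) {
    total += demo_count_buffers(array->children[i]);
  }
  return total;
}

/* Build a struct<int32> array on the stack and count its buffers:
 * one validity buffer for the parent, validity + data for the child. */
int64_t demo_struct_of_int32(void) {
  static const int32_t values[3] = {1, 2, 3};
  const void* child_buffers[2] = {NULL, values};  /* no nulls: validity absent */
  struct DemoNestedArray child = {3, 0, 2, child_buffers, 0, NULL};
  struct DemoNestedArray* children[1] = {&child};
  const void* parent_buffers[1] = {NULL};         /* no nulls: validity absent */
  struct DemoNestedArray parent = {3, 0, 1, parent_buffers, 1, children};
  return demo_count_buffers(&parent);
}
```

The nesting of the type is carried by the pointers themselves, which is why
no separate schema-to-buffer mapping step is needed in this sketch.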

I think the mantra of the C protocol should be the following:

* Users of the protocol have to write little to no code to use it. For
example, populating an INT32 array should require only a few lines of
code.
* The data structure in the protocol is suitable as an in-memory data
structure for recursive assembly of nested structures.
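
As a rough illustration of the "few lines of code" goal, the sketch below
populates an INT32 array. The DemoBuffer and DemoArray structs are
assumptions loosely modeled on the draft struct quoted further down this
thread, not a formal definition, and the "i" format character for int32 is
assumed by analogy with Python's `struct` module.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structs, loosely modeled on the draft quoted in this thread. */
struct DemoBuffer {
  void* data;
  int64_t nbytes;
};

struct DemoArray {
  const char* format;  /* "i" is assumed to mean int32 */
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct DemoBuffer* buffers;
  /* Called by the consumer when it no longer needs the array. */
  void (*release)(struct DemoArray*);
};

/* Free everything the producer allocated and mark the array as released. */
void demo_release(struct DemoArray* array) {
  for (int64_t i = 0; i < array->n_buffers; ++i) {
    free(array->buffers[i].data);  /* free(NULL) is a no-op */
  }
  free(array->buffers);
  array->buffers = NULL;
  array->release = NULL;
}

/* Fill `out` with the values 0..n-1. Returns 0 on success, -1 on failure. */
int demo_make_int32(struct DemoArray* out, int64_t n) {
  if (n <= 0) return -1;
  struct DemoBuffer* buffers = calloc(2, sizeof *buffers);
  int32_t* values = malloc((size_t)n * sizeof *values);
  if (buffers == NULL || values == NULL) {
    free(buffers);
    free(values);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)i;

  /* buffers[0] is the validity bitmap, left absent because there are no nulls. */
  buffers[1].data = values;
  buffers[1].nbytes = n * (int64_t)sizeof(int32_t);

  out->format = "i";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = demo_release;
  return 0;
}
```

A consumer would read `buffers[1].data` as `int32_t*` and call
`array.release(&array)` once it is done with the data.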

I think that having a string miniformat or a pre-parsed type struct
with enum values (along the lines of what Antoine is describing above)
places less burden on downstream users.
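
To give a feel for how small either option is, the sketch below parses a
string miniformat into a pre-parsed struct. The format characters ("i",
"l", "g", "w:N"), the DemoTypeId enum and demo_parse_format are all
assumptions made up for illustration; the two struct fields borrow their
names from the pre-parsed fields Antoine lists below.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type ids; the real enum values would come from the spec. */
enum DemoTypeId { DEMO_INT32 = 1, DEMO_INT64, DEMO_FLOAT64, DEMO_FIXED_BINARY };

/* Pre-parsed form of a type description. */
struct DemoParsedType {
  int32_t type;
  int32_t type_width;  /* only meaningful for DEMO_FIXED_BINARY */
};

/* Parse an assumed miniformat: "i" = int32, "l" = int64, "g" = float64,
 * "w:N" = fixed-width binary of N bytes. Returns 0 on success, -1 on error. */
int demo_parse_format(const char* format, struct DemoParsedType* out) {
  out->type_width = 0;

  /* Single-character formats cover the primitive types. */
  if (format[0] != '\0' && format[1] == '\0') {
    switch (format[0]) {
      case 'i': out->type = DEMO_INT32; return 0;
      case 'l': out->type = DEMO_INT64; return 0;
      case 'g': out->type = DEMO_FLOAT64; return 0;
      default: return -1;
    }
  }

  /* Parameterized format: "w:" followed by a positive decimal width. */
  if (format[0] == 'w' && format[1] == ':') {
    if (format[2] < '0' || format[2] > '9') return -1;
    char* end = NULL;
    long width = strtol(format + 2, &end, 10);
    if (*end != '\0' || width <= 0 || width > INT32_MAX) return -1;
    out->type = DEMO_FIXED_BINARY;
    out->type_width = (int32_t)width;
    return 0;
  }

  return -1;
}
```

With the string form the consumer runs a parser like this once per array;
with the pre-parsed form the producer fills in the struct directly and the
consumer only switches on `type`.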

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L203

On Mon, Sep 30, 2019 at 4:08 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> FlatCC is still a dependency, with generated files etc.
> Perhaps you want to evaluate FlatCC on a schema-like example and see
> what the generated code and compile instructions look like?
>
> I'll point out again that the format string in my proposal uses an
> extremely simple mini-format, that should be parsable very easily by any
> developer, even in raw C:
> https://github.com/apache/arrow/blob/3806fa9ba3ddf95f0d09b865071bf19c5e912756/docs/source/format/CProtocol.rst#data-type-description----format-strings
>
> The parent-child structure in the schema is represented as-is in the
> ArrowArray parent-child relationship, so it doesn't need any encoding.
> Using Flatbuffers for an enum-like field + (at most) a couple parameters
> sounds overkill.
>
> Another possibility would be to replace the format string with
> pre-parsed fields, for example:
>
>   int32_t type;
>   int32_t subtype;      // type-dependent (e.g. unit for temporal types)
>   int32_t type_width;   // for width-parameterized types
>   const int8_t* child_ids;   // for unions
>   const char* auxiliary_type_param;  // e.g. timezone for timestamp
>
> The downside is that there are more fields to consider (including two
> optional pointers).
>
> Regards
>
> Antoine.
>
>
> Le 30/09/2019 à 22:48, Ben Kietzman a écrit :
> > FlatCC seems germane: https://github.com/dvidelabs/flatcc
> >
> > It compiles flatbuffer schemas down to (idiomatic?) C
> >
> > Perhaps the schema and batch serialization problems should be solved by
> > storing everything in the flatbuffer format.
> > Then the results of running flatcc plus a few simple helpers can be checked
> > in to provide an accessible C API.
> > With respect to lifetime, Antoine has already done good work on specifying
> > a move-only contract which could probably be adapted.
> >
> >
> > On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> One basic design point is to allow exchanging Arrow data with no
> >> mandatory dependency (the exception is JSON and base64 if you want to
> >> act on metadata - but that's highly optional, and those are extremely
> >> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
> >> not only does it introduce a library, but it requires the use of a
> >> compiler to produce generated code.  It also requires familiarizing
> >> oneself with, well, Flatbuffers :-)
> >>
> >> We can of course discuss this and feel it's not a problem.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 29/09/2019 à 19:47, Wes McKinney a écrit :
> >>> There are two pieces of serialized data needed to communicate a record
> >>> batch from one library to another
> >>>
> >>> * Serialized schema (i.e. what's in Schema.fbs)
> >>> * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
> >>>
> >>> You _do_ need to use a Flatbuffers library to fully create these
> >>> message types to interact with any existing record batch disassembly /
> >>> reassembly.
> >>>
> >>> I think I'm most concerned about having a new way to serialize
> >>> schemas. We already have JSON-based schema serialization for
> >>> integration test purposes, so one possibility is to standardize that
> >>> and make it a more formalized part of the project specification.
> >>>
> >>> As far as a C protocol, I don't see an especial downside to using the
> >>> Flatbuffers schema to communicate types.
> >>>
> >>> Another thought is to not deviate from the flattened
> >>> Flatbuffers-styled representation but to translate the Flatbuffers
> >>> types into C types: namely a C struct-based version of the
> >>> "RecordBatch" message.
> >>>
> >>> Independent of the means to communicate the two pieces of serialized
> >>> information above (respectively: schemas and record batch field memory
> >>> addresses and field lengths), having a C-based FFI where projects can
> >>> drop in a header file containing the ABI they are supposed to
> >>> implement seems pretty reasonable to me.
> >>>
> >>> If we don't define a standardized in-memory FFI (whether it uses the
> >>> Flatbuffers objects as inputs/outputs or not) then downstream projects
> >>> will devise their own, and that will cause issues long term.
> >>>
> >>> On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <anto...@python.org> wrote:
> >>>>
> >>>>
> >>>> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> >>>>> * No dependency on Flatbuffers.
> >>>>> * No buffer reassembly (data is already exposed in logical Arrow
> >>>>> format).
> >>>>> * Zero-copy by design.
> >>>>> * Easy to reimplement from scratch.
> >>>>>
> >>>>> I don't see how the flatbuffer pattern for data headers doesn't
> >>>>> accomplish all of these things. At its definition, it is a very simple
> >>>>> representation of data that could be worked with independently of the
> >>>>> flatbuffers codebase. It was designed so systems could map directly
> >>>>> into that memory without interacting with a flatbuffers library.
> >>>>>
> >>>>> Specifically the following three structures were designed to already
> >>>>> allow what I think this proposal is trying to recreate. All three are
> >>>>> very simple to construct in a direct, non-flatbuffer dependent
> >>>>> read/write pattern.
> >>>>
> >>>> Are they?  Personally, I wouldn't know how to do that.  I don't know
> >>>> which encoding Flatbuffers use, whether it's C ABI-compatible (how could
> >>>> it be? if it's portable across different platforms, then it's probably
> >>>> not compatible with any particular platform's C ABI, or only as a
> >>>> coincidence), how I'm supposed to make use of the "offset" field, or
> >>>> what the lifetime / ownership of all this data is.
> >>>>
> >>>> I may be missing something, but if the answer is that it's easy to
> >>>> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> >>>> project's source code, I'm a bit skeptical.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>>>
> >>>>> struct FieldNode {
> >>>>>   length: long;
> >>>>>   null_count: long;
> >>>>> }
> >>>>>
> >>>>> struct Buffer {
> >>>>>   offset: long;
> >>>>>   length: long;
> >>>>> }
> >>>>>
> >>>>> table RecordBatch {
> >>>>>   length: long;
> >>>>>   nodes: [FieldNode];
> >>>>>   buffers: [Buffer];
> >>>>> }
> >>>>>
> >>>>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >>>>>
> >>>>>> I'm not clear on why we need to introduce something beyond what
> >>>>>> flatbuffers already provides. Can someone explain that to me? I'm not
> >>>>>> really a fan of introducing a second representation of the same data
> >>>>>> (as I understand it).
> >>>>>>
> >>>>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>>
> >>>>>>> This is helpful, I will leave some comments on the proposal when I
> >>>>>>> can, sometime in the next week.
> >>>>>>>
> >>>>>>> I agree that it would likely be opening a can of worms to create a
> >>>>>>> semantic mapping between a generalized type grammar and Arrow's
> >>>>>>> specific logical types defined in Schema.fbs. If we go down this
> >>>>>>> route, we should probably utilize the simplest possible grammar that
> >>>>>>> is capable of encoding the Type Flatbuffers union values.
> >>>>>>>
> >>>>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I've posted a draft specification PR here, this should help orient
> >>>>>>>> the discussion a bit:
> >>>>>>>> https://github.com/apache/arrow/pull/5442
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, 18 Sep 2019 19:52:38 +0200
> >>>>>>>> Antoine Pitrou <anto...@python.org> wrote:
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> One thing that was discussed in the sync call is the ability to
> >>>>>>>>> easily pass arrays at runtime between Arrow implementations or
> >>>>>>>>> Arrow-supporting libraries in the same process, without bearing the
> >>>>>>>>> cost of linking to e.g. the C++ Arrow library.
> >>>>>>>>>
> >>>>>>>>> (for example: "Duckdb wants to provide an option to return Arrow
> >>>>>>>>> data of result sets, but they don't like having Arrow as a
> >>>>>>>>> dependency")
> >>>>>>>>>
> >>>>>>>>> One possibility would be to define a C-level protocol similar in
> >>>>>>>>> spirit to the Python buffer protocol, which some people may be
> >>>>>>>>> familiar with (*).
> >>>>>>>>>
> >>>>>>>>> The basic idea is to define a simple C struct, which is ABI-stable
> >>>>>>>>> and describes an Arrow array adequately.  The struct can be
> >>>>>>>>> stack-allocated.  Its definition can also be copied in another
> >>>>>>>>> project (or interfaced with using a C FFI layer, depending on the
> >>>>>>>>> language).
> >>>>>>>>>
> >>>>>>>>> There is no formal proposal, this message is meant to stir the
> >>>>>>>>> discussion.
> >>>>>>>>>
> >>>>>>>>> Issues to work out:
> >>>>>>>>>
> >>>>>>>>> * Memory lifetime issues: where Python simply associates the
> >>>>>>>>> Py_buffer with a PyObject owner (a garbage-collected Python object),
> >>>>>>>>> we need another means to control lifetime of pointed areas.  One
> >>>>>>>>> simple possibility is to include a destructor function pointer in
> >>>>>>>>> the protocol struct.
> >>>>>>>>>
> >>>>>>>>> * Arrow type representation.  We probably need some kind of
> >>>>>>>>> "format" mini-language to represent Arrow types, so that a type can
> >>>>>>>>> be described using a `const char*`.  Ideally, primitive types at
> >>>>>>>>> least should be trivially parsable.  We may take inspiration from
> >>>>>>>>> Python here (`struct` module format characters, PEP 3118 format
> >>>>>>>>> additions).
> >>>>>>>>>
> >>>>>>>>> Example C struct definition (not a formal proposal!):
> >>>>>>>>>
> >>>>>>>>> struct ArrowBuffer {
> >>>>>>>>>   void* data;
> >>>>>>>>>   int64_t nbytes;
> >>>>>>>>>   // Called by the consumer when it doesn't need the buffer anymore
> >>>>>>>>>   void (*release)(struct ArrowBuffer*);
> >>>>>>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>>>>>   void* user_data;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> struct ArrowArray {
> >>>>>>>>>   // Type description
> >>>>>>>>>   const char* format;
> >>>>>>>>>   // Data description
> >>>>>>>>>   int64_t length;
> >>>>>>>>>   int64_t null_count;
> >>>>>>>>>   int64_t n_buffers;
> >>>>>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
> >>>>>>>>>   // and will be released and free()ed by the release callback.
> >>>>>>>>>   struct ArrowBuffer* buffers;
> >>>>>>>>>   struct ArrowArray* dictionary;
> >>>>>>>>>   // Called by the consumer when it doesn't need the array anymore
> >>>>>>>>>   void (*release)(struct ArrowArray*);
> >>>>>>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>>>>>   void* user_data;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> Thoughts?
> >>>>>>>>>
> >>>>>>>>> (*) For the record, the reference for the Python buffer protocol:
> >>>>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >>>>>>>>> and its C struct definition:
> >>>>>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>>
> >>>>>>>>> Antoine.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>
> >
