Re: [DISCUSS] C-level in-process array protocol

Ben Kietzman Mon, 30 Sep 2019 13:49:38 -0700

FlatCC seems germane: https://github.com/dvidelabs/flatcc


It compiles Flatbuffers schemas down to (idiomatic?) C.

Perhaps the schema and batch serialization problems should be solved by
storing everything in the Flatbuffers format.
Then the output of running flatcc, plus a few simple helpers, can be checked
in to provide an accessible C API.
With respect to lifetime, Antoine has already done good work on specifying
a move-only contract, which could probably be adapted.
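
To make the lifetime point concrete, here is a minimal sketch of what a
move-only contract around a release callback could look like. It is not taken
from Antoine's draft: the struct is modeled on the ArrowBuffer example quoted
further down, and every name in it (DemoBuffer, demo_export, demo_move,
demo_release, demo_roundtrip) is hypothetical. The assumption is that a NULL
release pointer marks a struct as already released or moved from.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct, modeled on the ArrowBuffer example quoted below. */
struct DemoBuffer {
  void* data;
  int64_t nbytes;
  /* Assumption: NULL means "already released or moved from". */
  void (*release)(struct DemoBuffer*);
  void* user_data;
};

/* Counts how many times memory was actually released (demo only). */
static int demo_release_count = 0;

/* Release callback: frees the memory and marks the struct as released. */
void demo_release(struct DemoBuffer* buf) {
  free(buf->data);
  buf->data = NULL;
  buf->release = NULL;
  demo_release_count++;
}

/* Producer side: fill a caller-provided struct. Returns 0 on success. */
int demo_export(struct DemoBuffer* out, int64_t nbytes) {
  out->data = calloc(1, (size_t)nbytes);
  if (out->data == NULL) return -1;
  out->nbytes = nbytes;
  out->release = demo_release;
  out->user_data = NULL;
  return 0;
}

/* Move: bitwise copy into dst, then mark src as moved from so that
   only one struct is ever responsible for calling release. */
void demo_move(struct DemoBuffer* src, struct DemoBuffer* dst) {
  assert(src->release != NULL); /* src must still own its data */
  memcpy(dst, src, sizeof(*dst));
  src->data = NULL;
  src->release = NULL;
}

/* Export, move, then release through both handles.
   Returns the number of real releases that happened. */
int demo_roundtrip(void) {
  struct DemoBuffer a, b;
  int before = demo_release_count;
  if (demo_export(&a, 64) != 0) return -1;
  demo_move(&a, &b);
  if (b.release != NULL) b.release(&b); /* frees the memory */
  if (a.release != NULL) a.release(&a); /* skipped: a was moved from */
  return demo_release_count - before;
}
```

Calling demo_roundtrip() returns 1: the memory is freed exactly once, through
the struct that currently owns it, and the moved-from struct is inert.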


On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou <anto...@python.org> wrote:

>
> One basic design point is to allow exchanging Arrow data with no
> mandatory dependency (the exception is JSON and base64 if you want to
> act on metadata - but that's highly optional, and those are extremely
> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
> not only does it introduce a library, but it also requires the use of a
> compiler to produce generated code.  It also requires familiarizing
> oneself with, well, Flatbuffers :-)
>
> We can of course discuss this and decide that it's not a problem.
>
> Regards
>
> Antoine.
>
>
> On 29/09/2019 at 19:47, Wes McKinney wrote:
> > There are two pieces of serialized data needed to communicate a record
> > batch from one library to another
> >
> > * Serialized schema (i.e. what's in Schema.fbs)
> > * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
> >
> > You _do_ need to use a Flatbuffers library to fully create these
> > message types to interact with any existing record batch disassembly /
> > reassembly.
> >
> > I think I'm most concerned about having a new way to serialize
> > schemas. We already have JSON-based schema serialization for
> > integration test purposes, so one possibility is to standardize that
> > and make it a more formalized part of the project specification.
> >
> > As far as a C protocol, I don't see an especial downside to using the
> > Flatbuffers schema to communicate types.
> >
> > Another thought is to not deviate from the flattened
> > Flatbuffers-styled representation but to translate the Flatbuffers
> > types into C types: namely a C struct-based version of the
> > "RecordBatch" message.
> >
> > Independent of the means to communicate the two pieces of serialized
> > information above (respectively: schemas and record batch field memory
> > addresses and field lengths), having a C-based FFI where projects can
> > drop in a header file containing the ABI they are supposed to
> > implement seems pretty reasonable to me.
> >
> > If we don't define a standardized in-memory FFI (whether it uses the
> > Flatbuffers objects as inputs/outputs or not) then downstream projects
> > will devise their own, and that will cause issues long term.
> >
> > On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>
> >> On 29/09/2019 at 06:10, Jacques Nadeau wrote:
> >>> * No dependency on Flatbuffers.
> >>> * No buffer reassembly (data is already exposed in logical Arrow format).
> >>> * Zero-copy by design.
> >>> * Easy to reimplement from scratch.
> >>>
> >>> I don't see how the Flatbuffers pattern for data headers doesn't accomplish
> >>> all of these things. By its definition, it is a very simple representation
> >>> of data that could be worked with independently of the Flatbuffers codebase.
> >>> It was designed so systems could map directly into that memory without
> >>> interacting with a Flatbuffers library.
> >>>
> >>> Specifically, the following three structures were designed to already allow
> >>> what I think this proposal is trying to recreate. All three are very simple
> >>> to construct in a direct, non-Flatbuffers-dependent read/write pattern.
> >>
> >> Are they?  Personally, I wouldn't know how to do that.  I don't know
> >> which encoding Flatbuffers uses, whether it's C ABI-compatible (how could
> >> it be? if it's portable across different platforms, then it's probably
> >> not compatible with any particular platform's C ABI, or only as a
> >> coincidence), how I'm supposed to make use of the "offset" field, or
> >> what the lifetime / ownership of all this data is.
> >>
> >> I may be missing something, but if the answer is that it's easy to
> >> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> >> project's source code, I'm a bit skeptical.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> struct FieldNode {
> >>>   length: long;
> >>>   null_count: long;
> >>> }
> >>>
> >>> struct Buffer {
> >>>   offset: long;
> >>>   length: long;
> >>> }
> >>>
> >>> table RecordBatch {
> >>>   length: long;
> >>>   nodes: [FieldNode];
> >>>   buffers: [Buffer];
> >>> }
> >>>
> >>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >>>
> >>>> I'm not clear on why we need to introduce something beyond what
> >>>> Flatbuffers already provides. Can someone explain that to me? I'm not
> >>>> really a fan of introducing a second representation of the same data
> >>>> (as I understand it).
> >>>>
> >>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>
> >>>>> This is helpful, I will leave some comments on the proposal when I
> >>>>> can, sometime in the next week.
> >>>>>
> >>>>> I agree that it would likely be opening a can of worms to create a
> >>>>> semantic mapping between a generalized type grammar and Arrow's
> >>>>> specific logical types defined in Schema.fbs. If we go down this
> >>>>> route, we should probably utilize the simplest possible grammar that
> >>>>> is capable of encoding the Type Flatbuffers union values.
> >>>>>
> >>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net>
> >>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> I've posted a draft specification PR here; this should help orient the
> >>>>>> discussion a bit:
> >>>>>> https://github.com/apache/arrow/pull/5442
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Antoine.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 18 Sep 2019 19:52:38 +0200
> >>>>>> Antoine Pitrou <anto...@python.org> wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> One thing that was discussed in the sync call is the ability to easily
> >>>>>>> pass arrays at runtime between Arrow implementations or Arrow-supporting
> >>>>>>> libraries in the same process, without bearing the cost of linking to
> >>>>>>> e.g. the C++ Arrow library.
> >>>>>>>
> >>>>>>> (for example: "Duckdb wants to provide an option to return Arrow data of
> >>>>>>> result sets, but they don't like having Arrow as a dependency")
> >>>>>>>
> >>>>>>> One possibility would be to define a C-level protocol similar in spirit
> >>>>>>> to the Python buffer protocol, which some people may be familiar with (*).
> >>>>>>>
> >>>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
> >>>>>>> describes an Arrow array adequately.  The struct can be stack-allocated.
> >>>>>>> Its definition can also be copied into another project (or interfaced
> >>>>>>> with using a C FFI layer, depending on the language).
> >>>>>>>
> >>>>>>> There is no formal proposal; this message is meant to stir the discussion.
> >>>>>>>
> >>>>>>> Issues to work out:
> >>>>>>>
> >>>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
> >>>>>>> with a PyObject owner (a garbage-collected Python object), we need
> >>>>>>> another means to control the lifetime of the pointed-to memory areas.
> >>>>>>> One simple possibility is to include a destructor function pointer in
> >>>>>>> the protocol struct.
> >>>>>>>
> >>>>>>> * Arrow type representation.  We probably need some kind of "format"
> >>>>>>> mini-language to represent Arrow types, so that a type can be described
> >>>>>>> using a `const char*`.  Ideally, primitive types at least should be
> >>>>>>> trivially parsable.  We may take inspiration from Python here (`struct`
> >>>>>>> module format characters, PEP 3118 format additions).
> >>>>>>>
> >>>>>>> Example C struct definition (not a formal proposal!):
> >>>>>>>
> >>>>>>> struct ArrowBuffer {
> >>>>>>>   void* data;
> >>>>>>>   int64_t nbytes;
> >>>>>>>   // Called by the consumer when it doesn't need the buffer anymore
> >>>>>>>   void (*release)(struct ArrowBuffer*);
> >>>>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>>>   void* user_data;
> >>>>>>> };
> >>>>>>>
> >>>>>>> struct ArrowArray {
> >>>>>>>   // Type description
> >>>>>>>   const char* format;
> >>>>>>>   // Data description
> >>>>>>>   int64_t length;
> >>>>>>>   int64_t null_count;
> >>>>>>>   int64_t n_buffers;
> >>>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
> >>>>>>>   // and will be released and free()ed by the release callback.
> >>>>>>>   struct ArrowBuffer* buffers;
> >>>>>>>   struct ArrowArray* dictionary;
> >>>>>>>   // Called by the consumer when it doesn't need the array anymore
> >>>>>>>   void (*release)(struct ArrowArray*);
> >>>>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>>>   void* user_data;
> >>>>>>> };
> >>>>>>>
> >>>>>>> Thoughts?
> >>>>>>>
> >>>>>>> (*) For the record, the reference for the Python buffer protocol:
> >>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >>>>>>> and its C struct definition:
> >>>>>>>
> >>>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
>
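
On the "format" mini-language raised in the original message quoted above: a
minimal sketch of how trivially a one-character primitive format could be
parsed. This is only an illustration, not the format from the draft PR. It
assumes the characters are borrowed from Python's `struct` module with their
standard sizes, and the function name demo_format_width is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a one-character primitive format string to
   its byte width. The characters follow Python's `struct` module with
   standard sizes (an assumption made for illustration only).
   Returns 0 for anything that is not a known one-character primitive. */
size_t demo_format_width(const char* format) {
  if (format == NULL || format[0] == '\0' || format[1] != '\0') {
    return 0; /* empty, or longer than one character */
  }
  switch (format[0]) {
    case 'b': case 'B':           return 1; /* int8 / uint8 */
    case 'h': case 'H':           return 2; /* int16 / uint16 */
    case 'i': case 'I': case 'f': return 4; /* int32 / uint32 / float32 */
    case 'q': case 'Q': case 'd': return 8; /* int64 / uint64 / float64 */
    default:                      return 0; /* unknown or non-primitive */
  }
}
```

A consumer that only cares about primitive arrays could call this on the
`format` field and reject the array when the result is 0, without needing a
full parser for nested or parameterized types.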
