FlatCC seems germane: https://github.com/dvidelabs/flatcc
It compiles flatbuffer schemas down to (idiomatic?) C Perhaps the schema and batch serialization problems should be solved by storing everything in the flatbuffer format. Then the results of running flatcc plus a few simple helpers can be checked in to provide an accessible C API. With respect to lifetime, Antoine has already done good work on specifying a move only contract which could probably be adapted. On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou <anto...@python.org> wrote: > > One basic design point is to allow exchanging Arrow data with no > mandatory dependency (the exception is JSON and base64 if you want to > act on metadata - but that's highly optional, and those are extremely > widespread formats). I'm afraid that Flatbuffers may be a deterrent: > not only it introduces a library, but it requires the use of a compiler > to produce generated code. It also requires familiarizing with, well, > Flatbuffers :-) > > We can of course discuss this and feel it's not a problem. > > Regards > > Antoine. > > > Le 29/09/2019 à 19:47, Wes McKinney a écrit : > > There are two pieces of serialized data needed to communicate a record > > batch from one library to another > > > > * Serialized schema (i.e. what's in Schema.fbs) > > * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs > > > > You _do_ need to use a Flatbuffers library to fully create these > > message types to interact with any existing record batch disassembly / > > reassembly. > > > > I think I'm most concerned about having a new way to serialize > > schemas. We already have JSON-based schema serialization for > > integration test purposes, so one possibility is to standardize that > > and make it a more formalized part of the project specification. > > > > As far as a C protocol, I don't see an especial downside to using the > > Flatbuffers schema to communicate types. > > > > Another thought is to not deviate from the flattened > > Flatbuffers-styled representation but to translate the Flatbuffers > > types into C types: namely a C struct-based version of the > > "RecordBatch" message. > > > > Independent of the means to communicate the two pieces of serialized > > information above (respectively: schemas and record batch field memory > > addresses and field lengths), having a C-based FFI where project can > > drop in a header file containing the ABI they are supposed to > > implement, that seems pretty reasonable to me. > > > > If we don't define a standardized in-memory FFI (whether it uses the > > Flatbuffers objects as inputs/outputs or not) then downstream project > > will devise their own, and that will cause issues long term. > > > > On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <anto...@python.org> > wrote: > >> > >> > >> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit : > >>> * No dependency on Flatbuffers. > >>> * No buffer reassembly (data is already exposed in logical Arrow > format). > >>> * Zero-copy by design. > >>> * Easy to reimplement from scratch. > >>> > >>> I don't see how the flatbuffer pattern for data headers doesn't > accomplish > >>> all of these things. At its definition, is a very simple > representation of > >>> data that could be worked with independently of the flatbuffers > codebase. > >>> It was designed so systems could map directly into that memory without > >>> interacting with a flatbuffers library. > >>> > >>> Specifically the following three structures were designed to already > allow > >>> what I think this proposal is trying to recreate. All three are very > simple > >>> to construct in a direct, non-flatbuffer dependent read/write pattern. > >> > >> Are they? Personally, I wouldn't know how to do that. I don't know > >> which encoding Flatbuffers use, whether it's C ABI-compatible (how could > >> it be? if it's portable accross different platforms, then it's probably > >> not compatible with any particular platform's C ABI, or only as a > >> conincidence), how I'm supposed to make use of the "offset" field, or > >> what the lifetime / ownership of all this data is. > >> > >> I may be missing something, but if the answer is that it's easy to > >> reimplement Flatbuffers' encoding without relying on the Flatbuffers > >> project's source code, I'm a bit skeptical. > >> > >> Regards > >> > >> Antoine. > >> > >> > >>> > >>> struct FieldNode { > >>> length: long; > >>> null_count: long; > >>> } > >>> > >>> struct Buffer { > >>> offset: long; > >>> length: long; > >>> } > >>> > >>> table RecordBatch { > >>> length: long; > >>> nodes: [FieldNode]; > >>> buffers: [Buffer]; > >>> } > >>> > >>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> > wrote: > >>> > >>>> I'm not clear on why we need to introduce something beyond what > >>>> flatbuffers already provides. Can someone explain that to me? I'm not > >>>> really a fan of introducing a second representation of the same data > (as I > >>>> understand it). > >>>> > >>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> > wrote: > >>>> > >>>>> This is helpful, I will leave some comments on the proposal when I > >>>>> can, sometime in the next week. > >>>>> > >>>>> I agree that it would likely be opening a can of worms to create a > >>>>> semantic mapping between a generalized type grammar and Arrow's > >>>>> specific logical types defined in Schema.fbs. If we go down this > >>>>> route, we should probably utilize the simplest possible grammar that > >>>>> is capable of encoding the Type Flatbuffers union values. > >>>>> > >>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net> > >>>>> wrote: > >>>>>> > >>>>>> > >>>>>> I've posted a draft specification PR here, this should help orient > the > >>>>>> discussion a bit: > >>>>>> https://github.com/apache/arrow/pull/5442 > >>>>>> > >>>>>> Regards > >>>>>> > >>>>>> Antoine. > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Wed, 18 Sep 2019 19:52:38 +0200 > >>>>>> Antoine Pitrou <anto...@python.org> wrote: > >>>>>>> Hello, > >>>>>>> > >>>>>>> One thing that was discussed in the sync call is the ability to > easily > >>>>>>> pass arrays at runtime between Arrow implementations or > >>>>> Arrow-supporting > >>>>>>> libraries in the same process, without bearing the cost of linking > to > >>>>>>> e.g. the C++ Arrow library. > >>>>>>> > >>>>>>> (for example: "Duckdb wants to provide an option to return Arrow > data > >>>>> of > >>>>>>> result sets, but they don't like having Arrow as a dependency") > >>>>>>> > >>>>>>> One possibility would be to define a C-level protocol similar in > >>>>> spirit > >>>>>>> to the Python buffer protocol, which some people may be familiar > with > >>>>> (*). > >>>>>>> > >>>>>>> The basic idea is to define a simple C struct, which is ABI-stable > and > >>>>>>> describes an Arrow away adequately. The struct can be > >>>>> stack-allocated. > >>>>>>> Its definition can also be copied in another project (or interfaced > >>>>> with > >>>>>>> using a C FFI layer, depending on the language). > >>>>>>> > >>>>>>> There is no formal proposal, this message is meant to stir the > >>>>> discussion. > >>>>>>> > >>>>>>> Issues to work out: > >>>>>>> > >>>>>>> * Memory lifetime issues: where Python simply associates the > Py_buffer > >>>>>>> with a PyObject owner (a garbage-collected Python object), we need > >>>>>>> another means to control lifetime of pointed areas. One simple > >>>>>>> possibility is to include a destructor function pointer in the > >>>>> protocol > >>>>>>> struct. > >>>>>>> > >>>>>>> * Arrow type representation. We probably need some kind of > "format" > >>>>>>> mini-language to represent Arrow types, so that a type can be > >>>>> described > >>>>>>> using a `const char*`. Ideally, primitives types at least should > be > >>>>>>> trivially parsable. We may take inspiration from Python here > >>>>> (`struct` > >>>>>>> module format characters, PEP 3118 format additions). > >>>>>>> > >>>>>>> Example C struct definition (not a formal proposal!): > >>>>>>> > >>>>>>> struct ArrowBuffer { > >>>>>>> void* data; > >>>>>>> int64_t nbytes; > >>>>>>> // Called by the consumer when it doesn't need the buffer anymore > >>>>>>> void (*release)(struct ArrowBuffer*); > >>>>>>> // Opaque user data (for e.g. the release callback) > >>>>>>> void* user_data; > >>>>>>> }; > >>>>>>> > >>>>>>> struct ArrowArray { > >>>>>>> // Type description > >>>>>>> const char* format; > >>>>>>> // Data description > >>>>>>> int64_t length; > >>>>>>> int64_t null_count; > >>>>>>> int64_t n_buffers; > >>>>>>> // Note: this pointers are probably owned by the ArrowArray > struct > >>>>>>> // and will be released and free()ed by the release callback. > >>>>>>> struct BufferDescriptor* buffers; > >>>>>>> struct ArrowDescriptor* dictionary; > >>>>>>> // Called by the consumer when it doesn't need the array anymore > >>>>>>> void (*release)(struct ArrowArrayDescriptor*); > >>>>>>> // Opaque user data (for e.g. the release callback) > >>>>>>> void* user_data; > >>>>>>> }; > >>>>>>> > >>>>>>> Thoughts? > >>>>>>> > >>>>>>> (*) For the record, the reference for the Python buffer protocol: > >>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure > >>>>>>> and its C struct definition: > >>>>>>> > >>>>> > https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195 > >>>>>>> > >>>>>>> Regards > >>>>>>> > >>>>>>> Antoine. > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>> >