Le 29/09/2019 à 06:10, Jacques Nadeau a écrit : > * No dependency on Flatbuffers. > * No buffer reassembly (data is already exposed in logical Arrow format). > * Zero-copy by design. > * Easy to reimplement from scratch. > > I don't see how the flatbuffer pattern for data headers doesn't accomplish > all of these things. At its definition, is a very simple representation of > data that could be worked with independently of the flatbuffers codebase. > It was designed so systems could map directly into that memory without > interacting with a flatbuffers library. > > Specifically the following three structures were designed to already allow > what I think this proposal is trying to recreate. All three are very simple > to construct in a direct, non-flatbuffer dependent read/write pattern.
Are they? Personally, I wouldn't know how to do that. I don't know which encoding Flatbuffers use, whether it's C ABI-compatible (how could it be? if it's portable accross different platforms, then it's probably not compatible with any particular platform's C ABI, or only as a conincidence), how I'm supposed to make use of the "offset" field, or what the lifetime / ownership of all this data is. I may be missing something, but if the answer is that it's easy to reimplement Flatbuffers' encoding without relying on the Flatbuffers project's source code, I'm a bit skeptical. Regards Antoine. > > struct FieldNode { > length: long; > null_count: long; > } > > struct Buffer { > offset: long; > length: long; > } > > table RecordBatch { > length: long; > nodes: [FieldNode]; > buffers: [Buffer]; > } > > On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote: > >> I'm not clear on why we need to introduce something beyond what >> flatbuffers already provides. Can someone explain that to me? I'm not >> really a fan of introducing a second representation of the same data (as I >> understand it). >> >> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote: >> >>> This is helpful, I will leave some comments on the proposal when I >>> can, sometime in the next week. >>> >>> I agree that it would likely be opening a can of worms to create a >>> semantic mapping between a generalized type grammar and Arrow's >>> specific logical types defined in Schema.fbs. If we go down this >>> route, we should probably utilize the simplest possible grammar that >>> is capable of encoding the Type Flatbuffers union values. >>> >>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net> >>> wrote: >>>> >>>> >>>> I've posted a draft specification PR here, this should help orient the >>>> discussion a bit: >>>> https://github.com/apache/arrow/pull/5442 >>>> >>>> Regards >>>> >>>> Antoine. >>>> >>>> >>>> >>>> On Wed, 18 Sep 2019 19:52:38 +0200 >>>> Antoine Pitrou <anto...@python.org> wrote: >>>>> Hello, >>>>> >>>>> One thing that was discussed in the sync call is the ability to easily >>>>> pass arrays at runtime between Arrow implementations or >>> Arrow-supporting >>>>> libraries in the same process, without bearing the cost of linking to >>>>> e.g. the C++ Arrow library. >>>>> >>>>> (for example: "Duckdb wants to provide an option to return Arrow data >>> of >>>>> result sets, but they don't like having Arrow as a dependency") >>>>> >>>>> One possibility would be to define a C-level protocol similar in >>> spirit >>>>> to the Python buffer protocol, which some people may be familiar with >>> (*). >>>>> >>>>> The basic idea is to define a simple C struct, which is ABI-stable and >>>>> describes an Arrow away adequately. The struct can be >>> stack-allocated. >>>>> Its definition can also be copied in another project (or interfaced >>> with >>>>> using a C FFI layer, depending on the language). >>>>> >>>>> There is no formal proposal, this message is meant to stir the >>> discussion. >>>>> >>>>> Issues to work out: >>>>> >>>>> * Memory lifetime issues: where Python simply associates the Py_buffer >>>>> with a PyObject owner (a garbage-collected Python object), we need >>>>> another means to control lifetime of pointed areas. One simple >>>>> possibility is to include a destructor function pointer in the >>> protocol >>>>> struct. >>>>> >>>>> * Arrow type representation. We probably need some kind of "format" >>>>> mini-language to represent Arrow types, so that a type can be >>> described >>>>> using a `const char*`. Ideally, primitives types at least should be >>>>> trivially parsable. We may take inspiration from Python here >>> (`struct` >>>>> module format characters, PEP 3118 format additions). >>>>> >>>>> Example C struct definition (not a formal proposal!): >>>>> >>>>> struct ArrowBuffer { >>>>> void* data; >>>>> int64_t nbytes; >>>>> // Called by the consumer when it doesn't need the buffer anymore >>>>> void (*release)(struct ArrowBuffer*); >>>>> // Opaque user data (for e.g. the release callback) >>>>> void* user_data; >>>>> }; >>>>> >>>>> struct ArrowArray { >>>>> // Type description >>>>> const char* format; >>>>> // Data description >>>>> int64_t length; >>>>> int64_t null_count; >>>>> int64_t n_buffers; >>>>> // Note: this pointers are probably owned by the ArrowArray struct >>>>> // and will be released and free()ed by the release callback. >>>>> struct BufferDescriptor* buffers; >>>>> struct ArrowDescriptor* dictionary; >>>>> // Called by the consumer when it doesn't need the array anymore >>>>> void (*release)(struct ArrowArrayDescriptor*); >>>>> // Opaque user data (for e.g. the release callback) >>>>> void* user_data; >>>>> }; >>>>> >>>>> Thoughts? >>>>> >>>>> (*) For the record, the reference for the Python buffer protocol: >>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure >>>>> and its C struct definition: >>>>> >>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195 >>>>> >>>>> Regards >>>>> >>>>> Antoine. >>>>> >>>> >>>> >>>> >>> >> >