Re: [DISCUSS] C-level in-process array protocol

Antoine Pitrou Sun, 29 Sep 2019 00:59:50 -0700


Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> * No dependency on Flatbuffers.
> * No buffer reassembly (data is already exposed in logical Arrow format).
> * Zero-copy by design.
> * Easy to reimplement from scratch.
> 
> I don't see how the flatbuffer pattern for data headers doesn't accomplish
> all of these things. At its definition, is a very simple representation of
> data that could be worked with independently of the flatbuffers codebase.
> It was designed so systems could map directly into that memory without
> interacting with a flatbuffers library.
> 
> Specifically the following three structures were designed to already allow
> what I think this proposal is trying to recreate. All three are very simple
> to construct in a direct, non-flatbuffer dependent read/write pattern.


Are they?  Personally, I wouldn't know how to do that.  I don't know
which encoding Flatbuffers use, whether it's C ABI-compatible (how could
it be? if it's portable accross different platforms, then it's probably
not compatible with any particular platform's C ABI, or only as a
conincidence), how I'm supposed to make use of the "offset" field, or
what the lifetime / ownership of all this data is.

I may be missing something, but if the answer is that it's easy to
reimplement Flatbuffers' encoding without relying on the Flatbuffers
project's source code, I'm a bit skeptical.

Regards

Antoine.


> 
> struct FieldNode {
>   length: long;
>   null_count: long;
> }
> 
> struct Buffer {
>   offset: long;
>   length: long;
> }
> 
> table RecordBatch {
>   length: long;
>   nodes: [FieldNode];
>   buffers: [Buffer];
> }
> 
> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote:
> 
>> I'm not clear on why we need to introduce something beyond what
>> flatbuffers already provides. Can someone explain that to me? I'm not
>> really a fan of introducing a second representation of the same data (as I
>> understand it).
>>
>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> This is helpful, I will leave some comments on the proposal when I
>>> can, sometime in the next week.
>>>
>>> I agree that it would likely be opening a can of worms to create a
>>> semantic mapping between a generalized type grammar and Arrow's
>>> specific logical types defined in Schema.fbs. If we go down this
>>> route, we should probably utilize the simplest possible grammar that
>>> is capable of encoding the Type Flatbuffers union values.
>>>
>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net>
>>> wrote:
>>>>
>>>>
>>>> I've posted a draft specification PR here, this should help orient the
>>>> discussion a bit:
>>>> https://github.com/apache/arrow/pull/5442
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>>
>>>> On Wed, 18 Sep 2019 19:52:38 +0200
>>>> Antoine Pitrou <anto...@python.org> wrote:
>>>>> Hello,
>>>>>
>>>>> One thing that was discussed in the sync call is the ability to easily
>>>>> pass arrays at runtime between Arrow implementations or
>>> Arrow-supporting
>>>>> libraries in the same process, without bearing the cost of linking to
>>>>> e.g. the C++ Arrow library.
>>>>>
>>>>> (for example: "Duckdb wants to provide an option to return Arrow data
>>> of
>>>>> result sets, but they don't like having Arrow as a dependency")
>>>>>
>>>>> One possibility would be to define a C-level protocol similar in
>>> spirit
>>>>> to the Python buffer protocol, which some people may be familiar with
>>> (*).
>>>>>
>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
>>>>> describes an Arrow away adequately.  The struct can be
>>> stack-allocated.
>>>>> Its definition can also be copied in another project (or interfaced
>>> with
>>>>> using a C FFI layer, depending on the language).
>>>>>
>>>>> There is no formal proposal, this message is meant to stir the
>>> discussion.
>>>>>
>>>>> Issues to work out:
>>>>>
>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
>>>>> with a PyObject owner (a garbage-collected Python object), we need
>>>>> another means to control lifetime of pointed areas.  One simple
>>>>> possibility is to include a destructor function pointer in the
>>> protocol
>>>>> struct.
>>>>>
>>>>> * Arrow type representation.  We probably need some kind of "format"
>>>>> mini-language to represent Arrow types, so that a type can be
>>> described
>>>>> using a `const char*`.  Ideally, primitives types at least should be
>>>>> trivially parsable.  We may take inspiration from Python here
>>> (`struct`
>>>>> module format characters, PEP 3118 format additions).
>>>>>
>>>>> Example C struct definition (not a formal proposal!):
>>>>>
>>>>> struct ArrowBuffer {
>>>>>   void* data;
>>>>>   int64_t nbytes;
>>>>>   // Called by the consumer when it doesn't need the buffer anymore
>>>>>   void (*release)(struct ArrowBuffer*);
>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>   void* user_data;
>>>>> };
>>>>>
>>>>> struct ArrowArray {
>>>>>   // Type description
>>>>>   const char* format;
>>>>>   // Data description
>>>>>   int64_t length;
>>>>>   int64_t null_count;
>>>>>   int64_t n_buffers;
>>>>>   // Note: this pointers are probably owned by the ArrowArray struct
>>>>>   // and will be released and free()ed by the release callback.
>>>>>   struct BufferDescriptor* buffers;
>>>>>   struct ArrowDescriptor* dictionary;
>>>>>   // Called by the consumer when it doesn't need the array anymore
>>>>>   void (*release)(struct ArrowArrayDescriptor*);
>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>   void* user_data;
>>>>> };
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> (*) For the record, the reference for the Python buffer protocol:
>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>>>>> and its C struct definition:
>>>>>
>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] C-level in-process array protocol

Reply via email to