Re: [DISCUSS] C-level in-process array protocol

Zhuo Peng Thu, 19 Sep 2019 09:22:40 -0700

Hi Antoine,

I'm also interested in a stable ABI (previously I posted on this mailing
list about the ABI issues I had [1]). Does having such an ABI-stable
C-struct imply that there will be a set of C-APIs exposed by the Arrow
(C++) library (which I think would lead to a solution to all the inherent
ABI issues caused by C++)?


[1]
https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E

On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> > I like the idea of a stable ABI that can be used for in-process
> > communication.  For instance, there was a recent question on
> > stack-overflow on how to solve this [1].
> >
> > A couple of thoughts/questions:
> > * Would ArrowArray also need a self reference for children arrays?
>
> Yes, I forgot that.  I also think we don't need a separate Buffer
> struct, instead the Array struct should own all its buffers.
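To make the two points above concrete, here is a minimal sketch of an array struct that owns its buffers directly and references its children. It is not part of the draft: the `n_children` and `children` fields, the helper names (`release_owned`, `make_int32_array`, `make_parent_with_child`) and the format strings "i" and "+s" are all assumptions made only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: no separate buffer struct, plus child references. */
struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  int64_t n_children;
  void** buffers;                /* owned by this struct */
  struct ArrowArray** children;  /* owned by this struct */
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Releases children first, then the buffers owned by the array itself. */
static void release_owned(struct ArrowArray* array) {
  assert(array != NULL);
  for (int64_t i = 0; i < array->n_children; ++i) {
    struct ArrowArray* child = array->children[i];
    if (child != NULL) {
      if (child->release != NULL) child->release(child);
      free(child);
    }
  }
  free(array->children);
  for (int64_t i = 0; i < array->n_buffers; ++i) free(array->buffers[i]);
  free(array->buffers);
  array->children = NULL;
  array->buffers = NULL;
  array->n_children = 0;
  array->n_buffers = 0;
  array->release = NULL;  /* marks the struct as released */
}

/* Fills `out` with a zero-filled int32 array. Returns 0 on success. */
static int make_int32_array(int64_t length, struct ArrowArray* out) {
  assert(out != NULL && length >= 0);
  void** buffers = calloc(2, sizeof(void*));
  void* values = calloc((size_t)length + 1, sizeof(int32_t));
  if (buffers == NULL || values == NULL) {
    free(buffers);
    free(values);
    return -1;
  }
  buffers[0] = NULL;    /* no validity bitmap, since there are no nulls */
  buffers[1] = values;
  *out = (struct ArrowArray){.format = "i", .length = length, .n_buffers = 2,
                             .buffers = buffers, .release = release_owned};
  return 0;
}

/* Fills `out` with a parent array holding a single int32 child. */
static int make_parent_with_child(int64_t length, struct ArrowArray* out) {
  assert(out != NULL);
  struct ArrowArray* child = malloc(sizeof(*child));
  struct ArrowArray** children = calloc(1, sizeof(*children));
  void** buffers = calloc(1, sizeof(void*));
  if (child == NULL || children == NULL || buffers == NULL ||
      make_int32_array(length, child) != 0) {
    free(child);
    free(children);
    free(buffers);
    return -1;
  }
  children[0] = child;
  *out = (struct ArrowArray){.format = "+s", .length = length, .n_buffers = 1,
                             .n_children = 1, .buffers = buffers,
                             .children = children, .release = release_owned};
  return 0;
}
```

With this shape a consumer only ever calls `release` on the top-level struct, which can itself live on the stack; everything reachable from it is freed by that one call.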
>
> > * Should transferring key-value metadata be in scope?
>
> Yes.  It could either be in the format string or a separate string.  The
> upside of a separate string is that a consumer may ignore it trivially
> if it doesn't need the information.
>
> Another open question is for nested types: does the format string
> represent the entire type including children?  Or must child types be
> read in the child arrays?  If we mimic ArrayData, then the format
> string should represent the entire type; it will then be more complex to
> parse.
>
> We should also make sure that extension types fit in the protocol.
>
> > * Should the API more closely align with the IPC spec (pass a schema
> > separately and a list of buffers instead of individual arrays)?
>
> Then you have something that's not immediately usable (you have to do some
> processing to reconstitute the individual arrays).  One goal here is to
> minimize implementation costs for producers and consumers.  The
> assumption is a data model similar to the C++ ArrayData model; do we
> have implementations that use an entirely different model?  Perhaps I
> should take a look :-)
>
> Note that the draft I posted only concerns arrays.  We may also want to
> have a C struct for batches or tables.
>
> Regards
>
> Antoine.
>
>
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> >
> > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> Hello,
> >>
> >> One thing that was discussed in the sync call is the ability to easily
> >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> >> libraries in the same process, without bearing the cost of linking to
> >> e.g. the C++ Arrow library.
> >>
> >> (for example: "Duckdb wants to provide an option to return Arrow data of
> >> result sets, but they don't like having Arrow as a dependency")
> >>
> >> One possibility would be to define a C-level protocol similar in spirit
> >> to the Python buffer protocol, which some people may be familiar with (*).
> >>
> >> The basic idea is to define a simple C struct, which is ABI-stable and
> >> describes an Arrow array adequately.  The struct can be stack-allocated.
> >> Its definition can also be copied in another project (or interfaced with
> >> using a C FFI layer, depending on the language).
> >>
> >> There is no formal proposal; this message is meant to stir the
> >> discussion.
> >>
> >> Issues to work out:
> >>
> >> * Memory lifetime issues: where Python simply associates the Py_buffer
> >> with a PyObject owner (a garbage-collected Python object), we need
> >> another means to control lifetime of pointed areas.  One simple
> >> possibility is to include a destructor function pointer in the protocol
> >> struct.
> >>
> >> * Arrow type representation.  We probably need some kind of "format"
> >> mini-language to represent Arrow types, so that a type can be described
> >> using a `const char*`.  Ideally, primitive types at least should be
> >> trivially parsable.  We may take inspiration from Python here (`struct`
> >> module format characters, PEP 3118 format additions).
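As an illustration of "trivially parsable", here is a rough sketch of a lookup for single-character primitive formats. The character set is an assumption borrowed from the standard sizes of Python's `struct` module ('b'/'B' 1 byte, 'h'/'H' 2, 'i'/'I'/'f' 4, 'q'/'Q'/'d' 8); nothing in the thread fixes these characters, and `primitive_byte_width` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte width of a primitive type described by a one-character
 * format string, or -1 if the string is not a recognised primitive.
 * The character-to-type mapping is hypothetical. */
static int primitive_byte_width(const char* format) {
  assert(format != NULL);
  /* Anything that is not exactly one character is treated as non-primitive. */
  if (format[0] == '\0' || format[1] != '\0') return -1;
  switch (format[0]) {
    case 'b': case 'B':
      return 1;  /* int8 / uint8 */
    case 'h': case 'H':
      return 2;  /* int16 / uint16 */
    case 'i': case 'I': case 'f':
      return 4;  /* int32 / uint32 / float32 */
    case 'q': case 'Q': case 'd':
      return 8;  /* int64 / uint64 / float64 */
    default:
      return -1;
  }
}
```

A consumer that only handles primitives could call this on `format` and reject the array whenever it gets -1, without implementing the rest of the mini-language.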
> >>
> >> Example C struct definition (not a formal proposal!):
> >>
> >> struct ArrowBuffer {
> >>   void* data;
> >>   int64_t nbytes;
> >>   // Called by the consumer when it doesn't need the buffer anymore
> >>   void (*release)(struct ArrowBuffer*);
> >>   // Opaque user data (for e.g. the release callback)
> >>   void* user_data;
> >> };
> >>
> >> struct ArrowArray {
> >>   // Type description
> >>   const char* format;
> >>   // Data description
> >>   int64_t length;
> >>   int64_t null_count;
> >>   int64_t n_buffers;
> >>   // Note: these pointers are probably owned by the ArrowArray struct
> >>   // and will be released and free()ed by the release callback.
> >>   struct ArrowBuffer* buffers;
> >>   struct ArrowArray* dictionary;
> >>   // Called by the consumer when it doesn't need the array anymore
> >>   void (*release)(struct ArrowArray*);
> >>   // Opaque user data (for e.g. the release callback)
> >>   void* user_data;
> >> };
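For what it's worth, here is how I would picture the release callback and `user_data` working together on the buffer struct from the draft. This is only a sketch under my own assumptions: `ProducerState`, `export_int64_range` and `consume_sum` are hypothetical names, and the producer is assumed to allocate with malloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  /* Called by the consumer when it doesn't need the buffer anymore */
  void (*release)(struct ArrowBuffer*);
  /* Opaque user data (for e.g. the release callback) */
  void* user_data;
};

/* Hypothetical producer-side bookkeeping, reached through user_data. */
struct ProducerState {
  int live_buffers;
};

/* Producer-supplied callback: frees the memory and updates the bookkeeping. */
static void release_buffer(struct ArrowBuffer* buffer) {
  assert(buffer != NULL);
  struct ProducerState* state = buffer->user_data;
  free(buffer->data);
  buffer->data = NULL;
  buffer->nbytes = 0;
  buffer->release = NULL;  /* marks the buffer as released */
  if (state != NULL) state->live_buffers--;
}

/* Producer: exports the values 0..n-1 as int64. Returns 0 on success. */
static int export_int64_range(struct ProducerState* state, int64_t n,
                              struct ArrowBuffer* out) {
  assert(state != NULL && out != NULL && n >= 0);
  int64_t* values = malloc((size_t)(n > 0 ? n : 1) * sizeof(int64_t));
  if (values == NULL) return -1;
  for (int64_t i = 0; i < n; ++i) values[i] = i;
  out->data = values;
  out->nbytes = n * (int64_t)sizeof(int64_t);
  out->release = release_buffer;
  out->user_data = state;
  state->live_buffers++;
  return 0;
}

/* Consumer: reads the data, then hands the memory back via release().
 * It never needs to know how the producer allocated the buffer. */
static int64_t consume_sum(struct ArrowBuffer* buffer) {
  assert(buffer != NULL);
  const int64_t* values = buffer->data;
  int64_t n = buffer->nbytes / (int64_t)sizeof(int64_t);
  int64_t total = 0;
  for (int64_t i = 0; i < n; ++i) total += values[i];
  if (buffer->release != NULL) buffer->release(buffer);
  return total;
}
```

The consumer side only depends on the struct definition, which is the point of the exercise: it can be copied into a project without linking against the producer's library.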
> >>
> >> Thoughts?
> >>
> >> (*) For the record, the reference for the Python buffer protocol:
> >> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >> and its C struct definition:
> >>
> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>
