Hi Antoine, I'm also interested in a stable ABI (I previously posted on this mailing list about the ABI issues I ran into [1]). Does having such an ABI-stable C struct imply that the Arrow C++ library will expose a set of C APIs? I think that would solve the inherent ABI issues caused by C++.
[1] https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E

On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:

>
> On 19/09/2019 at 09:39, Micah Kornfield wrote:
> > I like the idea of a stable ABI that can be used for in-process
> > communication. For instance, there was a recent question on
> > Stack Overflow on how to solve this [1].
> >
> > A couple of thoughts/questions:
> > * Would ArrowArray also need a self reference for child arrays?
>
> Yes, I forgot that. I also think we don't need a separate Buffer
> struct; instead the Array struct should own all its buffers.
>
> > * Should transferring key-value metadata be in scope?
>
> Yes. It could either be in the format string or a separate string. The
> upside of a separate string is that a consumer may ignore it trivially
> if it doesn't need the information.
>
> Another open question is for nested types: does the format string
> represent the entire type including children? Or must child types be
> read in the child arrays? If we mimic ArrayData, then the format
> string should represent the entire type; it will then be more complex to
> parse.
>
> We should also make sure that extension types fit in the protocol.
>
> > * Should the API more closely align with the IPC spec (pass a schema
> > separately and a list of buffers instead of individual arrays)?
>
> Then you have something that's not immediately usable (you have to do
> some processing to reconstitute the individual arrays). One goal here is
> to minimize implementation costs for producers and consumers. The
> assumption is a data model similar to the C++ ArrayData model; do we
> have implementations that use an entirely different model? Perhaps I
> should take a look :-)
>
> Note that the draft I posted only concerns arrays. We may also want to
> have a C struct for batches or tables.
>
> Regards
>
> Antoine.
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> >
> > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> Hello,
> >>
> >> One thing that was discussed in the sync call is the ability to easily
> >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> >> libraries in the same process, without bearing the cost of linking to
> >> e.g. the C++ Arrow library.
> >>
> >> (for example: "DuckDB wants to provide an option to return Arrow data of
> >> result sets, but they don't like having Arrow as a dependency")
> >>
> >> One possibility would be to define a C-level protocol similar in spirit
> >> to the Python buffer protocol, which some people may be familiar with (*).
> >>
> >> The basic idea is to define a simple C struct, which is ABI-stable and
> >> describes an Arrow array adequately. The struct can be stack-allocated.
> >> Its definition can also be copied into another project (or interfaced
> >> with using a C FFI layer, depending on the language).
> >>
> >> There is no formal proposal; this message is meant to stir discussion.
> >>
> >> Issues to work out:
> >>
> >> * Memory lifetime issues: where Python simply associates the Py_buffer
> >> with a PyObject owner (a garbage-collected Python object), we need
> >> another means to control the lifetime of pointed areas. One simple
> >> possibility is to include a destructor function pointer in the protocol
> >> struct.
> >>
> >> * Arrow type representation. We probably need some kind of "format"
> >> mini-language to represent Arrow types, so that a type can be described
> >> using a `const char*`. Ideally, primitive types at least should be
> >> trivially parsable. We may take inspiration from Python here (`struct`
> >> module format characters, PEP 3118 format additions).
> >>
> >> Example C struct definition (not a formal proposal!):
> >>
> >> #include <stdint.h>  // for int64_t
> >>
> >> struct ArrowBuffer {
> >>   void* data;
> >>   int64_t nbytes;
> >>   // Called by the consumer when it doesn't need the buffer anymore
> >>   void (*release)(struct ArrowBuffer*);
> >>   // Opaque user data (for e.g. the release callback)
> >>   void* user_data;
> >> };
> >>
> >> struct ArrowArray {
> >>   // Type description
> >>   const char* format;
> >>   // Data description
> >>   int64_t length;
> >>   int64_t null_count;
> >>   int64_t n_buffers;
> >>   // Note: these pointers are probably owned by the ArrowArray struct
> >>   // and will be released and free()ed by the release callback.
> >>   struct ArrowBuffer* buffers;
> >>   struct ArrowArray* dictionary;
> >>   // Called by the consumer when it doesn't need the array anymore
> >>   void (*release)(struct ArrowArray*);
> >>   // Opaque user data (for e.g. the release callback)
> >>   void* user_data;
> >> };
> >>
> >> Thoughts?
> >>
> >> (*) For the record, the reference for the Python buffer protocol:
> >> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >> and its C struct definition:
> >> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>
> >> Regards
> >>
> >> Antoine.
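
To make the lifetime discussion above concrete, here is a minimal sketch of how a producer and a consumer might use the draft structs. It is not part of any proposal. The function names (export_int32_array, release_int32_array, sum_int32_array) are hypothetical, the format code "i" for int32 is an assumption borrowed from Python's `struct` module, and the two-buffer layout (validity bitmap, then values) and the convention of clearing `release` after it runs are assumptions as well. The per-buffer release callbacks are left NULL because the array owns its buffers here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Draft structs from the thread, with the type names made consistent. */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Release callback: frees everything the producer allocated, then clears
   the callback to mark the struct as released (assumed convention). */
void release_int32_array(struct ArrowArray* array) {
  for (int64_t i = 0; i < array->n_buffers; ++i) {
    free(array->buffers[i].data); /* free(NULL) is a no-op */
  }
  free(array->buffers);
  array->buffers = NULL;
  array->release = NULL;
}

/* Hypothetical producer: copies `values` into a new array without nulls.
   Returns 0 on success, -1 on invalid length or allocation failure. */
int export_int32_array(const int32_t* values, int64_t length,
                       struct ArrowArray* out) {
  if (length < 0) return -1;
  /* calloc zeroes both descriptors, so buffer 0 (validity) stays NULL. */
  struct ArrowBuffer* buffers = calloc(2, sizeof *buffers);
  if (buffers == NULL) return -1;

  size_t nbytes = (size_t)length * sizeof(int32_t);
  if (nbytes > 0) {
    buffers[1].data = malloc(nbytes);
    if (buffers[1].data == NULL) {
      free(buffers);
      return -1;
    }
    memcpy(buffers[1].data, values, nbytes);
  }
  buffers[1].nbytes = (int64_t)nbytes;

  out->format = "i"; /* assumed format code for int32 */
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->dictionary = NULL;
  out->release = release_int32_array;
  out->user_data = NULL;
  return 0;
}

/* Hypothetical consumer: sums an int32 array, then releases it.
   Returns 0 if it does not recognize the format. */
int64_t sum_int32_array(struct ArrowArray* array) {
  int64_t total = 0;
  if (strcmp(array->format, "i") == 0 && array->n_buffers == 2) {
    const int32_t* values = (const int32_t*)array->buffers[1].data;
    for (int64_t i = 0; i < array->length; ++i) total += values[i];
  }
  /* The consumer is done with the data: hand it back to the producer. */
  if (array->release != NULL) array->release(array);
  return total;
}
```

A caller would stack-allocate a `struct ArrowArray`, pass it to export_int32_array(), and hand a pointer to the consumer. Neither side needs to link against the other's library; they only need to agree on the struct layout and on who calls `release`.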