Hello,

One thing that was discussed in the sync call is the ability to easily
pass arrays at runtime between Arrow implementations or Arrow-supporting
libraries in the same process, without bearing the cost of linking to
e.g. the C++ Arrow library.

(for example: "Duckdb wants to provide an option to return Arrow data of
result sets, but they don't like having Arrow as a dependency")

One possibility would be to define a C-level protocol similar in spirit
to the Python buffer protocol, which some people may be familiar with (*).

The basic idea is to define a simple C struct, which is ABI-stable and
describes an Arrow away adequately.  The struct can be stack-allocated.
Its definition can also be copied in another project (or interfaced with
using a C FFI layer, depending on the language).

There is no formal proposal, this message is meant to stir the discussion.

Issues to work out:

* Memory lifetime issues: where Python simply associates the Py_buffer
with a PyObject owner (a garbage-collected Python object), we need
another means to control lifetime of pointed areas.  One simple
possibility is to include a destructor function pointer in the protocol
struct.

* Arrow type representation.  We probably need some kind of "format"
mini-language to represent Arrow types, so that a type can be described
using a `const char*`.  Ideally, primitives types at least should be
trivially parsable.  We may take inspiration from Python here (`struct`
module format characters, PEP 3118 format additions).

Example C struct definition (not a formal proposal!):

struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  // Called by the consumer when it doesn't need the buffer anymore
  void (*release)(struct ArrowBuffer*);
  // Opaque user data (for e.g. the release callback)
  void* user_data;
};

struct ArrowArray {
  // Type description
  const char* format;
  // Data description
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  // Note: this pointers are probably owned by the ArrowArray struct
  // and will be released and free()ed by the release callback.
  struct BufferDescriptor* buffers;
  struct ArrowDescriptor* dictionary;
  // Called by the consumer when it doesn't need the array anymore
  void (*release)(struct ArrowArrayDescriptor*);
  // Opaque user data (for e.g. the release callback)
  void* user_data;
};

Thoughts?

(*) For the record, the reference for the Python buffer protocol:
https://docs.python.org/3/c-api/buffer.html#buffer-structure
and its C struct definition:
https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195

Regards

Antoine.

Reply via email to