In addition to Wes' reference to the Arrow C data interface, I think it is
also important to clarify which numpy protocol we are comparing against.

In numpy, you have the "array interface" (the `__array_interface__` property)
and the "array dunder method" (the `__array__` method). When people speak
about the "array protocol", I think they typically mean the first (although
the naming is easily confused). That is the one that exposes the actual
memory buffer (generalized by the Python buffer protocol). In practice,
though, many custom array-like containers (e.g. pandas, xarray, ...)
implement the second option, so that numpy knows how to convert the
container to a numpy array and operate on it.
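
To make the distinction concrete, here is a minimal sketch of both
mechanisms. The class names `RawBuffer` and `Container` are made up for
illustration, and the ctypes-backed storage is just one assumed way of
owning memory without depending on numpy.

```python
import ctypes
import sys

import numpy as np


class RawBuffer:
    """Hypothetical producer exposing its memory via __array_interface__."""

    def __init__(self, values):
        # Plain C array of int64 owned by ctypes; numpy is not needed here.
        self._buf = (ctypes.c_int64 * len(values))(*values)

    @property
    def __array_interface__(self):
        byteorder = "<" if sys.byteorder == "little" else ">"
        return {
            "version": 3,
            "shape": (len(self._buf),),
            "typestr": byteorder + "i8",
            # (pointer to the data, read-only flag)
            "data": (ctypes.addressof(self._buf), False),
        }


class Container:
    """Hypothetical array-like that converts itself via __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # The producer builds the ndarray itself, so it depends on numpy.
        return np.array(self._values, dtype=dtype)


raw = RawBuffer([1, 2, 3])
view = np.asarray(raw)   # numpy wraps the memory described by the dict
raw._buf[0] = 10         # a write on the producer side...
print(view[0])           # ...is visible through the numpy array

converted = np.asarray(Container([1, 2, 3]))
print(type(converted))
```

Only `Container` has to import numpy to take part; `RawBuffer` just
describes where its data lives.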

The `__array__` method also requires that an actual numpy.ndarray is
returned (this can be tested with a small example, or inferred from the code
<https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L2107-L2151>).
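
A small example of that check could look like the sketch below.
`BadContainer` is a hypothetical class, and catching both `TypeError` and
`ValueError` is a precaution on my side: the exact exception type may differ
between numpy versions.

```python
import numpy as np


class BadContainer:
    """Hypothetical container whose __array__ returns a list, not an ndarray."""

    def __array__(self, dtype=None, copy=None):
        return [1, 2, 3]


try:
    np.asarray(BadContainer())
except (TypeError, ValueError) as exc:
    rejected = True
    print("numpy rejected the return value:", exc)
else:
    rejected = False
```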

So the `__arrow_array__` method should be compared with numpy's `__array__`
method rather than with the `__array_interface__` property: regarding the
return type it works exactly the same way, in that the producer has to
return an actual pyarrow array. For an equivalent of numpy's
`__array_interface__` (or, more generally, the Python buffer protocol), it
is indeed correct to point to the Arrow C data interface.

Maybe it could make sense at some point to add an
"__arrow_array_interface__" dunder method to make it easier to expose this
from Python. But I am not very familiar with the details of how this could
work (currently a specific C struct is expected, and not a Python dict like
the numpy array interface).

Joris

On Sat, 12 Sep 2020 at 22:21, Wes McKinney <wesmck...@gmail.com> wrote:

> Adding dev@
>
> This is one purpose of the Arrow C data interface, which was developed
> after the __arrow_array__ protocol, and is worth investigating
>
>
> https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst
>
> On Sat, Sep 12, 2020 at 2:16 PM Marc Garcia <garcia.m...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I'm writing a document analyzing different options for a Python
> dataframe exchange protocol. And I wanted to ask a question regarding the
> __arrow_array__ protocol.
> >
> > I checked the code, and looks like the producer is expected to be
> sending an Arrow array, and the consumer just receives it. This is the code
> I'm checking, I guess it's the right one:
> https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L110
> >
> > Compared to the array interface (the NumPy buffer protocol), it works a
> bit differently. In the NumPy one, the producer exposes the pointer, the
> size... So, the producer doesn't need to depend on NumPy or any other
> library, and then the consumer can simply use `numpy.array(obj)` and
> generate the NumPy array. Or if other implementations support the protocol
> (not sure if they do), they could call something like
> `tensorflow.Tensor(obj)`, and NumPy would not be used at all.
> >
> > Am I understanding correctly the `__arrow_array__` protocol? And if I
> am, is there anything else similar to the NumPy protocol that can be used
> to exchange data without relying on a particular implementation?
> >
> > Thanks in advance!
>
