In addition to Wes' reference to the Arrow C data interface, I think it is also important to clarify some aspects of the numpy side of this comparison.

In numpy, you have the "array interface" (the `__array_interface__` property) and the "array dunder method" (the `__array__` method). When people speak about the array protocol, I think they typically mean the first (although the terminology is easily confusing). This is what exposes the actual memory buffer (a mechanism generalized by the Python buffer protocol).

But in practice, many custom array-like containers (e.g. pandas, xarray, ...) implement the second option, so that numpy knows how to convert the container to a numpy array and operate on it. The `__array__` method also requires that an actual numpy.ndarray is returned (this can be tested with a small example, or inferred from the code <https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L2107-L2151>).

So the `__arrow_array__` method should be compared with numpy's `__array__` method rather than with the `__array_interface__` property: regarding the return type, it works exactly the same way as `__array__`.

Then, for an equivalent of numpy's `__array_interface__` (or, more generally, the Python buffer protocol), it's indeed correct to point to the Arrow C data interface. Maybe it could make sense at some point to add an `__arrow_array_interface__` dunder method to make it easier to expose this from Python. But I am not very familiar with the details of how this could work (currently a specific C struct is expected, and not a Python dict as in the numpy array interface).

Joris

On Sat, 12 Sep 2020 at 22:21, Wes McKinney <wesmck...@gmail.com> wrote:

> Adding dev@
>
> This is one purpose of the Arrow C data interface, which was developed
> after the __arrow_array__ protocol, and worth investigating
>
> https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst
>
> On Sat, Sep 12, 2020 at 2:16 PM Marc Garcia <garcia.m...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I'm writing a document analyzing different options for a Python
> > dataframe exchange protocol. And I wanted to ask a question regarding the
> > __arrow_array__ protocol.
> >
> > I checked the code, and looks like the producer is expected to be
> > sending an Arrow array, and the consumer just receives it. This is the code
> > I'm checking, I guess it's the right one:
> > https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L110
> >
> > Compared to the array interface (the NumPy buffer protocol), it works a
> > bit differently. In the NumPy one, the producer exposes the pointer, the
> > size... So, the producer doesn't need to depend on NumPy or any other
> > library, and then the consumer can simply use `numpy.array(obj)` and
> > generate the NumPy array. Or if other implementations support the protocol
> > (not sure if they do), they could call something like
> > `tensorflow.Tensor(obj)`, and NumPy would not be used at all.
> >
> > Am I understanding correctly the `__arrow_array__` protocol? And if I
> > am, is there anything else similar to the NumPy protocol that can be used
> > to exchange data without relying on a particular implementation?
> >
> > Thanks in advance!
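
To make the difference between the two numpy protocols discussed in this thread concrete, here is a minimal sketch. All class and function names (`BufferProducer`, `ContainerProducer`, `read_float64`) are hypothetical and only for illustration. The first producer exposes raw memory through `__array_interface__` and never imports numpy; the second follows the `__array__` style (the one `__arrow_array__` resembles) and has to build the numpy array itself. The consumer is a toy stand-in for "any library that understands the dict", written with ctypes, and it assumes a one-dimensional float64 buffer in native byte order.

```python
import array
import ctypes
import sys

# Native-endian float64 type string, as used by the numpy array interface.
NATIVE_F8 = ("<" if sys.byteorder == "little" else ">") + "f8"


class BufferProducer:
    """Hypothetical producer: describes its memory, does not depend on numpy."""

    def __init__(self, values):
        # array.array("d") owns a contiguous C buffer of doubles.
        self._data = array.array("d", values)

    @property
    def __array_interface__(self):
        address, length = self._data.buffer_info()
        return {
            "version": 3,
            "shape": (length,),
            "typestr": NATIVE_F8,
            "data": (address, False),  # (pointer, read-only flag)
        }


class ContainerProducer:
    """Hypothetical producer in the __array__ style: it must create the
    numpy.ndarray itself, so it depends on numpy (imported lazily here)."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        import numpy as np

        return np.array(self._values, dtype=dtype)


def read_float64(obj):
    """Toy consumer: reads the memory described by __array_interface__
    using only ctypes (assumes 1-D, native-endian float64)."""
    iface = obj.__array_interface__
    if iface["typestr"] != NATIVE_F8:
        raise TypeError("this sketch only handles native float64")
    (length,) = iface["shape"]
    address, _read_only = iface["data"]
    # View the producer's buffer as a C array of doubles and copy it out.
    c_values = (ctypes.c_double * length).from_address(address)
    return list(c_values)


producer = BufferProducer([1.0, 2.5, 4.0])
print(read_float64(producer))  # [1.0, 2.5, 4.0]
```

With numpy installed, `numpy.asarray(producer)` should be able to consume the same dict, and `numpy.asarray(ContainerProducer([1.0, 2.5, 4.0]))` would go through `__array__` instead. The point of the sketch is only that in the first case the producer hands over a description of its memory, while in the second it hands over a finished array object of a specific library.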