Re: [Discuss] [Python] protocol for conversion to pyarrow Array

Wes McKinney Thu, 16 May 2019 15:29:11 -0700

hi Joris,

Somewhat related to this, I want to also point out that we have C++
extension types [1]. As part of this, it would also be good to define
and document a public API for users to create ExtensionArray
subclasses that can be serialized and deserialized using this
machinery.

As a motivating example, suppose that a Java application has a special
data type that can be serialized as a Binary value in Arrow, and we
want to be able to receive this special object as a pandas
ExtensionArray column that unboxes into a Python user-space type.

The ExtensionType can be implemented in Java, and then on the Python
side the implementation can occur either in C++ or Python. An API will
need to be defined to register serializer functions for the pandas
ExtensionArray, so that the pandas-space type can be mapped onto the
Arrow-space type. Does this seem like a project you might be able to
help drive forward? As a matter of sequencing, we do not yet have the
capability to interact with C++ ExtensionType in Python, so we might
need to first create callback machinery to enable Arrow extension types
to be defined in Python (that call into the C++ ExtensionType registry).
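
To make the shape of that callback machinery a bit more concrete, here is
a rough pure-Python sketch. Everything in it is hypothetical: the names
register_extension, serialize_column, deserialize_column and the
"example.point" type are made up for illustration and are not pyarrow or
pandas APIs, and a plain dict stands in for the C++ ExtensionType
registry. It only shows the idea of registering a pair of serializer
functions that map a user-space type to and from a Binary storage value.

```python
import struct
from typing import Callable, Dict, List, NamedTuple


class ExtensionCodec(NamedTuple):
    """Pair of callbacks mapping a user-space value to/from binary storage."""
    to_storage: Callable[[object], bytes]
    from_storage: Callable[[bytes], object]


# Hypothetical registry keyed by extension name; a stand-in for the
# C++ ExtensionType registry mentioned above.
_REGISTRY: Dict[str, ExtensionCodec] = {}


def register_extension(name: str,
                       to_storage: Callable[[object], bytes],
                       from_storage: Callable[[bytes], object]) -> None:
    """Register serializer callbacks for an extension type name."""
    if name in _REGISTRY:
        raise KeyError(f"extension type already registered: {name}")
    _REGISTRY[name] = ExtensionCodec(to_storage, from_storage)


def serialize_column(name: str, values: List[object]) -> List[bytes]:
    """Map user-space values onto their binary storage representation."""
    codec = _REGISTRY[name]
    return [codec.to_storage(v) for v in values]


def deserialize_column(name: str, storage: List[bytes]) -> List[object]:
    """Unbox binary storage values back into user-space objects."""
    codec = _REGISTRY[name]
    return [codec.from_storage(b) for b in storage]


class Point(NamedTuple):
    """Made-up user-space type, stored as two little-endian doubles."""
    x: float
    y: float


register_extension(
    "example.point",
    lambda p: struct.pack("<dd", p.x, p.y),
    lambda b: Point(*struct.unpack("<dd", b)),
)

points = [Point(1.0, 2.0), Point(-3.5, 0.25)]
storage = serialize_column("example.point", points)
restored = deserialize_column("example.point", storage)
```

In a real implementation the storage side would be an Arrow Binary array
and the user-space side a pandas ExtensionArray; the lists here are only
placeholders for both.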

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc

On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:
>
> On Thu, May 9, 2019 at 21:38, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > +1 to the idea of adding a protocol to let other objects define their way
> > to Arrow structures. For pandas.Series I would expect that they return an
> > Arrow Column.
> >
> > For the Arrow->pandas conversion I have somewhat mixed feelings. In the
> > normal Fletcher case I would expect that we don't convert anything, as we
> > represent everything from Arrow with it.
>
>
> Yes, you don't want to convert anything (apart from wrapping the arrow
> array into a FletcherArray). But how does Table.to_pandas know that?
> Maybe it doesn't need to know that. And then you might write a function in
> fletcher to convert a pyarrow Table to a pandas DataFrame with
> fletcher-backed columns. But if you want this roundtrip to happen
> automatically, without every project that defines an ExtensionArray and
> wants to interact with Arrow (e.g. GeoPandas as well) needing its own
> "arrow-table-to-pandas-dataframe" converter, pyarrow
> needs to have some notion of how to convert back to a pandas ExtensionArray.
>
>
> > For the case where we want to restore the exact pandas DataFrame we had
> > before, this becomes a bit more complicated: we would either need all
> > third-party libraries to support Arrow via a hook as proposed, or we
> > would also need to define some other protocol on the pandas side to
> > reconstruct ExtensionArrays from Arrow data.
> >
>
> That last one is basically what I proposed in
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>
> Thanks Antoine and Uwe for the discussion!
>
> Joris
