hi Joris,

Somewhat related to this, I also want to point out that we have C++ extension types [1]. As part of this, it would also be good to define and document a public API for users to create ExtensionArray subclasses that can be serialized and deserialized using this machinery.
As a motivating example, suppose that a Java application has a special data type that can be serialized as a Binary value in Arrow, and we want to be able to receive this special object as a pandas ExtensionArray column, which unboxes into a Python user-space type. The ExtensionType can be implemented in Java, and then on the Python side the implementation can occur either in C++ or Python. An API will need to be defined for serializer functions that let the pandas ExtensionArray map the pandas-space type onto the Arrow-space type.

Does this seem like a project you might be able to help drive forward? As a matter of sequencing, we do not yet have the capability to interact with C++ ExtensionType in Python, so we might need to first create callback machinery to enable Arrow extension types to be defined in Python (that call into the C++ ExtensionType registry).

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc

On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> On Thu, May 9, 2019 at 21:38, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > +1 to the idea of adding a protocol to let other objects define their way
> > to Arrow structures. For pandas.Series I would expect that they return an
> > Arrow Column.
> >
> > For the Arrow->pandas conversion I have a bit mixed feelings. In the
> > normal Fletcher case I would expect that we don't convert anything as we
> > represent anything from Arrow with it.
> >
> Yes, you don't want to convert anything (apart from wrapping the arrow
> array into a FletcherArray). But how does Table.to_pandas know that?
> Maybe it doesn't need to know that. And then you might write a function in
> fletcher to convert a pyarrow Table to a pandas DataFrame with
> fletcher-backed columns.
> But if you want to have this roundtrip
> automatically, without the need that each project that defines an
> ExtensionArray and wants to interact with arrow (eg in GeoPandas as well)
> needs to have his own "arrow-table-to-pandas-dataframe" converter, pyarrow
> needs to have some notion of how to convert back to a pandas ExtensionArray.
> >
> > For the case where we want to restore the exact pandas DataFrame we had
> > before this will become a bit more complicated as we either would need to
> > have all third-party libraries to support Arrow via a hook as proposed or
> > we also define some kind of other protocol on the pandas side to
> > reconstruct ExtensionArrays from Arrow data.
> >
>
> That last one is basically what I proposed in
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>
> Thanks Antoine and Uwe for the discussion!
>
> Joris