Hi Wes,

That indeed seems like a good fit for the pandas ExtensionArray <-> Arrow conversion. I will look into it starting this week.
Joris

On Fri, May 17, 2019 at 00:28, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Joris,
>
> Somewhat related to this, I want to also point out that we have C++
> extension types [1]. As part of this, it would also be good to define
> and document a public API for users to create ExtensionArray
> subclasses that can be serialized and deserialized using this
> machinery.
>
> As a motivating example, suppose that a Java application has a special
> data type that can be serialized as a Binary value in Arrow, and we
> want to be able to receive this special object as a pandas
> ExtensionArray column, which unboxes into a Python user space type.
>
> The ExtensionType can be implemented in Java, and then on the Python
> side the implementation can occur either in C++ or Python. An API will
> need to be defined to specify serializer functions for the pandas
> ExtensionArray to map the pandas-space type onto the Arrow-space
> type. Does this seem like a project you might be able to help drive
> forward? As a matter of sequencing, we do not yet have the capability
> to interact with C++ ExtensionType in Python, so we might need to
> first create callback machinery to enable Arrow extension types to be
> defined in Python (that call into the C++ ExtensionType registry).
>
> - Wes
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
>
> On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > On Thu, May 9, 2019 at 21:38, Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > +1 to the idea of adding a protocol to let other objects define their
> > > way to Arrow structures. For pandas.Series I would expect that they
> > > return an Arrow Column.
> > >
> > > For the Arrow->pandas conversion I have somewhat mixed feelings. In
> > > the normal Fletcher case I would expect that we don't convert anything
> > > as we represent anything from Arrow with it.
> >
> > Yes, you don't want to convert anything (apart from wrapping the arrow
> > array into a FletcherArray). But how does Table.to_pandas know that?
> > Maybe it doesn't need to know that. And then you might write a function
> > in fletcher to convert a pyarrow Table to a pandas DataFrame with
> > fletcher-backed columns. But if you want to have this roundtrip
> > automatically, without requiring each project that defines an
> > ExtensionArray and wants to interact with arrow (e.g. GeoPandas as well)
> > to have its own "arrow-table-to-pandas-dataframe" converter, pyarrow
> > needs to have some notion of how to convert back to a pandas
> > ExtensionArray.
> >
> > > For the case where we want to restore the exact pandas DataFrame we had
> > > before, this will become a bit more complicated, as we would either need
> > > to have all third-party libraries support Arrow via a hook as proposed,
> > > or we also define some kind of other protocol on the pandas side to
> > > reconstruct ExtensionArrays from Arrow data.
> >
> > That last one is basically what I proposed in
> > https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> >
> > Thanks Antoine and Uwe for the discussion!
> >
> > Joris
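
For illustration only, the roundtrip discussed in this thread (an ExtensionArray that knows how to produce Arrow data, plus a registry that lets the Arrow side find its way back to that ExtensionArray) could be sketched in plain Python roughly as follows. Every name here (ArrowLikeArray, PointArray, register_converter, to_pandas_column, the "example.point" type name) is a hypothetical stand-in, not pyarrow or pandas API; the actual design was still open at this point in the discussion.

```python
# Hypothetical sketch: none of these names are pyarrow or pandas API.
from typing import Callable, Dict, List, Tuple


class ArrowLikeArray:
    """Stand-in for an Arrow array: an extension type name plus binary storage."""

    def __init__(self, type_name: str, storage: List[bytes]):
        self.type_name = type_name
        self.storage = storage


class PointArray:
    """Stand-in for a pandas ExtensionArray holding (x, y) integer points."""

    def __init__(self, points: List[Tuple[int, int]]):
        self.points = list(points)

    def to_arrow(self) -> ArrowLikeArray:
        # pandas -> Arrow: serialize each point as a Binary-like value.
        storage = [f"{x},{y}".encode() for x, y in self.points]
        return ArrowLikeArray("example.point", storage)

    @classmethod
    def from_arrow(cls, arr: ArrowLikeArray) -> "PointArray":
        # Arrow -> pandas: unbox each binary value into a Python tuple.
        points = []
        for raw in arr.storage:
            x, y = raw.decode().split(",")
            points.append((int(x), int(y)))
        return cls(points)


# Registry mapping an extension type name to its "back to pandas" converter.
_REGISTRY: Dict[str, Callable[[ArrowLikeArray], object]] = {}


def register_converter(type_name: str, from_arrow: Callable) -> None:
    _REGISTRY[type_name] = from_arrow


def to_pandas_column(arr: ArrowLikeArray):
    """Use the registered converter if there is one, else return raw storage."""
    converter = _REGISTRY.get(arr.type_name)
    if converter is None:
        return list(arr.storage)
    return converter(arr)


register_converter("example.point", PointArray.from_arrow)
```

With such a registry, a third-party library would register its converter once, and a generic "table to DataFrame" function could restore the right ExtensionArray without each project shipping its own converter.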