No concerns from me either.

On Mon, Aug 19, 2019 at 5:10 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> No concern from me. It should probably be documented somewhere though :-)
>
> Regards
>
> Antoine.
>
>
> On 16/08/2019 at 17:23, Joris Van den Bossche wrote:
> > Coming back to this older thread, I have opened a PR with a proof of
> > concept of the proposed protocol to convert third-party array objects to
> > arrow: https://github.com/apache/arrow/pull/5106
> > In the tests, I added the protocol to pandas' nullable integer array (which
> > is currently not supported in the from_pandas conversion) and this now
> > converts nicely without many changes.
> >
> > Are there remaining concerns about such a protocol?
> >
> > --
> >
> > Note that the protocol is only for pandas -> arrow conversion (or other
> > array-like objects -> arrow). The other way around (arrow -> pandas) is
> > more complex and needs further discussion, and also involves the Arrow
> > ExtensionTypes (as mentioned below by Wes).
> > But I think the protocol will be useful in any case, and we can go ahead
> > with that already (for example, not all pandas ExtensionArrays will need to
> > map to an Arrow ExtensionType, eg the nullable integers simply map to
> > arrow's int64 or fletcher's ExtensionArrays which just wrap an arrow array).
> > That said, I have been working on the arrow ExtensionTypes over the last
> > few days, and have been keeping an overview of the issues and needed work
> > in this google document:
> > https://docs.google.com/document/d/1pr9PuBfXTdlUoAgyh9zPIKDJZalDLI6GuxqblMynMM8/edit?usp=sharing
> > (feel free to comment on it). There is also an initial PR to extend the
> > support for defining ExtensionTypes in Python (ARROW-5610
> > <https://issues.apache.org/jira/browse/ARROW-5610>,
> > https://github.com/apache/arrow/pull/5094).
> >
> > Joris
> >
> > On Fri, 17 May 2019 at 00:28, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> hi Joris,
> >>
> >> Somewhat related to this, I want to also point out that we have C++
> >> extension types [1]. As part of this, it would also be good to define
> >> and document a public API for users to create ExtensionArray
> >> subclasses that can be serialized and deserialized using this
> >> machinery.
> >>
> >> As a motivating example, suppose that a Java application has a special
> >> data type that can be serialized as a Binary value in Arrow, and we
> >> want to be able to receive this special object as a pandas
> >> ExtensionArray column, which unboxes into a Python user space type.
> >>
> >> The ExtensionType can be implemented in Java, and then on the Python
> >> side the implementation can occur either in C++ or Python. An API will
> >> need to be defined with serializer functions for the pandas
> >> ExtensionArray to map the pandas-space type onto the Arrow-space
> >> type. Does this seem like a project you might be able to help drive
> >> forward? As a matter of sequencing, we do not yet have the capability
> >> to interact with C++ ExtensionType in Python, so we might need to
> >> first create callback machinery to enable Arrow extension types to be
> >> defined in Python (that call into the C++ ExtensionType registry)
> >>
> >> - Wes
> >>
> >> [1]:
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
> >>
> >> On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
> >> <jorisvandenboss...@gmail.com> wrote:
> >>>
> >>> On Thu, 9 May 2019 at 21:38, Uwe L. Korn <uw...@xhochy.com> wrote:
> >>>
> >>>> +1 to the idea of adding a protocol to let other objects define their
> >>>> way to Arrow structures. For pandas.Series I would expect that they
> >>>> return an Arrow Column.
> >>>>
> >>>> For the Arrow->pandas conversion I have somewhat mixed feelings.
> >>>> In the normal Fletcher case I would expect that we don't convert
> >>>> anything, as we represent anything from Arrow with it.
> >>>
> >>>
> >>> Yes, you don't want to convert anything (apart from wrapping the arrow
> >>> array into a FletcherArray). But how does Table.to_pandas know that?
> >>> Maybe it doesn't need to know that. And then you might write a function
> >>> in fletcher to convert a pyarrow Table to a pandas DataFrame with
> >>> fletcher-backed columns. But if you want to have this roundtrip
> >>> automatically, without each project that defines an ExtensionArray and
> >>> wants to interact with arrow (eg in GeoPandas as well) needing its own
> >>> "arrow-table-to-pandas-dataframe" converter, pyarrow needs to have some
> >>> notion of how to convert back to a pandas ExtensionArray.
> >>>
> >>>
> >>>> For the case where we want to restore the exact pandas DataFrame we had
> >>>> before, this will become a bit more complicated, as we would either
> >>>> need all third-party libraries to support Arrow via a hook as proposed,
> >>>> or we also define some kind of other protocol on the pandas side to
> >>>> reconstruct ExtensionArrays from Arrow data.
> >>>>
> >>>
> >>> That last one is basically what I proposed in
> >>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> >>>
> >>> Thanks Antoine and Uwe for the discussion!
> >>>
> >>> Joris
> >>
> >
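
For readers skimming the thread, the pandas -> arrow direction discussed above can be illustrated roughly as follows. This is only a minimal sketch of the dispatch idea, not the code from the PR: the hook name `__arrow_array__` is an assumption (the thread itself does not name it), and `FakeArrowArray`, `NullableIntArray` and `to_arrow` are hypothetical stand-ins written in plain Python so the example runs without pyarrow or pandas.

```python
class FakeArrowArray:
    """Hypothetical stand-in for an Arrow array: values plus a type name."""

    def __init__(self, values, type_name):
        self.values = list(values)
        self.type_name = type_name


class NullableIntArray:
    """Toy third-party array: an integer buffer plus a missing-value mask."""

    def __init__(self, data, mask):
        self.data = list(data)
        self.mask = list(mask)  # True marks a missing value

    def __arrow_array__(self, type=None):
        # Assumed hook name: the object itself decides how to become Arrow data.
        values = [None if missing else v for v, missing in zip(self.data, self.mask)]
        return FakeArrowArray(values, type or "int64")


def to_arrow(obj, type=None):
    """Hypothetical converter: use the object's hook if it defines one."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook(type=type)
    # Fallback for plain sequences that do not implement the hook.
    return FakeArrowArray(obj, type or "unknown")


converted = to_arrow(NullableIntArray([1, 2, 3], [False, True, False]))
```

The point of the design is that the converter needs no knowledge of the third-party class. The reverse direction (arrow -> pandas) has no such object to ask, which is why it needs the separate discussion mentioned in the thread.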