Re: [Discuss] [Python] protocol for conversion to pyarrow Array

Uwe L. Korn Thu, 09 May 2019 12:39:19 -0700

+1 to the idea of adding a protocol that lets other objects define how they 
convert to Arrow structures. For a pandas.Series, I would expect it to return 
an Arrow Column.
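
To make the idea concrete, here is a minimal sketch of how such a hook could be dispatched, using the signature Joris proposes in the quoted message below. It deliberately avoids importing pyarrow: FakeArrowArray, NullableInts and to_arrow are hypothetical stand-ins (to_arrow plays the role of pyarrow.array(..)), so this illustrates the mechanism only and is not how pyarrow implements it.

```python
class FakeArrowArray:
    """Hypothetical stand-in for pyarrow.Array, so the sketch needs no pyarrow."""

    def __init__(self, values, type_name):
        self.values = list(values)
        self.type_name = type_name


class NullableInts:
    """Hypothetical third-party array: integers, with None for missing values."""

    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type=None):
        # The optional type lets the caller pick one of several valid conversions.
        return FakeArrowArray(self._values, type or "int64")


def to_arrow(obj, type=None):
    """Hypothetical dispatcher playing the role of pyarrow.array(..)."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        # The object knows how to convert itself; forward the requested type.
        return hook(type=type)
    # No hook: fall back to treating obj as a plain sequence.
    return FakeArrowArray(obj, type or "unknown")


default = to_arrow(NullableInts([1, None, 3]))
narrowed = to_arrow(NullableInts([1, None, 3]), type="int32")
```

The point of the design is that the converting library only needs to know the method name, not every third-party array class.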


For the Arrow->pandas conversion I have somewhat mixed feelings. In the normal 
Fletcher case I would expect that we don't convert anything, since Fletcher can 
represent anything from Arrow. For the case where we want to restore the exact 
pandas DataFrame we had before, this becomes a bit more complicated: either 
all third-party libraries would need to support Arrow via a hook as proposed, 
or we would also need to define some other protocol on the pandas side to 
reconstruct ExtensionArrays from Arrow data.
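
As a rough illustration of the second alternative, combined with the option Joris describes in the quoted message below (store the extension dtype name in metadata, then look it up on the way back), a registry-based roundtrip could look like this. Everything here is hypothetical: the registry, the "extension_dtype" metadata key, and the function names are assumptions for the sketch, not existing pandas or pyarrow APIs.

```python
import json

# Hypothetical registry: extension dtype name -> callable that rebuilds the array.
_EXTENSION_REGISTRY = {}


def register_extension_dtype(name, construct):
    """Register a reconstruction callable under an extension dtype name."""
    _EXTENSION_REGISTRY[name] = construct


def export_column(values, dtype_name):
    """Return plain data plus metadata bytes that record the dtype name."""
    metadata = json.dumps({"extension_dtype": dtype_name}).encode("utf-8")
    return list(values), metadata


def restore_column(data, metadata):
    """Rebuild the original array type if its dtype name is registered."""
    dtype_name = json.loads(metadata.decode("utf-8")).get("extension_dtype")
    construct = _EXTENSION_REGISTRY.get(dtype_name)
    if construct is None:
        # Unknown or unregistered dtype: hand back the plain data unchanged.
        return data
    return construct(data)


class MyInts:
    """Hypothetical extension array used only to exercise the roundtrip."""

    def __init__(self, values):
        self.values = list(values)


register_extension_dtype("my_ints", MyInts)
data, meta = export_column([1, None, 3], "my_ints")
restored = restore_column(data, meta)
```

With this shape, a library that is not installed or not registered simply gets the default conversion instead of an error.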

Uwe

> On 09.05.2019 at 18:20, Antoine Pitrou <anto...@python.org> wrote:
> 
> 
> Arrow arrays don't have metadata, so if you want to pass metadata around
> you should at least add a hook for columns as well.
> 
> Regards
> 
> Antoine.
> 
> 
>> On 09/05/2019 at 18:10, Joris Van den Bossche wrote:
>> An additional question might be at which "level" to provide such a hook to
>> third-party packages: I proposed it for Array, but what about chunked arrays,
>> columns or tables? Maybe returning a chunked array should at least also be
>> allowed.
>> 
>> On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <
>> jorisvandenboss...@gmail.com> wrote:
>> 
>>> The signature I had in mind is something like:
>>> 
>>> def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array:
>>> 
>>> where the function returns a pyarrow.Array and takes an optional data
>>> type (for cases where there are multiple ways to convert to a pyarrow Array;
>>> the user can pass it via the type argument of pyarrow.array(..) or via
>>> a specified schema).
>>> 
>>> But the above only covers the one-way path from a custom array to an Arrow
>>> array, which is not enough for a full roundtrip.
>>> 
>>> For a full roundtrip in the case of a pandas DataFrame, we will still need to
>>> save information in metadata independently of __arrow_array__ and have
>>> custom code in pyarrow to deal with pandas DataFrames (of which there is
>>> already a lot). I mentioned this briefly in
>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>>> / https://issues.apache.org/jira/browse/ARROW-2428. One option could
>>> be to save the name of the pandas extension dtype in the pandas_metadata of
>>> an Arrow Table (just as already happens for currently supported types). When
>>> exporting back to pandas with to_pandas, pyarrow could then check whether this
>>> extension dtype name is registered with pandas and, if so, call a method
>>> there to construct it.
>>> 
>>> Joris
>>> 
>>> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:
>>> 
>>>> 
>>>> Hi Joris,
>>>> 
>>>> Do you have a signature for the __arrow_array__ method in mind?
>>>> 
>>>> For example, let's say you want to roundtrip ExtensionArrays or other
>>>> third-party data through Arrow.  How do you preserve the required
>>>> metadata?
>>>> 
>>>> Regards
>>>> 
>>>> Antoine.
>>>> 
>>>> 
>>>>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
>>>>> Hi all,
>>>>> 
>>>>> I want to propose an interface to allow custom array objects in Python to
>>>>> define how they should be converted to Arrow arrays (e.g. in
>>>>> pyarrow.array(..)). I opened
>>>>> https://issues.apache.org/jira/browse/ARROW-5271 for this.
>>>>> This would be similar to the numpy __array__ protocol (so we could e.g.
>>>>> call it __arrow_array__).
>>>>> Feedback / discussion very welcome!
>>>>> 
>>>>> I am coming to this discussion specifically from the point of view of
>>>>> pandas ExtensionArrays (GitHub issue for this:
>>>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
>>>>> Such a protocol would, for example, make it possible for pandas users to
>>>>> save DataFrames with ExtensionArrays (e.g. the nullable integers) to
>>>>> parquet, without the need for pyarrow to know about all those possible
>>>>> different extension arrays. This would also be useful for projects extending
>>>>> pandas such as GeoPandas <https://github.com/geopandas/geopandas> and
>>>>> Fletcher <https://github.com/xhochy/fletcher>.
>>>>> But I suppose it could also be of more general interest to other
>>>>> array-like / pandas-like projects that want to interface with Arrow.
>>>>> 
>>>>> Sidenote: for the pandas case, I want to look at the full roundtrip, so
>>>>> also the conversion back from an Arrow Table to a DataFrame. For that aspect
>>>>> there is https://issues.apache.org/jira/browse/ARROW-2428, but this is much
>>>>> more specific to pandas and its ExtensionArrays.
>>>>> 
>>>>> Regards,
>>>>> Joris
>>>>> 
>>>> 
>>> 
>>
