Re: [Discuss] [Python] protocol for conversion to pyarrow Array

Antoine Pitrou Thu, 09 May 2019 09:20:50 -0700


Arrow arrays don't have metadata, so if you want to pass metadata around
you should at least add a hook for columns as well.


Regards

Antoine.


On 09/05/2019 at 18:10, Joris Van den Bossche wrote:
> An additional question might be at which "level" to provide such a hook to
> third-party packages: I proposed it for Array, but what about chunked arrays,
> columns or tables? Maybe at least returning a chunked array should also be
> allowed.
> 
> On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
> 
>> The signature I had in mind is something like:
>>
>> def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array:
>>
>> where the function returns a pyarrow.Array and takes an optional data
>> type (for cases where there are multiple ways to convert to a pyarrow
>> Array; this is what the user passes via the type argument of
>> pyarrow.array(..) or via a specified schema).
>>
>> But the above only covers the one-way path from a custom array to an Arrow
>> array, and is not enough for a full roundtrip.
>>
>> For a full roundtrip in case of a pandas DataFrame, we will still need to
>> save information in metadata independently from __arrow_array__ and have
>> custom code in pyarrow to deal with pandas DataFrames (of which there is
>> already a lot). I mentioned this briefly in
>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>> / https://issues.apache.org/jira/browse/ARROW-2428, but one option could
>> be to save the name of the pandas extension dtype in the pandas_metadata of
>> an arrow Table (just as already happens for currently supported types), and
>> when exporting back to pandas with to_pandas, pyarrow could check if this
>> extension dtype name is registered with pandas and, if so, call a method
>> there to construct it.
>>
>> Joris
>>
>> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:
>>
>>>
>>> Hi Joris,
>>>
>>> Do you have a signature for the __arrow_array__ method in mind?
>>>
>>> For example, let's say you want to roundtrip ExtensionArrays or other
>>> third-party data through Arrow.  How do you preserve the required
>>> metadata?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
>>>> Hi all,
>>>>
>>>> I want to propose an interface to allow custom array objects in Python
>>>> to define how they should be converted to Arrow arrays (e.g. in
>>>> pyarrow.array(..)). I opened
>>>> https://issues.apache.org/jira/browse/ARROW-5271 for this.
>>>> This would be similar to the numpy __array__ protocol (so we could,
>>>> e.g., call it __arrow_array__).
>>>> Feedback / discussion very welcome!
>>>>
>>>> I am coming to this discussion specifically from the point of view of
>>>> pandas ExtensionArrays (github issue for this:
>>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>>>> ).
>>>> Such a protocol would, for example, make it possible for pandas users
>>>> to save DataFrames with ExtensionArrays (e.g. the nullable integers) to
>>>> parquet, without the need for pyarrow to know about all those possible
>>>> different extension arrays. This would also be useful for projects
>>>> extending pandas such as GeoPandas <https://github.com/geopandas/geopandas>
>>>> and Fletcher <https://github.com/xhochy/fletcher>.
>>>> But I suppose it could also be of more general interest to other
>>>> array-like / pandas-like projects that want to interface with arrow.
>>>>
>>>> Sidenote: for the pandas case, I want to look at the full roundtrip, so
>>>> also the conversion back from an arrow Table to DataFrame. For that
>>>> aspect there is https://issues.apache.org/jira/browse/ARROW-2428, but
>>>> this is much more specific to pandas and its ExtensionArrays.
>>>>
>>>> Regards,
>>>> Joris
>>>>
>>>
>>
>
