Chang She (Jira) Wed, 21 Sep 2022 19:49:31 -0700

Chang She created ARROW-17813:
---------------------------------

             Summary: [Python] Nested ExtensionArray conversion to/from 
pandas/numpy
                 Key: ARROW-17813
                 URL: https://issues.apache.org/jira/browse/ARROW-17813
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 9.0.0
            Reporter: Chang She



user@ thread: [https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb]
repro gist: 
[https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9]

*Arrow => numpy/pandas*

For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to 
the storage type (as expected). However, this is not done for nested arrays:

{code:python}
import pyarrow as pa

class LabelType(pa.ExtensionType):

    def __init__(self):
        super(LabelType, self).__init__(pa.string(), "label")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return LabelType()
    
storage = pa.array(["dog", "cat", "horse"])
ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
offsets = pa.array([0, 1])
list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
list_arr.to_numpy()
{code}
{code:java}
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [15], line 1
----> 1 list_arr.to_numpy()

File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()

File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>
{code}

As mentioned on the user thread linked from the top, a fairly generic solution 
would just have the conversion default to the storage array's to_numpy.

 
*pandas/numpy => Arrow*

Similarly, conversion to Arrow is difficult for nested extension types.

Say I have a pandas DataFrame with a list-of-string column that I want to 
convert to a list-of-label Array. Currently I have to:
1. Convert the list-of-string (storage) numpy array to pa.list_(pa.string())
2. Convert the string values array to an ExtensionArray, then reconstitute a 
list<extension> array from that ExtensionArray and the offsets from the 
result of step 1

{code:python}
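import pyarrow as pa
import pandas as pd

# LabelType is the extension type defined in the first snippet above.
df = pd.DataFrame({'labels': [["dog", "horse", "cat"],
                              ["person", "person", "car", "car"]]})
# Step 1: convert the column to its storage form, list<string>.
list_of_storage = pa.array(df.labels)
# Step 2: wrap the flat string values, then reattach the original offsets.
ext_values = pa.ExtensionArray.from_storage(LabelType(), list_of_storage.values)
list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets,
                                       values=ext_values)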
{code}


For non-nested columns, one can get an easier conversion by defining a pandas 
extension dtype, but I don't think that works for a nested column. You would 
instead have to fall back to something like `pa.ExtensionArray.from_storage` (or 
`from_pandas`?) to do the trick. Even that doesn't necessarily work for 
something like a dictionary column, because you'd have to pass in the dictionary 
somehow. Off the cuff, perhaps `pa.Table.from_pandas` could accept a custom 
lambda that is applied to specified column names or data types?


Thanks in advance for the consideration!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
