[ https://issues.apache.org/jira/browse/ARROW-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche resolved ARROW-17813.
-------------------------------------------
    Fix Version/s: 10.0.0
       Resolution: Fixed

Issue resolved by pull request 14238
[https://github.com/apache/arrow/pull/14238]

> [Python] Nested ExtensionArray conversion to/from pandas/numpy
> --------------------------------------------------------------
>
>                 Key: ARROW-17813
>                 URL: https://issues.apache.org/jira/browse/ARROW-17813
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Chang She
>            Assignee: Miles Granger
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> user@ thread:
> [https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb]
> repro gist:
> [https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9]
>
> *Arrow => numpy/pandas*
> For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to
> the storage type (as expected). However, this is not done for nested arrays:
> {code:python}
> import pyarrow as pa
>
> class LabelType(pa.ExtensionType):
>     def __init__(self):
>         super(LabelType, self).__init__(pa.string(), "label")
>
>     def __arrow_ext_serialize__(self):
>         return b""
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return LabelType()
>
> storage = pa.array(["dog", "cat", "horse"])
> ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
> offsets = pa.array([0, 1])
> list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
> list_arr.to_numpy()
> {code}
> {code:java}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> Cell In [15], line 1
> ----> 1 list_arr.to_numpy()
> File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()
> File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>
> {code}
> As mentioned on the user thread linked at the top, a fairly generic
> solution would be to have the conversion default to the storage array's
> to_numpy.
>
> *pandas/numpy => Arrow*
> Likewise, conversion to Arrow is also difficult for nested extension
> types. Say I have a pandas DataFrame with a list-of-string column that I
> want to convert to a list-of-label Array. Currently I have to:
> 1. Convert the list-of-string (storage) numpy array to a
> pa.list_(pa.string()) array
> 2. Convert the string values array to an ExtensionArray, then reconstitute a
> list<extension> array by combining the ExtensionArray with the offsets from
> the result of step 1
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
> df = pd.DataFrame({'labels': [["dog", "horse", "cat"],
>                               ["person", "person", "car", "car"]]})
> list_of_storage = pa.array(df.labels)
> ext_values = pa.ExtensionArray.from_storage(LabelType(),
>                                             list_of_storage.values)
> list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets,
>                                        values=ext_values)
> {code}
> For non-nested columns, one can achieve easier conversion by defining a
> pandas extension dtype, but I don't think that works for a nested column. You
> would instead have to fall back to something like
> `pa.ExtensionArray.from_storage` (or `from_pandas`?) to do the trick. Even
> that doesn't necessarily work for something like a dictionary column, because
> you'd have to pass in the dictionary somehow. Off the cuff, could one provide
> a custom lambda to `pa.Table.from_pandas` that is applied to specified
> column names or data types?
> Thanks in advance for the consideration!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)