[ https://issues.apache.org/jira/browse/ARROW-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325111#comment-17325111 ]
Alessandro Molina edited comment on ARROW-9594 at 4/19/21, 3:11 PM:
--------------------------------------------------------------------

The issue seems to be caused by {{ConvertArrayToPandas}} returning {{-1}} for missing entries. When we map the indices to the values using {{np.take}}, those negative indices wrap around and end up picking the last value:

{code:python}
>>> d = np.array(['foo', 'bar'])
>>> i = np.array([ 0, 1, -1, 0])
>>> np.take(d, i)
array(['foo', 'bar', 'bar', 'foo'], dtype='<U3')
{code}

When converting to pandas this doesn't happen because {{pandas.Categorical}} already returns {{NaN}} for indices that point to a nonexistent value (https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html#pandas-categorical).

> [Python] DictionaryArray.to_numpy does not correctly convert null indexes to null values
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9594
>                 URL: https://issues.apache.org/jira/browse/ARROW-9594
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>            Reporter: Steve M. Kim
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Example
> {code:python}
> >>> a = pa.DictionaryArray.from_arrays(pa.array([0, 1, None, 0], type=pa.int32()), pa.array(['foo', 'bar']))
> >>> a
> <pyarrow.lib.DictionaryArray object at 0x7f12fc94ccf0>
>
> -- dictionary:
>   [
>     "foo",
>     "bar"
>   ]
> -- indices:
>   [
>     0,
>     1,
>     null,
>     0
>   ]
> >>> a.to_pandas()  # this works
> 0    foo
> 1    bar
> 2    NaN
> 3    foo
> dtype: category
> Categories (2, object): [foo, bar]
> >>> a.to_numpy(zero_copy_only=False)  # this is broken
> array(['foo', 'bar', 'bar', 'foo'], dtype=object)
> {code}
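
For illustration, the wrap-around described in the comment can be avoided by masking the negative indices before calling {{np.take}}. The following is only a NumPy-level sketch, not the actual Arrow fix: the helper name {{take_with_nulls}} and its {{null_value}} parameter are hypothetical, and it assumes a non-empty dictionary and that every negative index marks a missing entry.

```python
import numpy as np


def take_with_nulls(dictionary, indices, null_value=None):
    """Look up dictionary values by index, treating negative indices as nulls.

    Hypothetical helper for illustration only; assumes a non-empty dictionary.
    """
    # Use an object array so that the null placeholder can be stored in it.
    dictionary = np.asarray(dictionary, dtype=object)
    indices = np.asarray(indices)

    # Negative indices (e.g. -1) mark missing entries.
    missing = indices < 0

    # Point missing entries at slot 0 so np.take cannot wrap around.
    safe_indices = np.where(missing, 0, indices)
    out = np.take(dictionary, safe_indices)

    # Overwrite the placeholder lookups with the null value.
    out[missing] = null_value
    return out


d = np.array(['foo', 'bar'])
i = np.array([0, 1, -1, 0])
result = take_with_nulls(d, i)
print(result)
```

With the sample arrays from the comment, the third entry comes back as {{None}} instead of wrapping around to {{'bar'}}.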