[ 
https://issues.apache.org/jira/browse/ARROW-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325111#comment-17325111
 ] 

Alessandro Molina edited comment on ARROW-9594 at 4/19/21, 3:11 PM:
--------------------------------------------------------------------

The issue seems to be caused by {{ConvertArrayToPandas}} returning {{-1}} for 
missing entries. 
When we map the values to the indices using {{np.take}}, those negative 
indices wrap around and end up picking the last value:

{code:python}
>>> import numpy as np
>>> d = np.array(['foo', 'bar'])
>>> i = np.array([ 0,  1, -1,  0])
>>> np.take(d, i)
array(['foo', 'bar', 'bar', 'foo'], dtype='<U3')
{code}
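
One possible way to avoid the wrap-around, shown only as a minimal sketch and not as the fix that was applied in Arrow: mask the negative indices before calling {{np.take}} and fill those positions with a null marker afterwards. The helper name {{take_with_nulls}} and its {{null_value}} parameter are hypothetical, and the sketch assumes a non-empty dictionary.

```python
import numpy as np


def take_with_nulls(dictionary, indices, null_value=None):
    """Map indices to dictionary values, treating negative indices as missing.

    Hypothetical helper for illustration; assumes `dictionary` is non-empty.
    """
    indices = np.asarray(indices)
    missing = indices < 0
    # Point missing entries at slot 0 so np.take never wraps around.
    safe = np.where(missing, 0, indices)
    # An object array can hold the null marker next to the real values.
    result = np.take(np.asarray(dictionary, dtype=object), safe)
    # Overwrite the placeholder values with the null marker.
    result[missing] = null_value
    return result


d = np.array(['foo', 'bar'])
i = np.array([0, 1, -1, 0])
out = take_with_nulls(d, i)
print(out)
```

The result has {{dtype=object}}, so the missing slot holds {{None}} instead of repeating the last dictionary value.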

When converting to pandas this doesn't happen because {{pandas.Categorical}} 
already returns {{NaN}} for indices that point to a nonexistent value 
(https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html#pandas-categorical).
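
The pandas behaviour can be illustrated with a short sketch, assuming the codes arrive with {{-1}} for missing entries as described above. It uses {{pandas.Categorical.from_codes}}, which treats {{-1}} as the missing-value sentinel; whether the Arrow conversion goes through this exact call is an assumption.

```python
import numpy as np
import pandas as pd

# Same indices as in the np.take example, with -1 marking the missing entry.
codes = np.array([0, 1, -1, 0])
cat = pd.Categorical.from_codes(codes, categories=['foo', 'bar'])

# The -1 code becomes a missing value instead of wrapping to 'bar'.
missing_mask = cat.isna()
print(cat)
print(missing_mask)
```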






> [Python] DictionaryArray.to_numpy does not correctly convert null indexes to 
> null values
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9594
>                 URL: https://issues.apache.org/jira/browse/ARROW-9594
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>            Reporter: Steve M. Kim
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Example
> {code:python}
> >>> import pyarrow as pa
> >>> a = pa.DictionaryArray.from_arrays(pa.array([0, 1, None, 0], 
> ...     type=pa.int32()), pa.array(['foo', 'bar']))
> >>> a
> <pyarrow.lib.DictionaryArray object at 0x7f12fc94ccf0>
> -- dictionary:
>   [
>     "foo",
>     "bar"
>   ]
> -- indices:
>   [
>     0,
>     1,
>     null,
>     0
>   ]
> >>> a.to_pandas()  # this works
> 0    foo
> 1    bar
> 2    NaN
> 3    foo
> dtype: category
> Categories (2, object): [foo, bar]
> >>> a.to_numpy(zero_copy_only=False)  # this is broken
> array(['foo', 'bar', 'bar', 'foo'], dtype=object)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
