Joris Van den Bossche created ARROW-7591:
--------------------------------------------

             Summary: [Python] DictionaryArray.to_numpy returns dict of parts 
instead of numpy array
                 Key: ARROW-7591
                 URL: https://issues.apache.org/jira/browse/ARROW-7591
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Joris Van den Bossche


Currently, the {{to_numpy}} method doesn't return an ndarray in the case of 
dictionary-typed data:

{code}
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

In [55]: a
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x7f5c63d98f28>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
 'dictionary': array(['a', 'b'], dtype=object),
 'ordered': False}
{code}

This dict is actually just an internal representation that is passed from C++ 
to Python so that a {{pd.Categorical}} / {{CategoricalBlock}} can be 
constructed on the Python side, but it's not something we should return as 
such to the user. Rather, I think we should return a decoded / dense numpy 
array (or at least raise an error instead of returning this dict).
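
For illustration, a minimal sketch of what "decoded / dense" means here, using 
only NumPy on the parts shown in the output above. The helper name 
{{decode_dictionary}} is hypothetical and not part of pyarrow, and the sketch 
assumes there are no null indices:

```python
import numpy as np


def decode_dictionary(parts):
    """Expand dictionary-encoded parts into a dense array.

    Hypothetical helper for illustration only; assumes every index is a
    valid (non-null) position in the dictionary.
    """
    indices = np.asarray(parts["indices"])
    dictionary = np.asarray(parts["dictionary"])
    # Fancy indexing looks up each index in the dictionary of unique values.
    return dictionary[indices]


# The same parts as in Out[57] above.
parts = {
    "indices": np.array([0, 1, 0], dtype=np.int8),
    "dictionary": np.array(["a", "b"], dtype=object),
    "ordered": False,
}

dense = decode_dictionary(parts)
print(dense)
```

The result has one element per index, with the same object dtype as the 
dictionary values.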

(Also, if the user wants those parts, they are already available from the 
dictionary array as {{a.indices}}, {{a.dictionary}} and {{a.type.ordered}}.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
