Joris Van den Bossche created ARROW-7591:
--------------------------------------------
             Summary: [Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array
                 Key: ARROW-7591
                 URL: https://issues.apache.org/jira/browse/ARROW-7591
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Joris Van den Bossche


Currently, the {{to_numpy}} method doesn't return an ndarray in the case of dictionary-typed data:

{code}
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

In [55]: a
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x7f5c63d98f28>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
 'dictionary': array(['a', 'b'], dtype=object),
 'ordered': False}
{code}

This dict is just an internal representation that is passed from C++ to Python so that a {{pd.Categorical}} / {{CategoricalBlock}} can be constructed on the Python side; it's not something we should return as such to the user. Rather, I think we should return a decoded / dense numpy array (or at least raise an error instead of returning this dict).

(Also, if the user wants those parts, they are already available from the dictionary array as {{a.indices}}, {{a.dictionary}} and {{a.type.ordered}}.)
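To make "decoded / dense" concrete, here is a minimal sketch of the result I would expect, built with plain NumPy from the same parts shown in the output above. The helper name {{decode_parts}} is hypothetical and only for illustration; it is not an existing pyarrow function, and the sketch assumes there are no null indices.

```python
import numpy as np

# The dict of parts, copied from the Out[57] output in this report.
parts = {
    "indices": np.array([0, 1, 0], dtype=np.int8),
    "dictionary": np.array(["a", "b"], dtype=object),
    "ordered": False,
}


def decode_parts(parts):
    """Hypothetical helper: expand dictionary-encoded parts to a dense array.

    Assumes every index is valid (no nulls); this is an illustration,
    not pyarrow's implementation.
    """
    # Integer-array indexing picks one dictionary value per index.
    return parts["dictionary"][parts["indices"]]


dense = decode_parts(parts)
print(dense)
```

Here {{dense}} is an object-dtype ndarray with one value per element of the original array, which is what a user calling {{to_numpy}} would presumably expect. Null handling would need separate treatment and is not covered by this sketch.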