Brent Kerby created ARROW-2515:
----------------------------------

             Summary: Errors with DictionaryArray inside of ListArray or other 
DictionaryArray
                 Key: ARROW-2515
                 URL: https://issues.apache.org/jira/browse/ARROW-2515
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Brent Kerby


An exception ("KeyError: 26") is raised when .as_py() is called on elements of 
a ListArray over a DictionaryArray, or of a DictionaryArray with values in a 
DictionaryArray. Here are two tests that currently fail:

 
{code:python}
import pyarrow as pa

def test_dictionary_array_1():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
    assert list_arr.to_pylist() == [['a', 'b'], ['a']]

def test_dictionary_array_2():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
    assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
{code}
The problem appears to be that the function box_scalar in scalar.pxi does not 
handle dictionary arrays, because there is currently no DictionaryValue type. 

 

DictionaryArray.__getitem__ currently works around the lack of a 
DictionaryValue type by dereferencing the index and constructing a scalar from 
the value in the underlying dictionary. In other words, if we have a dictionary 
with int8 indices and string values, then the result of __getitem__ will be a 
StringValue (rather than a DictionaryValue). This works in simple cases but not 
in the more complex scenarios illustrated above.
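
To make the failure mode concrete, here is a minimal pure-Python model of the 
situation. It is not pyarrow's code: ToyStringArray, ToyDictArray, ToyListArray 
and the BOXERS table are hypothetical stand-ins, and the assumption is only that 
box_scalar picks a scalar type by looking up the array's type and has no entry 
for dictionary types.

```python
# Toy model only: every name here is a hypothetical stand-in, not a pyarrow API.

# Assumed shape of the dispatch inside box_scalar: type name -> boxing function.
# There is deliberately no "dictionary" entry, mirroring the missing
# DictionaryValue type.
BOXERS = {"string": str}


def box_scalar(array, i):
    """Box element i of `array` by looking up the array's type in BOXERS."""
    return BOXERS[array.type](array.raw(i))


class ToyStringArray:
    type = "string"

    def __init__(self, values):
        self.values = list(values)

    def raw(self, i):
        return self.values[i]


class ToyDictArray:
    type = "dictionary"

    def __init__(self, indices, dictionary):
        self.indices = list(indices)
        self.dictionary = dictionary

    def raw(self, i):
        return self.indices[i]

    def __getitem__(self, i):
        # Special-cased workaround: dereference the index, then box the
        # looked-up element of the underlying dictionary.
        return box_scalar(self.dictionary, self.indices[i])


class ToyListArray:
    type = "list"

    def __init__(self, offsets, values):
        self.offsets = list(offsets)
        self.values = values

    def __getitem__(self, i):
        # Child elements go straight through box_scalar, so the workaround in
        # ToyDictArray.__getitem__ is never reached.
        start, stop = self.offsets[i], self.offsets[i + 1]
        return [box_scalar(self.values, j) for j in range(start, stop)]


strings = ToyStringArray(["a", "b"])
dict_arr = ToyDictArray([0, 1, 0], strings)
list_arr = ToyListArray([0, 2, 3], dict_arr)
dict_arr2 = ToyDictArray([0, 1, 2, 1, 0], dict_arr)

simple = dict_arr[1]  # simple case: the workaround yields a plain string

errors = []
for container in (list_arr, dict_arr2):
    try:
        container[0]
    except KeyError as exc:
        # Both nested cases hit the missing "dictionary" entry.
        errors.append(exc.args[0])
```

In this model the simple lookup succeeds, while both nested containers fail 
with a KeyError on the missing dictionary entry, which is the same pattern as 
the two failing tests.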

I have a patch ready that adds a DictionaryValue type similar to the other 
scalar types, resolving these bugs and removing the need for a special-cased 
implementation of DictionaryArray.__getitem__. This DictionaryValue would 
have two accessor properties, "indices_value" and "dictionary_value", giving 
access to both the index into the dictionary and the looked-up value. 
DictionaryValue.as_py() would then simply call .as_py() on the underlying 
dictionary_value. 
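
As a rough illustration of the proposed design, the following pure-Python 
sketch shows how such a scalar could behave. It is not the patch itself: only 
the names DictionaryValue, indices_value, dictionary_value and as_py come from 
the description above, while PrimitiveValue, the constructor arguments and the 
use of plain lists in place of arrays are assumptions made for the example.

```python
# Hypothetical sketch of the proposed scalar; not the actual patch.


class PrimitiveValue:
    """Assumed stand-in for existing scalars such as StringValue or Int8Value."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


class DictionaryValue:
    """Scalar for element `i` of a dictionary-encoded array.

    `indices` is a sequence of ints and `dictionary` is a sequence of scalars
    (anything with an as_py() method, including other DictionaryValues).
    """

    def __init__(self, indices, dictionary, i):
        self._indices = indices
        self._dictionary = dictionary
        self._i = i

    @property
    def indices_value(self):
        # The index into the dictionary, boxed as a scalar.
        return PrimitiveValue(self._indices[self._i])

    @property
    def dictionary_value(self):
        # The looked-up scalar from the underlying dictionary.
        return self._dictionary[self.indices_value.as_py()]

    def as_py(self):
        # Delegate to the looked-up value, so nesting resolves recursively.
        return self.dictionary_value.as_py()


# Mirror of the two failing tests, with lists standing in for arrays.
base = [PrimitiveValue("a"), PrimitiveValue("b")]
inner = [DictionaryValue([0, 1, 0], base, i) for i in range(3)]
outer = [DictionaryValue([0, 1, 2, 1, 0], inner, i) for i in range(5)]

offsets = [0, 2, 3]
as_lists = [
    [v.as_py() for v in inner[offsets[k]:offsets[k + 1]]]
    for k in range(len(offsets) - 1)
]
as_nested = [v.as_py() for v in outer]
```

Because as_py() delegates to dictionary_value, a dictionary whose values are 
themselves dictionary-encoded resolves without any special casing in the 
containing array.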



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
