Rasmus Johansen created ARROW-17733:
---------------------------------------

             Summary: [C++] Concatenating dictionary arrays with nulls fills 
wrong parts of index buffer with 0.
                 Key: ARROW-17733
                 URL: https://issues.apache.org/jira/browse/ARROW-17733
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Rasmus Johansen


When concatenating dictionary arrays that contain nulls and whose index type is 
not 8 bits wide, the wrong bytes of the index buffer get zeroed out.

Example using pyarrow:
{code:java}
import pyarrow as pa
dictionary_type = pa.dictionary(pa.int16(), pa.string())
empty_array = pa.array([], dictionary_type)
array1 = pa.array(["a", "b", None], dictionary_type)
array2 = pa.concat_arrays([empty_array, array1])
print(array1.to_pylist())
print(array2.to_pylist())
{code}
We would expect array1 and array2 to be the same, but this prints:
{noformat}
['a', 'b', None]
['a', 'a', None]
{noformat}
 

This bug happens because the index type is 2 bytes wide, so the null at position 
2 should result in zeroing out bytes 4-5 (0-indexed) of the index buffer. 
However, the code instead zeroes out bytes 2-3, which hold the index for 
position 1 (turning "b" into "a"), because we don't take the width of the index 
type into account when adding the position here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/concatenate.cc#L314-L315
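
The offset arithmetic described above can be modelled in a few lines of plain 
Python. This is only a sketch of the reported behaviour, not the Arrow C++ 
code: the helper zero_null_slot and its scale_by_width flag are hypothetical, 
and the value 7 stored under the null slot is an arbitrary placeholder chosen 
so that the zeroing is visible.

```python
import struct


def zero_null_slot(buf: bytearray, position: int, width: int,
                   scale_by_width: bool) -> None:
    """Zero `width` bytes for the null at logical `position` (hypothetical helper)."""
    # The reported bug corresponds to using the logical position directly as a
    # byte offset; the intended behaviour scales it by the index width.
    offset = position * width if scale_by_width else position
    buf[offset:offset + width] = bytes(width)


# int16 indices for ["a", "b", None]; 7 is an arbitrary placeholder under the null.
indices = [0, 1, 7]
width = 2  # bytes per int16 index

buggy = bytearray(struct.pack("<3h", *indices))
zero_null_slot(buggy, position=2, width=width, scale_by_width=False)

fixed = bytearray(struct.pack("<3h", *indices))
zero_null_slot(fixed, position=2, width=width, scale_by_width=True)

buggy_indices = list(struct.unpack("<3h", buggy))
fixed_indices = list(struct.unpack("<3h", fixed))

print(buggy_indices)  # index for "b" is overwritten, null slot untouched
print(fixed_indices)  # only the null slot is zeroed
```

In the unscaled case the index at position 1 becomes 0, which matches the 
['a', 'a', None] output reported above.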



