Pierre Belzile created ARROW-8006:
-------------------------------------

             Summary: unsafe arrow dictionary recovered from parquet
                 Key: ARROW-8006
                 URL: https://issues.apache.org/jira/browse/ARROW-8006
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.15.1
            Reporter: Pierre Belzile


When an arrow dictionary of values=strings and indices=intx is written to 
parquet and recovered, the indices that correspond to null positions are not 
written. This causes two problems:
 * when transposing the dictionary, the code encounters indices that are out of 
bounds with the existing dictionary. This does cause crashes.
 * a potential security risk because it's unclear whether bytes can be read 
back inadvertently.

I traced using GDB and found that:
 # My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the 
valid bit is unset, that function increments "out" but does not set it. I think 
it should write a 0. 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
 # The recovered data "out" array is written to the dictionary builder using an 
AppendIndices which moves the memory without as a bulk move without checking 
for nulls. Hence we end-up with the indices buffer holding the "out" from 
above. 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670 
|https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670]When
 transpose runs on this 
([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]),
 it may attempt to access memory out of bounds.

While is would be possible to fix "transpose" and other functions that process 
dictionary indices (e.g. compare for sorting), it seems safer to initialize to 
0. Also that's the default behavior for the arrow dict builder when appending 
one or more nulls.

Incidentally the code recovers the dict with indices int32 instead of the 
original int8 but I guess that this is covered by another activity.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to