Matt Jadczak created ARROW-10246:
------------------------------------

             Summary: [Python] Incorrect conversion of Arrow dictionary to 
Parquet dictionary when duplicate values are present
                 Key: ARROW-10246
                 URL: https://issues.apache.org/jira/browse/ARROW-10246
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Matt Jadczak


Copying this from [the mailing 
list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]

We can observe the following odd behaviour when round-tripping data through 
Parquet using pyarrow, when the data contains dictionary arrays with duplicate 
dictionary values.

 
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

my_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                pa.array([0, 1, 2, 3, 4]),
                pa.DictionaryArray.from_arrays(
                    pa.array([0, 1, 2, 3, 4]),
                    pa.array(['a', 'd', 'c', 'd', 'e'])
                )
            ],
            names=['foo', 'bar']
        )
    ]
)
my_table.validate(full=True)
pq.write_table(my_table, "foo.parquet")
read_table = pq.ParquetFile("foo.parquet").read()
read_table.validate(full=True)
print(my_table.column(1).to_pylist())
print(read_table.column(1).to_pylist())
assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
{code}
Both tables pass full validation, yet the last three lines print:


{code:none}
['a', 'd', 'c', 'd', 'e']
['a', 'd', 'c', 'e', 'a']
Traceback (most recent call last):
  File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
    assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
AssertionError
{code}
This clearly doesn't look right!

 

It seems to me that this happens because, when re-encoding an Arrow dictionary 
as a Parquet one, the function at

[https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]

is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
Internally this uses a map from value to index, and the map is built by 
repeatedly calling GetOrInsert on a memo table. When it is called with 
duplicate values, as in Al's example, a duplicate does not cause a new 
dictionary index to be allocated; instead the existing index is returned (and 
then simply ignored). However, the caller assumes that the resulting Parquet 
dictionary uses exactly the same indices as the Arrow one, and proceeds to 
copy the index data across directly. In Al's example this results in an 
invalid dictionary index being written (the fact that it somehow wraps around 
when read back, rather than crashing, is potentially a second bug).
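To illustrate the suspected mechanism, here is a minimal pure-Python sketch (not the actual C++ code; the memo-table behaviour is my reading of encoding.cc, and the helper name {{get_or_insert}} is just mirroring MemoTable::GetOrInsert). Duplicates collapse the re-encoded dictionary from 5 entries to 4, while the copied indices still run 0..4:

{code:python}
def get_or_insert(memo, value):
    # Mimics MemoTable::GetOrInsert: a duplicate value returns the
    # existing index instead of allocating a new one.
    if value not in memo:
        memo[value] = len(memo)
    return memo[value]

arrow_dictionary = ['a', 'd', 'c', 'd', 'e']   # note the duplicate 'd'
arrow_indices = [0, 1, 2, 3, 4]                # valid against the 5-entry Arrow dictionary

# Re-encode the dictionary values one by one, discarding the returned
# index, as the DictEncoder construction appears to do.
memo = {}
for value in arrow_dictionary:
    get_or_insert(memo, value)

parquet_dictionary = list(memo)    # ['a', 'd', 'c', 'e'] -- only 4 entries now

# The caller then copies the Arrow indices verbatim...
parquet_indices = arrow_indices

# ...but index 4 is out of bounds for the 4-entry Parquet dictionary, and
# index 3 now points at 'e' instead of 'd'. If the reader wraps the index
# around (as the observed output suggests), we get exactly the corruption
# seen in the round-trip:
decoded = [parquet_dictionary[i % len(parquet_dictionary)]
           for i in parquet_indices]
print(decoded)   # ['a', 'd', 'c', 'e', 'a']
{code}

The wrap-around here is only a guess at what the reader does with the out-of-range index; the real fix would be either to remap the copied indices through the memo table, or to preserve duplicate entries in the Parquet dictionary.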



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
