George Deamont created ARROW-11024:
--------------------------------------

             Summary: Writing List<Struct> to parquet sometimes writes wrong data
                 Key: ARROW-11024
                 URL: https://issues.apache.org/jira/browse/ARROW-11024
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
         Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
            Reporter: George Deamont


 

When writing tables that contain List<Struct> columns, the data is sometimes 
written incorrectly. The code sample below reproduces the problem. No 
exceptions are raised, but an equality check via equals() returns False for 
the second test case.

 
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Input records look like this...
# [
#     [{'x':'abc','y':'abc'}],
#     [{'x':'abc','y':'abc'}],
#     [{'x':'abc','y':'abc'}],
#     ...
#     [{'x':'abc','y':'gcb'}],
#     [{'x':'abc','y':'gcb'}],
#     [{'x':'abc','y':'gcb'}],
# ]

# Write a small amount of data to a parquet file and read it back.
# In this case, both tables are equal.
data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
array1 = pa.array(data1)
table1 = pa.table([array1],names=['column'])
pq.write_table(table1,'temp1.parquet')
table1_1 = pq.read_table('temp1.parquet')
print(table1_1.equals(table1))

# Write a larger amount of data to a parquet file and read it back.
# In this case, the tables are not equal.
data2 = data1*100
array2 = pa.array(data2)
table2 = pa.table([array2],names=['column'])
pq.write_table(table2,'temp2.parquet')
table2_1 = pq.read_table('temp2.parquet')
print(table2_1.equals(table2))

{code}
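
To narrow down where the round trip diverges, something like the following 
could be run after the script above. This is only a sketch: first_mismatch is 
a hypothetical helper, not part of pyarrow, and it assumes both columns have 
already been converted to plain Python lists. The two short lists at the 
bottom are made-up stand-ins with the same shape as the real data.

```python
def first_mismatch(expected, actual):
    """Return (index, expected_row, actual_row) for the first differing row.

    Returns None when both sequences are identical. A row that is missing
    from the shorter sequence is reported as None.
    """
    for i in range(max(len(expected), len(actual))):
        e = expected[i] if i < len(expected) else None
        a = actual[i] if i < len(actual) else None
        if e != a:
            return i, e, a
    return None


# Hypothetical stand-in data: the second row differs after the "round trip".
written = [[{'x': 'abc', 'y': 'abc'}], [{'x': 'abc', 'y': 'gcb'}]]
read_back = [[{'x': 'abc', 'y': 'abc'}], [{'x': 'abc', 'y': 'abc'}]]
print(first_mismatch(written, read_back))
```

With the tables from the script above, the inputs would presumably be 
table2.column('column').to_pylist() and table2_1.column('column').to_pylist().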
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
