George Deamont created ARROW-11024:
--------------------------------------

             Summary: Writing List<Struct> to parquet sometimes writes wrong data
                 Key: ARROW-11024
                 URL: https://issues.apache.org/jira/browse/ARROW-11024
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
         Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
            Reporter: George Deamont

Sometimes, when writing tables that contain List<Struct> columns, the data is written incorrectly. Here is a code sample that reproduces the problem. No exceptions are raised, but a simple equality check via equals() yields False for the second test case...

{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# Input records look like this...
# [
#     [{'x':'abc','y':'abc'}],
#     [{'x':'abc','y':'abc'}],
#     [{'x':'abc','y':'abc'}],
#     ...
#     [{'x':'abc','y':'gcb'}],
#     [{'x':'abc','y':'gcb'}],
#     [{'x':'abc','y':'gcb'}],
# ]

# Write small amount of data to parquet file, and read it back. In this case, both tables are equal.
data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
array1 = pa.array(data1)
table1 = pa.table([array1],names=['column'])
pq.write_table(table1,'temp1.parquet')
table1_1 = pq.read_table('temp1.parquet')
print(table1_1.equals(table1))

# Write larger amount of data to parquet file, and read it back. In this case, the tables are not equal.
data2 = data1*100
array2 = pa.array(data2)
table2 = pa.table([array2],names=['column'])
pq.write_table(table2,'temp2.parquet')
table2_1 = pq.read_table('temp2.parquet')
print(table2_1.equals(table2))
{code}
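
Since equals() only reports that the tables differ, a row-by-row comparison may help narrow down where the round trip diverges. The following is a minimal sketch, not part of pyarrow: first_mismatch is a hypothetical helper that works on plain Python lists, and the sample rows are made up for illustration only.

```python
def first_mismatch(expected, actual):
    """Return (index, expected_row, actual_row) for the first differing row.

    Returns None when both lists are identical. If one list is shorter,
    the missing side is reported as None at the first index past its end.
    """
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i, e, a
    if len(expected) != len(actual):
        i = min(len(expected), len(actual))
        e = expected[i] if i < len(expected) else None
        a = actual[i] if i < len(actual) else None
        return i, e, a
    return None


# Illustrative rows only (not actual output from the failing case).
expected = [[{'x': 'abc', 'y': 'abc'}]] * 3 + [[{'x': 'abc', 'y': 'gcb'}]] * 3
corrupted = list(expected)
corrupted[4] = [{'x': 'abc', 'y': 'abc'}]

print(first_mismatch(expected, expected))
print(first_mismatch(expected, corrupted))
```

With the tables from the sample above, the helper could be called as first_mismatch(table2.column('column').to_pylist(), table2_1.column('column').to_pylist()) to see which row first differs after the round trip.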