Michal Glaus created ARROW-11607:
------------------------------------

             Summary: [Python] Error when reading table with list values from parquet
                 Key: ARROW-11607
                 URL: https://issues.apache.org/jira/browse/ARROW-11607
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 3.0.0, 2.0.0, 1.0.1, 1.0.0
         Environment: Python 3.7
            Reporter: Michal Glaus
I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file. Example code (pyarrow 2.0.0 and 3.0.0):
{code:java}
from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])

table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])

parquet.write_table(table, 'table.parquet')
table2 = parquet.read_table('table.parquet')
print('Actual: %s' % table2['column'][-1])
{code}
Output:
{noformat}
Expected: [1]
Actual: [0]
{noformat}
When I decrease the number of rows by one (using (1 << 20) - 1), I get the expected result:
{noformat}
Expected: [1]
Actual: [1]
{noformat}
For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.

This seems to be caused by an overflow and memory corruption: in pyarrow 3.0.0, when the appended row holds more complex values (a list of dictionaries with a float and a datetime):
{noformat}
data.append([{'a': 0.1, 'b': datetime.now()}])
{noformat}
I get this exception after calling table2.to_pandas():
{noformat}
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool{noformat}
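For reference, the second failure mode assembled into one self-contained script from the fragments above; a minimal sketch assuming the rest of the repro (row count, column name, file name) is unchanged from the first example:
{code:java}
from datetime import datetime
from pyarrow import parquet, Table

# Same 1 << 20 rows of nulls as the first repro; the appended row now
# holds a list of dicts with a float and a datetime instead of [1].
data = [None] * (1 << 20)
data.append([{'a': 0.1, 'b': datetime.now()}])

table = Table.from_arrays([data], ['column'])
parquet.write_table(table, 'table.parquet')

table2 = parquet.read_table('table.parquet')
# On pyarrow 3.0.0 this call fails with:
# /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool
table2.to_pandas()
{code}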