Michal Glaus created ARROW-11607:
------------------------------------

             Summary: [Python] Error when reading table with list values from parquet
                 Key: ARROW-11607
                 URL: https://issues.apache.org/jira/browse/ARROW-11607
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 3.0.0, 2.0.0, 1.0.1, 1.0.0
         Environment: Python 3.7
            Reporter: Michal Glaus
I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file. Example code (pyarrow 2.0.0 and 3.0.0):
{code:java}
from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])

table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])

parquet.write_table(table, 'table.parquet')
table2 = parquet.read_table('table.parquet')
print('Actual: %s' % table2['column'][-1])
{code}
Output:
{noformat}
Expected: [1]
Actual: [0]
{noformat}
When I decrease the number of rows by one (using (1 << 20) - 1), I get the expected result:
{noformat}
Expected: [1]
Actual: [1]
{noformat}
For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.

This seems to be caused by an overflow and memory corruption: in pyarrow 3.0.0, when the appended row holds more complex values (a list of dictionaries with a float and a datetime):
{noformat}
data.append([{'a': 0.1, 'b': datetime.now()}])
{noformat}
I get this exception after calling table2.to_pandas():
{noformat}
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool{noformat}
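For reference, the second failure mode assembled into one self-contained script from the fragments above; a minimal sketch assuming the rest of the repro (row count, column name, file name) is unchanged from the first example:
{code:java}
from datetime import datetime
from pyarrow import parquet, Table

# Same 1 << 20 rows of nulls as the first repro; the appended row now
# holds a list of dicts with a float and a datetime instead of [1].
data = [None] * (1 << 20)
data.append([{'a': 0.1, 'b': datetime.now()}])

table = Table.from_arrays([data], ['column'])
parquet.write_table(table, 'table.parquet')

table2 = parquet.read_table('table.parquet')
# On pyarrow 3.0.0 this call fails with:
# /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool
table2.to_pandas()
{code}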