Adam Reeve created ARROW-18400:
----------------------------------

             Summary: Quadratic memory usage of Table.to_pandas with nested data
                 Key: ARROW-18400
                 URL: https://issues.apache.org/jira/browse/ARROW-18400
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 10.0.1
         Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X with 
64 GB RAM
            Reporter: Adam Reeve


Reading nested Parquet data and then converting it to a Pandas DataFrame shows 
quadratic memory usage and will eventually run out of memory for reasonably 
small files. I had initially thought this was a regression since 7.0.0, but it 
looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row 
counts.

Example code to generate nested Parquet data:
{code:python}
import numpy as np
import random
import string
import pandas as pd

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 1_024_000
filename = 'nested.parquet'

arr_len = 10
nested_col = []
for i in range(nrows):
    nested_col.append(np.array(
            [{
                'a': None if i % 1000 == 0 else np.random.choice(10000, 
size=3).astype(np.int64),
                'b': None if i % 100 == 0 else random.choice(range(100)),
                'c': None if i % 10 == 0 else make_random_string(5)
            } for i in range(arr_len)]
        ))
df = pd.DataFrame({'c1': nested_col})
df.to_parquet(filename)
{code}
And then read into a DataFrame with:
{code:python}
import pyarrow.parquet as pq
table = pq.read_table(filename)
df = table.to_pandas()
{code}
Only reading to an Arrow table isn't a problem, it's the to_pandas method that 
exhibits the large memory usage. I haven't tested generating nested Arrow data 
in memory without writing Parquet from Pandas but I assume the problem probably 
isn't Parquet specific.

Memory usage I see when reading different sized files:
||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
|32,000|362|361|
|64,000|531|531|
|128,000|1,152|1,101|
|256,000|2,888|1,402|
|512,000|10,301|3,508|
|1,024,000|38,697|5,313|
|2,048,000|OOM|20,061|

With Arrow 10.0.1, memory usage approximately quadruples when row count doubles 
above 256k rows. With Arrow 7.0.0 memory usage is more linear but then 
quadruples from 1024k to 2048k rows.

PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to