[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640869#comment-17640869
 ] 

Anja Boskovic edited comment on ARROW-18400 at 11/29/22 5:48 PM:
-----------------------------------------------------------------

The [generic dataset API|https://arrow.apache.org/docs/java/dataset.html]
currently supports Parquet, Arrow IPC, and CSV. I thought Arrow IPC made a
natural comparison point to see whether this issue is Parquet-specific.
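
For reference, a nested DataFrame with the same list-of-struct shape as the one
in the issue description can be written out in both formats like this (a minimal
sketch with a small inline dataset; not necessarily how the original test files
were produced):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

# Small column of list<struct> values, mirroring the shape used in the issue
nested_col = [
    np.array([{'a': np.arange(3, dtype=np.int64), 'b': j, 'c': 'xyz'}
              for j in range(10)])
    for _ in range(1000)
]
df = pd.DataFrame({'c1': nested_col})

# Write the same data in both formats so the benchmark below can read either one
df.to_parquet("nested.parquet")
feather.write_feather(pa.Table.from_pandas(df), "nested.feather")
{code}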

I found that the quadratic jump in memory usage kicks in at larger file sizes
for Arrow IPC than it does for Parquet, but it still happens.

 
||Num rows||Peak Memory Usage Parquet (MB)||Peak Memory Usage Arrow IPC (MB)||
|32,000|129|129|
|64,000|258|258|
|128,000|516|516|
|256,000|2,036|1,033|
|512,000|14,300|2,065|
|1,024,000|OOM (on my machine)|14,200|
|2,048,000| |OOM (on my machine)|

Here is the code that I used to run these tests. I am using the generic dataset 
API directly.

 
{code:python}
import pyarrow.dataset as ds
import tracemalloc

filename = "nested"
form = "feather"

# Load the file through the generic dataset API and materialise it as a Table
dataset = ds.dataset(f"{filename}.{form}", format=form)
table = dataset.to_table()

# Only trace allocations made during the Table -> DataFrame conversion
tracemalloc.start()
df = table.to_pandas()

stats = tracemalloc.take_snapshot().statistics("lineno")[:10]
for stat in stats:
    print(stat)

tracemalloc.stop()
{code}
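
For a Parquet run, {{form}} can simply be set to {{"parquet"}} so that the
script reads the Parquet copy of the same data.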



> [Python] Quadratic memory usage of Table.to_pandas with nested data
> -------------------------------------------------------------------
>
>                 Key: ARROW-18400
>                 URL: https://issues.apache.org/jira/browse/ARROW-18400
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 10.0.1
>         Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>            Reporter: Adam Reeve
>            Assignee: Alenka Frim
>            Priority: Critical
>             Fix For: 11.0.0
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem; it's the to_pandas method
> that exhibits the large memory usage. I haven't tested generating nested
> Arrow data in memory without writing Parquet from Pandas, but I assume the
> problem probably isn't Parquet-specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
