[ https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653642#comment-17653642 ]

Weston Pace commented on ARROW-18400:
-------------------------------------

{quote}
The reason this happens for parquet and not for feather in this case, is 
because the Parquet file actually consists of a single row group (and I assume 
the dataset API will therefore still read that in one go, and then slice output 
batches from that to return the expected batch size in the dataset API), while 
the feather file already consists of multiple batches on disk (and thus doesn't 
result in sliced batches in memory).
{quote}

Yes.  The dataset API processes input in fairly small batches (32Ki rows), 
partly because this is cache friendly, but also because some of the hash-join 
code uses 16-bit signed integers for row indices.  The scanner does not support 
partial reading of row groups from Parquet files (I would very much like to 
support this someday), so it reads the entire row group in one chunk and then 
slices that chunk.
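
A minimal sketch of how to observe this, assuming the nested.parquet file from 
the reproduction below (ds.dataset and Dataset.to_batches are the public 
dataset APIs; the batch_size value here just mirrors the default mentioned 
above):

{code:python}
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# The reproduction file is written as a single row group (per the quote above)
print(pq.ParquetFile("nested.parquet").metadata.num_row_groups)

# The scanner reads that whole row group into one chunk, then slices it.
# batch_size controls the size of the sliced output batches, not how much of
# the row group is read at once.
dataset = ds.dataset("nested.parquet", format="parquet")
for batch in dataset.to_batches(batch_size=32 * 1024):
    print(batch.num_rows)  # at most 32Ki rows per batch
{code}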

It sounds like this numpy conversion bug should be fixed regardless.

I wonder if we also want to support better output batch size controls someday.  
I'll create an issue for it.

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> -------------------------------------------------------------------
>
>                 Key: ARROW-18400
>                 URL: https://issues.apache.org/jira/browse/ARROW-18400
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 10.0.1
>         Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>            Reporter: Adam Reeve
>            Assignee: Alenka Frim
>            Priority: Critical
>             Fix For: 11.0.0
>
>         Attachments: test_memory.py
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> # each row holds a list of arr_len structs with nullable nested fields
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if j % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
>                 'b': None if j % 100 == 0 else random.choice(range(100)),
>                 'c': None if j % 10 == 0 else make_random_string(5)
>             } for j in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem; it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without going through Parquet, but I assume the problem 
> probably isn't Parquet specific.
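> A minimal sketch of one way to measure this (not the attached test_memory.py; 
> assumes Linux, where ru_maxrss is reported in kilobytes):
> {code:python}
> import resource
> import pyarrow.parquet as pq
> 
> def peak_rss_mb():
>     # peak resident set size of this process, in MB (ru_maxrss is KB on Linux)
>     return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
> 
> table = pq.read_table('nested.parquet')
> print(f'after read_table: {peak_rss_mb():.0f} MB')
> df = table.to_pandas()
> print(f'after to_pandas:  {peak_rss_mb():.0f} MB')
> {code}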
> Memory usage I see when reading different-sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when the row count 
> doubles above 256k rows, i.e. memory grows roughly with the square of the row 
> count. With Arrow 7.0.0, memory usage is more linear but then quadruples from 
> 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1, so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)