[ https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638729#comment-17638729 ]
Alenka Frim commented on ARROW-18400: ------------------------------------- I think it is Parquet specific. There seems to be an issue if data is read from {{.parquet}} format and then converted to pandas which doesn't happen if converting to a pandas dataframe from a pyarrow table. This is the code I am working with at the moment: {code:python} import numpy as np import random import string _characters = string.ascii_uppercase + string.digits + string.punctuation def make_random_string(N=10): return ''.join(random.choice(_characters) for _ in range(N)) nrows = 1_024_000 filename = 'nested_pandas.parquet' arr_len = 10 nested_col = [] for i in range(nrows): nested_col.append(np.array( [{ 'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64), 'b': None if i % 100 == 0 else random.choice(range(100)), 'c': None if i % 10 == 0 else make_random_string(5) } for i in range(arr_len)] )) import pyarrow as pa import pyarrow.parquet as pq table = pa.table({'c1': nested_col}) # Works correctly table.to_pandas() # c1 # 0 [{'a': None, 'b': None, 'c': None}, {'a': [399... # 1 [{'a': None, 'b': None, 'c': None}, {'a': [832... # 2 [{'a': None, 'b': None, 'c': None}, {'a': [731... # 3 [{'a': None, 'b': None, 'c': None}, {'a': [589... # 4 [{'a': None, 'b': None, 'c': None}, {'a': [159... # ... ... # 1023995 [{'a': None, 'b': None, 'c': None}, {'a': [922... # 1023996 [{'a': None, 'b': None, 'c': None}, {'a': [865... # 1023997 [{'a': None, 'b': None, 'c': None}, {'a': [222... # 1023998 [{'a': None, 'b': None, 'c': None}, {'a': [143... # 1023999 [{'a': None, 'b': None, 'c': None}, {'a': [287... # [1024000 rows x 1 columns] # Writing to .parquet and loading it into arrow again pq.write_table(table, filename) table_from_parquet = pq.read_table(filename) # Kill - converting to pandas table_from_parquet.to_pandas() print(tracemalloc.get_traced_memory()) # zsh: killed python memory_usage.py{code} I still have to look into what is causing it but there has to be some extra information being passed from parquet to arrow and then to pandas that is triggering this. Will research further next week. > [Python] Quadratic memory usage of Table.to_pandas with nested data > ------------------------------------------------------------------- > > Key: ARROW-18400 > URL: https://issues.apache.org/jira/browse/ARROW-18400 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Affects Versions: 10.0.1 > Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X > with 64 GB RAM > Reporter: Adam Reeve > Assignee: Alenka Frim > Priority: Critical > Fix For: 11.0.0 > > > Reading nested Parquet data and then converting it to a Pandas DataFrame > shows quadratic memory usage and will eventually run out of memory for > reasonably small files. I had initially thought this was a regression since > 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks > in at higher row counts. > Example code to generate nested Parquet data: > {code:python} > import numpy as np > import random > import string > import pandas as pd > _characters = string.ascii_uppercase + string.digits + string.punctuation > def make_random_string(N=10): > return ''.join(random.choice(_characters) for _ in range(N)) > nrows = 1_024_000 > filename = 'nested.parquet' > arr_len = 10 > nested_col = [] > for i in range(nrows): > nested_col.append(np.array( > [{ > 'a': None if i % 1000 == 0 else np.random.choice(10000, > size=3).astype(np.int64), > 'b': None if i % 100 == 0 else random.choice(range(100)), > 'c': None if i % 10 == 0 else make_random_string(5) > } for i in range(arr_len)] > )) > df = pd.DataFrame({'c1': nested_col}) > df.to_parquet(filename) > {code} > And then read into a DataFrame with: > {code:python} > import pyarrow.parquet as pq > table = pq.read_table(filename) > df = table.to_pandas() > {code} > Only reading to an Arrow table isn't a problem, it's the to_pandas method > that exhibits the large memory usage. I haven't tested generating nested > Arrow data in memory without writing Parquet from Pandas but I assume the > problem probably isn't Parquet specific. > Memory usage I see when reading different sized files on a machine with 64 GB > RAM: > ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)|| > |32,000|362|361| > |64,000|531|531| > |128,000|1,152|1,101| > |256,000|2,888|1,402| > |512,000|10,301|3,508| > |1,024,000|38,697|5,313| > |2,048,000|OOM|20,061| > |4,096,000| |OOM| > With Arrow 10.0.1, memory usage approximately quadruples when row count > doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but > then quadruples from 1024k to 2048k rows. > PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something > changed between 7.0.0 and 8.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)