[ https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597832#comment-17597832 ]
Gianluca Ficarelli commented on ARROW-17399: -------------------------------------------- I tried with a fresh virtualenv, on both Linux and Mac (Intel): Linux (Ubuntu 20.04, 32 GB): {code:java} $ python -V Python 3.9.9 $ pip freeze numpy==1.23.2 pandas==1.4.3 psutil==5.9.1 pyarrow==9.0.0 python-dateutil==2.8.2 pytz==2022.2.1 six==1.16.0 $ python test_pyarrow.py 0 time: 0.0 rss: 90.8 1 time: 3.0 rss: 1205.7 2 time: 4.6 rss: 1212.6 3 time: 4.8 rss: 710.0 4 time: 8.0 rss: 708.2 5 time: 14.6 rss: 16652.9 6 time: 17.6 rss: 16242.9 7 time: 17.7 rss: 15743.5 8 time: 20.7 rss: 866.2 {code} Mac (Monterey 12.5, 16 GB): {code:java} $ python -V Python 3.9.9 $ pip freeze numpy==1.23.2 pandas==1.4.3 psutil==5.9.1 pyarrow==9.0.0 python-dateutil==2.8.2 pytz==2022.2.1 six==1.16.0 $ python test_pyarrow.py 0 time: 0.0 rss: 64.0 1 time: 4.0 rss: 1075.0 2 time: 6.2 rss: 1136.6 3 time: 6.8 rss: 671.8 4 time: 9.8 rss: 671.8 5 time: 22.9 rss: 2477.4 6 time: 25.9 rss: 2423.4 7 time: 27.1 rss: 180.6 8 time: 30.1 rss: 180.6 {code} but when the same script is retried there is some variability on Mac in the lines 5 and 6 (I observed from 1261 to 4140 MB), while on Linux is always the same. So it seems that the rss memory usage is high on linux only. > pyarrow may use a lot of memory to load a dataframe from parquet > ---------------------------------------------------------------- > > Key: ARROW-17399 > URL: https://issues.apache.org/jira/browse/ARROW-17399 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python > Affects Versions: 9.0.0 > Environment: linux > Reporter: Gianluca Ficarelli > Priority: Major > Attachments: memory-profiler.png > > > When a pandas dataframe is loaded from a parquet file using > {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than > what should be needed to load the dataframe, and it's not freed until the > dataframe is deleted. > The problem is evident when the dataframe has a {*}column containing lists or > numpy arrays{*}, while it seems absent (or not noticeable) if the column > contains only integer or floats. > I'm attaching a simple script to reproduce the issue, and a graph created > with memory-profiler showing the memory usage. > In this example, the dataframe created with pandas needs around 1.2 GB, but > the memory usage after loading it from parquet is around 16 GB. > The items of the column are created as numpy arrays and not lists, to be > consistent with the types loaded from parquet (pyarrow produces numpy arrays > and not lists). > > {code:python} > import gc > import time > import numpy as np > import pandas as pd > import pyarrow > import pyarrow.parquet > import psutil > def pyarrow_dump(filename, df, compression="snappy"): > table = pyarrow.Table.from_pandas(df) > pyarrow.parquet.write_table(table, filename, compression=compression) > def pyarrow_load(filename): > table = pyarrow.parquet.read_table(filename) > return table.to_pandas() > def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()): > # gc.collect() > current_time = time.monotonic() - start_time > rss = process.memory_info().rss / 2 ** 20 > print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}") > if __name__ == "__main__": > print_mem(0) > rows = 5000000 > df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]}) > print_mem(1) > > pyarrow_dump("example.parquet", df) > print_mem(2) > > del df > print_mem(3) > time.sleep(3) > print_mem(4) > df = pyarrow_load("example.parquet") > print_mem(5) > time.sleep(3) > print_mem(6) > del df > print_mem(7) > time.sleep(3) > print_mem(8) > {code} > Run with memory-profiler: > {code:bash} > mprof run --multiprocess python test_pyarrow.py > {code} > Output: > {code:java} > mprof: Sampling memory every 0.1s > running new process > 0 time: 0.0 rss: 135.4 > 1 time: 4.9 rss: 1252.2 > 2 time: 7.1 rss: 1265.0 > 3 time: 7.5 rss: 760.2 > 4 time: 10.7 rss: 758.9 > 5 time: 19.6 rss: 16745.4 > 6 time: 22.6 rss: 16335.4 > 7 time: 22.9 rss: 15833.0 > 8 time: 25.9 rss: 955.0 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)