[ https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624612#comment-17624612 ]
Weston Pace commented on ARROW-18156:
-------------------------------------

Another experiment might be adding a five-second sleep between the to_pandas call and the print. And then a further experiment might be adding an explicit Python garbage collection call in addition to the sleep.

> [Python/C++] High memory usage/potential leak when reading parquet using Dataset API
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-18156
>                 URL: https://issues.apache.org/jira/browse/ARROW-18156
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 4.0.1
>            Reporter: Norbert
>            Priority: Major
>
> Hi,
>
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the following snippet:
>
> {code:python}
> import os
>
> import pyarrow
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> from importlib_metadata import version
> from psutil import Process
>
>
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
>
>
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(
>             f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
>             f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}"
>         )
>
>
> if __name__ == "__main__":
>     main()
> {code}
>
> On PyArrow v4.0.1 the output is as follows:
>
> {code}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB
> {code}
>
> If I replace ds.dataset("df.pq").to_table() with pq.ParquetFile("df.pq").read(), the output is:
>
> {code}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> {code}
>
> The memory profile of the older non-dataset API is much lower: it tracks the size of the DataFrame much more closely. It also seems that the former example may have a memory leak? I thought the increase in RSS was just due to PyArrow's use of jemalloc, but I seem to be using the system allocator here.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
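The experiment Weston suggests could be sketched as below. This is only an illustrative stand-in: the pyarrow read is replaced by a plain Python allocation so the snippet runs without pyarrow or a df.pq file, and the sleep is shortened from the suggested five seconds. In the real reproduction, the placeholder would be the `ds.dataset(...).to_table().to_pandas()` calls, the sleep would be 5 seconds, and the RSS print from the original script would follow the collection.

```python
import gc
import time

def load_once():
    # Placeholder for: ds.dataset("df.pq").to_table().to_pandas()
    # (a plain allocation so this sketch runs standalone)
    return [bytearray(1 << 10) for _ in range(1000)]

for i in range(3):
    df = load_once()
    del df                      # drop the only reference before measuring
    time.sleep(1)               # five seconds in the suggested experiment;
                                # gives the allocator time to return pages to the OS
    unreachable = gc.collect()  # then force a full garbage-collection pass
    print(f"run {i}: gc.collect() found {unreachable} unreachable objects")
```

The point of the two steps is to separate causes: if RSS drops after the sleep alone, the growth was allocator caching; if it drops only after `gc.collect()`, Python reference cycles were keeping the tables alive; if it never drops, a genuine leak is more likely.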