[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253676#comment-17253676 ]
Weston Pace commented on ARROW-11007:
-------------------------------------

Hello, thank you for writing up this analysis. Pyarrow uses jemalloc, a custom memory allocator that does its best to hold on to memory it has obtained from the OS (since requesting memory from the OS can be an expensive operation). Unfortunately, this makes it difficult to track line-by-line memory usage with tools like memory_profiler. There are a couple of options:

 * You could use [https://arrow.apache.org/docs/python/generated/pyarrow.total_allocated_bytes.html#pyarrow.total_allocated_bytes] to track allocation instead of using memory_profiler (it might be interesting to see if there is a way to get memory_profiler to use this function instead of kernel statistics).
 * You can also put the following line at the top of your script. It configures jemalloc to release memory immediately instead of holding on to it (this will likely have some performance implications):

pa.jemalloc_set_decay_ms(0)

The behavior you are seeing is pretty typical for jemalloc. For further reading, in addition to reading up on jemalloc itself, I encourage you to take a look at these other issues for more discussion and examples of jemalloc behavior:

https://issues.apache.org/jira/browse/ARROW-6910
https://issues.apache.org/jira/browse/ARROW-7305

I have run your test read 10,000 times and memory usage does predictably stabilize. In addition, total_allocated_bytes is behaving exactly as expected. So I do not believe there is any evidence of a memory leak in this script.
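The first option above could look roughly like the following. This is only a sketch, not something taken from the reporter's script: the helper name `allocation_delta` and the `demo` function are made up for illustration, and the table contents are arbitrary. The helper takes the counter as an argument so it can be pointed at `pa.total_allocated_bytes` (Arrow's own accounting) rather than at process-level statistics.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def allocation_delta(
    work: Callable[[], T], allocated_bytes: Callable[[], int]
) -> Tuple[T, int]:
    """Run work() and return (result, change in the allocation counter).

    `allocated_bytes` is any zero-argument callable returning a byte count,
    for example pyarrow.total_allocated_bytes.
    """
    before = allocated_bytes()
    result = work()
    return result, allocated_bytes() - before


def demo() -> None:
    """Hypothetical usage with pyarrow (imported here so the helper stays generic)."""
    import io

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Small, arbitrary table written to an in-memory Parquet file.
    table = pa.table({"int": [5] * 1000, "float": [2.0] * 1000})
    stream = io.BytesIO()
    pq.write_table(table, stream)
    stream.seek(0)

    # Bytes Arrow reports as allocated while the loaded table is still alive.
    loaded, delta = allocation_delta(
        lambda: pq.read_table(stream), pa.total_allocated_bytes
    )
    print("allocated by read_table:", delta)

    # After dropping the reference, Arrow's counter should fall again even if
    # the process RSS reported by the OS does not.
    del loaded
    print("total allocated after del:", pa.total_allocated_bytes())
```

Calling `demo()` prints Arrow's view of its allocations, which is the figure to compare across repeated reads when checking for a leak.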
> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we
> observed a memory leak in the read_table and to_pandas methods. See below for
> sample code to reproduce it. Memory does not seem to be returned after
> deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)