[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279664#comment-17279664 ]

Antoine Pitrou commented on ARROW-11007:
----------------------------------------

As you can see, the memory was returned to the allocator ("0 allocated"). The allocator is then free to return those pages to the OS or not.

Also, how is "Mem usage" measured in your script?

> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we
> observed a memory leak in the read_table and to_pandas methods. See below for
> sample code to reproduce it. Memory does not seem to be returned after
> deleting the table and df as it was in pyarrow 0.12.1.
>
> *Sample Code*
> {code:python}
> import io
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
>
> if __name__ == '__main__':
>     main()
> {code}
>
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
>
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)