[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294938#comment-17294938 ]
shadowdsp commented on ARROW-11007:
-----------------------------------

[~westonpace] thank you very much!

> [Python] Memory leak in pq.read_table and table.to_pandas
> ----------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df, as it was in pyarrow 0.12.1.
>
> *Sample Code*
> {code:python}
> import io
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
>
> if __name__ == '__main__':
>     main()
> {code}
>
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
>
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}
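Note that memory_profiler reports process-level RSS, which cannot distinguish memory still held by Arrow from freed memory that the allocator keeps cached, so the retained MiB in the tables above are not necessarily an Arrow-level leak. Below is a rough diagnostic sketch along those lines, assuming a pyarrow 2.x environment; the report() helper is illustrative, pa.jemalloc_set_decay_ms requires a jemalloc-enabled build (the Windows wheels are not), and MemoryPool.release_unused() / MemoryPool.backend_name only exist in newer releases, so those calls are guarded rather than assumed.

{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def report(label):
    # pa.total_allocated_bytes() is what the Arrow memory pool itself still
    # holds; memory_profiler's numbers are process RSS, which also includes
    # pages the allocator keeps cached after Arrow has freed them.
    pool = pa.default_memory_pool()
    backend = getattr(pool, "backend_name", "unknown")  # property may be absent on older releases
    print(f"{label}: backend={backend}, arrow allocated={pa.total_allocated_bytes():,} bytes")


def main():
    rows = 2_000_000
    df = pd.DataFrame({"string": ["test"] * rows,
                       "int": [5] * rows,
                       "float": [2.0] * rows})
    stream = io.BytesIO()
    pq.write_table(pa.Table.from_pandas(df, preserve_index=False), stream)

    try:
        # Ask jemalloc to return dirty pages immediately instead of caching
        # them; only available on builds that bundle jemalloc.
        pa.jemalloc_set_decay_ms(0)
    except Exception as exc:
        print("jemalloc tuning not available:", exc)

    report("before read")
    stream.seek(0)
    table = pq.read_table(stream)
    converted = table.to_pandas(strings_to_categorical=True)
    report("after read + to_pandas")
    del table, converted
    report("after del")

    try:
        # Hand any remaining cached pages back to the OS (newer pyarrow only).
        pa.default_memory_pool().release_unused()
    except AttributeError:
        print("MemoryPool.release_unused() not available in this version")
    report("after release_unused")


if __name__ == "__main__":
    main()
{code}

If pa.total_allocated_bytes() drops back to roughly zero after the dels while RSS stays high, the retained memory is allocator caching rather than a leak in read_table/to_pandas. Swapping in the plain system allocator is another way to check this, e.g. by setting the ARROW_DEFAULT_MEMORY_POOL=system environment variable before importing pyarrow, or by calling pa.set_memory_pool(pa.system_memory_pool()).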