Michael Peleshenko created ARROW-11007:
-------------------------------------------
             Summary: [Python] Memory leak in pq.read_table and table.to_pandas
                 Key: ARROW-11007
                 URL: https://issues.apache.org/jira/browse/ARROW-11007
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Michael Peleshenko


While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.

*Sample Code*
{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": ["test"] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
{code}

*Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    161.7 MiB    161.7 MiB           1   @profile
    10                                         def read_file(f):
    11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
    12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    258.2 MiB      0.0 MiB           1       del table
    14    256.3 MiB     -1.9 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    256.3 MiB    256.3 MiB           1   @profile
    10                                         def read_file(f):
    11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
    12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    322.2 MiB      0.0 MiB           1       del table
    14    320.3 MiB     -1.9 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    320.3 MiB    320.3 MiB           1   @profile
    10                                         def read_file(f):
    11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
    12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    361.7 MiB      0.0 MiB           1       del table
    14    359.8 MiB     -1.9 MiB           1       del df
{code}

*Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    138.4 MiB    138.4 MiB           1   @profile
    10                                         def read_file(f):
    11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.7 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.3 MiB    139.3 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.5 MiB    -47.7 MiB           1       del table
    14    139.1 MiB    -32.4 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.1 MiB    139.1 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.8 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
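Not part of the original report, but one way to tell an Arrow-side leak from allocator caching is pyarrow's own memory-pool accounting. A minimal sketch (assuming the pyarrow APIs default_memory_pool().bytes_allocated(), MemoryPool.backend_name, and jemalloc_set_decay_ms, available in recent releases; the table contents are illustrative):

{code:python}
import pyarrow as pa

# Arrow's own accounting of live allocations in the default memory pool.
pool = pa.default_memory_pool()
before = pool.bytes_allocated()

# Allocate a table backed by pool memory.
table = pa.table({"x": list(range(100_000))})
during = pool.bytes_allocated()

del table
after = pool.bytes_allocated()

# If `after` drops back to roughly `before` while the process RSS stays
# high, Arrow has released the buffers and the memory is being cached by
# the allocator rather than leaked by the table/df objects.
print(pool.backend_name, before, during, after)

# With the jemalloc backend, dirty pages are returned to the OS lazily;
# forcing an immediate decay can make RSS follow Arrow's accounting.
if pool.backend_name == "jemalloc":
    pa.jemalloc_set_decay_ms(0)
{code}

Running this alongside memory_profiler would show whether bytes_allocated returns to its baseline even when the RSS reported in the logs above does not.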