[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580959#comment-17580959 ]
Ninh Chu commented on ARROW-11007:
----------------------------------

Hi, I also encounter a memory problem in v9.0.0, but in my case the memory pool scales with the dataset size even though I tried to limit the batch size. Based on the documentation, RecordBatchReader is the safe way to read a big dataset, but if memory scales with the dataset size, that defeats the purpose of Dataset and RecordBatchReader.

{code:python}
import pyarrow.dataset as ds
import pyarrow as pa

pa.jemalloc_set_decay_ms(0)

delta_ds = ds.dataset("delta")
row_count = delta_ds.count_rows()
print("row_count = ", row_count)

reader = delta_ds.scanner(batch_size=10000).to_reader()
batch = reader.read_next_batch()
print("first batch row count = ", batch.num_rows)
print("Total allocated mem for pyarrow = ", pa.total_allocated_bytes() // 1024**2)
{code}

The results are interesting.

Small dataset:
{code}
dataset row_count = 66651
first batch row count = 10000
Total allocated mem for pyarrow = 103
{code}

Big dataset, created by duplicating the file 4 times:
{code}
dataset row_count = 333255
first batch row count = 10000
Total allocated mem for pyarrow = 412
{code}

If I load all the data in the dataset into a Table:

{code:python}
import pyarrow.dataset as ds
import pyarrow as pa

pa.jemalloc_set_decay_ms(0)

delta_ds = ds.dataset("delta")
row_count = delta_ds.count_rows()
print("dataset row_count = ", row_count)

pa_table = delta_ds.to_table()
print("Total allocated mem for pyarrow = ", pa.total_allocated_bytes() // 1024**2)
{code}

{code}
dataset row_count = 333255
Total allocated mem for pyarrow = 512
{code}
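One way to narrow down whether the scanner is really holding on to the whole dataset, rather than just bounded read-ahead, is to drain the reader and watch pa.total_allocated_bytes() after every batch. A minimal sketch along those lines, reusing the "delta" path and batch size from the snippet above (this is an illustration, not part of the original comment):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Same example dataset path and batch size as in the snippet above.
delta_ds = ds.dataset("delta")
reader = delta_ds.scanner(batch_size=10000).to_reader()

peak_mib = 0
for i, batch in enumerate(reader):  # a RecordBatchReader can be iterated batch by batch
    allocated_mib = pa.total_allocated_bytes() // 1024**2
    peak_mib = max(peak_mib, allocated_mib)
    print(f"batch {i}: {batch.num_rows} rows, {allocated_mib} MiB currently allocated")

print("peak allocation while streaming =", peak_mib, "MiB")
{code}

If the per-batch figure levels off after a few batches, the overhead is bounded read-ahead; if it keeps growing with the number of files in the dataset, that would match the scaling described above.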
> [Python] Memory leak in pq.read_table and table.to_pandas
> ----------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Assignee: Weston Pace
>            Priority: Major
>         Attachments: Screenshot 2022-08-17 at 11.10.05.png, benchmark-pandas-parquet.py
>
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
>
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
>
> if __name__ == '__main__':
>     main()
> {code}
>
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
>
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
>
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
>
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
>
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
>
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}
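Reading the numbers above, it can help to separate "memory pyarrow still tracks" from "memory the allocator keeps cached for reuse". Below is a rough sketch in the spirit of the reproduction script; psutil (a dependency of memory_profiler), the guard on the allocator backend, and MemoryPool.release_unused() (only present in recent pyarrow releases) are additions that are not part of the original report:

{code:python}
import io

import pandas as pd
import psutil  # also a dependency of memory_profiler
import pyarrow as pa
import pyarrow.parquet as pq


def rss_mib():
    """Resident set size of this process, in MiB."""
    return psutil.Process().memory_info().rss // 1024**2


# Build the same in-memory parquet payload as in the reproduction above.
rows = 2000000
df = pd.DataFrame({"string": ["test"] * rows, "int": [5] * rows, "float": [2.0] * rows})
parquet_stream = io.BytesIO()
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), parquet_stream)

parquet_stream.seek(0)
table = pq.read_table(parquet_stream)
df2 = table.to_pandas(strings_to_categorical=True)
del table
del df2

# If Arrow's own counter is back near zero while RSS stays high, the pages are
# being held by the allocator for reuse rather than leaked by pyarrow.
print("arrow-tracked:", pa.total_allocated_bytes() // 1024**2, "MiB | RSS:", rss_mib(), "MiB")

# Knobs that influence how eagerly cached memory is handed back to the OS.
if pa.default_memory_pool().backend_name == "jemalloc":
    pa.jemalloc_set_decay_ms(0)            # only meaningful for the jemalloc pool
pa.default_memory_pool().release_unused()  # recent pyarrow; may be a no-op for some pools
print("after release_unused | RSS:", rss_mib(), "MiB")
{code}

The point of the comparison is that pa.total_allocated_bytes() only counts buffers Arrow still references, so a gap between it and RSS points at allocator caching rather than a reference being kept alive.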