[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580471#comment-17580471 ]
Will Jones commented on ARROW-17441:
------------------------------------

{quote}I must admit I don't understand the references to compression in your comments. Were you planning to use Parquet at some point?{quote}

Sorry, I was testing memory usage from Parquet reads and seeing something like this, but decided to take Parquet out of the picture to simplify.

{quote}Other than that, Numpy-allocated memory does not use the Arrow memory pool, so I'm not sure those stats are very indicative.{quote}

Ah, I think you are likely right there.

> [Python] Memory kept after del and pool.release_unused()
> ---------------------------------------------------------
>
>                 Key: ARROW-17441
>                 URL: https://issues.apache.org/jira/browse/ARROW-17441
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Will Jones
>            Priority: Major
>
> I was trying to reproduce another issue involving memory pools not releasing
> memory, but encountered this confusing behavior: if I create a table, then
> call {{del table}}, and then {{pool.release_unused()}}, I still see
> significant memory usage. On mimalloc in particular, I see no meaningful drop
> in memory usage on either call.
> Am I missing something? My understanding prior to this has been that memory will be
> held onto by a memory pool, but will be forced free by release_unused; and
> that the system memory pool should release memory immediately. But neither of
> those seems true.
> {code:python}
> import os
> import psutil
> import time
> import gc
>
> process = psutil.Process(os.getpid())
>
> import numpy as np
> from uuid import uuid4
> import pyarrow as pa
>
>
> def gen_batches(n_groups=200, rows_per_group=200_000):
>     for _ in range(n_groups):
>         id_val = uuid4().bytes
>         yield pa.table({
>             "x": np.random.random(rows_per_group),  # This will compress poorly
>             "y": np.random.random(rows_per_group),
>             "a": pa.array(list(range(rows_per_group)), type=pa.int32()),  # This compresses with delta encoding
>             "id": pa.array([id_val] * rows_per_group),  # This compresses with RLE
>         })
>
>
> def print_rss():
>     print(f"RSS: {process.memory_info().rss:,} bytes")
>
>
> print(f"memory_pool={pa.default_memory_pool().backend_name}")
> print_rss()
>
> print("reading table")
> tab = pa.concat_tables(list(gen_batches()))
> print_rss()
>
> print("deleting table")
> del tab
> gc.collect()
> print_rss()
>
> print("releasing unused memory")
> pa.default_memory_pool().release_unused()
> print_rss()
>
> print("waiting 10 seconds")
> time.sleep(10)
> print_rss()
> {code}
> {code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
>     ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
>     ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
>
> memory_pool=mimalloc
> RSS: 44,449,792 bytes
> reading table
> RSS: 1,819,557,888 bytes
> deleting table
> RSS: 1,819,590,656 bytes
> releasing unused memory
> RSS: 1,819,852,800 bytes
> waiting 10 seconds
> RSS: 1,819,852,800 bytes
>
> memory_pool=jemalloc
> RSS: 45,629,440 bytes
> reading table
> RSS: 1,668,677,632 bytes
> deleting table
> RSS: 698,400,768 bytes
> releasing unused memory
> RSS: 699,023,360 bytes
> waiting 10 seconds
> RSS: 699,023,360 bytes
>
> memory_pool=system
> RSS: 44,875,776 bytes
> reading table
> RSS: 1,713,569,792 bytes
> deleting table
> RSS: 540,311,552 bytes
> releasing unused memory
> RSS: 540,311,552 bytes
> waiting 10 seconds
> RSS: 540,311,552 bytes
> {code}
>
-- This message was sent by Atlassian
Jira (v8.20.10#820010)