twkim112 opened a new issue, #45504:
URL: https://github.com/apache/arrow/issues/45504
### Describe the bug, including details regarding any error messages, version, and platform.
I’ve encountered a memory issue when reading Parquet files with Pandas using
the pyarrow engine. Even though pyarrow.total_allocated_bytes() reports that
allocated memory goes back to zero after each function call, the overall
process memory (as reported by psutil) keeps increasing significantly over
repeated calls.
Steps to Reproduce:
``` python
import psutil
import time
import pandas as pd
import gc
import pyarrow as pa
# pa.jemalloc_set_decay_ms(0)

def print_memory_usage():
    # Report both Arrow's own accounting and the OS-level resident set size.
    process = psutil.Process()
    mem_info = process.memory_info()
    print(f"PA allocated_bytes after function call: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")
    print(f"Memory Usage: {mem_info.rss / 1024 / 1024:.2f} MB")

def mem_and_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        # Print results
        print_memory_usage()
        print(f"Execution Time: {end_time - start_time:.6f} seconds")
        return result
    return wrapper

@mem_and_time
def test_func_pandas():
    # When using fastparquet, the memory usage is stable:
    # df = pd.read_parquet("/Users/test.parquet", engine='fastparquet')
    df = pd.read_parquet("/Users/test.parquet", engine='pyarrow')
    print(f"PA allocated_bytes inside function call: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")
    return None

if __name__ == "__main__":
    for _ in range(10000):
        test_func_pandas()
```
Observe that:
- Inside each function call, `pyarrow.total_allocated_bytes()` reports a large
  allocation (e.g., ~2646 MB).
- After the function call, `pyarrow.total_allocated_bytes()` resets to 0 MB.
- However, the overall process memory usage (as shown by psutil) increases
  with each iteration.
```
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 3147.12 MB
Execution Time: 0.669164 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 3945.00 MB
Execution Time: 0.623360 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 4494.80 MB
Execution Time: 0.681895 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 4865.27 MB
Execution Time: 0.641056 seconds
...
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 6157.64 MB
Execution Time: 0.659480 seconds
```
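To quantify the growth, the per-iteration RSS increase can be computed directly from the log. The sketch below uses only the first four consecutive readings shown above; the last reading comes after the elided part of the output, so the number of iterations before it is unknown and it is left out. The function names are illustrative.

``` python
import re

# First four consecutive "Memory Usage" readings from the output above.
LOG = """\
Memory Usage: 3147.12 MB
Memory Usage: 3945.00 MB
Memory Usage: 4494.80 MB
Memory Usage: 4865.27 MB
"""


def rss_values(log):
    """Extract the RSS readings (in MB) from the log text."""
    return [float(v) for v in re.findall(r"Memory Usage: ([\d.]+) MB", log)]


def deltas(values):
    """Return the increase between each pair of consecutive readings."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


values = rss_values(LOG)
growth = deltas(values)
print(growth)
```

For these readings the increase is roughly 798, 550 and 370 MB per call, so RSS keeps climbing, but by a smaller amount each time.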
> Environment:
>
> Python: 3.13.1
> Pandas: 2.2.3
> PyArrow: 19.0.0
> OS: macOS Sequoia 15.3.1
### Component(s)
Python