wingkitlee0 commented on issue #39808:
URL: https://github.com/apache/arrow/issues/39808#issuecomment-2954260310
Sharing a plot that I made a little while ago, using pyarrow 20.x and the
following options in `to_batches` (or `scanner`):
```python
batch_readahead=0,
cache_metadata=False, # new
fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
use_buffered_stream=True,
pre_buffer=False,
cache_options=pa.CacheOptions(lazy=True, prefetch_limit=0),
),
```

I used the 2.3 GB Parquet file from earlier in the thread, which has about 10 row
groups. In the figure, the blue lines are the default options; orange/green use
`cache_metadata=False` etc. `b5000` etc. denote the batch size. The top panel
shows the memory used by the current batch; the middle and bottom panels show
`pa.total_allocated_bytes` and RSS (from psutil), respectively.
There are 9-10 spikes, which seem to occur at the beginning of each row group.
Memory usage still rises with the extra options, though much more slowly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]