wingkitlee0 commented on issue #39808:
URL: https://github.com/apache/arrow/issues/39808#issuecomment-2954260310
Sharing a plot that I made a little while ago, using pyarrow 20.x and the
following options in `to_batches` (or `scanner`):
```python
batch_readahead=0,
cache_metadata=False, # new
fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
use_buffered_stream=True,
pre_buffer=False,
cache_options=pa.CacheOptions(lazy=True, prefetch_limit=0),
),
```

I used the 2.3 GB Parquet file from earlier in the thread, which has about 10 row
groups. In the figure, the blue lines are the default options; orange/green use
`cache_metadata=False` etc. `b5000` etc. denote the batch size. The top panel
shows the memory used by the current batch; the middle and bottom panels show
`pa.total_allocated_bytes` and RSS (from psutil), respectively.
There are 9-10 spikes, which seem to occur at the beginning of each row group.
Memory usage still rises with the extra options, though much more slowly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]