Re: [I] [C++][Parquet] Iterating over Parquet RecordBatchReader uses memory equivalent to whole file size [arrow]

via GitHub Tue, 01 Jul 2025 21:16:55 -0700


adamreeve commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026355421


   After looking at this more closely, I see that read ranges are 
[coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178)
 before being stored in the ReadRangeCache, and a read usually uses only a 
[slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227)
 of a cached range.
   
   Because a single cached buffer can back several separate reads, this 
probably makes it very difficult to track when a read buffer is no longer 
needed.
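
   As a rough illustration of the lifetime problem, here is a minimal sketch 
that uses only the standard library. It is a toy model, not the code in 
`caching.cc`: `Range`, `Coalesce`, `CountReaders`, and the `hole_size_limit` 
parameter as used here are hypothetical names for this example, and the 
sketch ignores any maximum size for a merged range.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range in a file (not Arrow's own type).
struct Range {
  int64_t offset;
  int64_t length;
  int64_t end() const { return offset + length; }
};

// Merge ranges whose gap is at most `hole_size_limit` bytes.
// Assumption: simplified policy, with no upper bound on merged range size.
std::vector<Range> Coalesce(std::vector<Range> ranges, int64_t hole_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const Range& a, const Range& b) { return a.offset < b.offset; });
  std::vector<Range> out;
  for (const Range& r : ranges) {
    if (!out.empty() && r.offset - out.back().end() <= hole_size_limit) {
      // Extend the previous cached range to cover this request too.
      const int64_t new_end = std::max(out.back().end(), r.end());
      out.back().length = new_end - out.back().offset;
    } else {
      out.push_back(r);
    }
  }
  return out;
}

// For each cached (coalesced) range, count how many of the originally
// requested ranges are served by a slice of it.
std::vector<int> CountReaders(const std::vector<Range>& cached,
                              const std::vector<Range>& requested) {
  std::vector<int> counts(cached.size(), 0);
  for (const Range& r : requested) {
    for (std::size_t i = 0; i < cached.size(); ++i) {
      if (r.offset >= cached[i].offset && r.end() <= cached[i].end()) {
        ++counts[i];
        break;
      }
    }
  }
  return counts;
}
```

   With requests `{0, 100}`, `{110, 50}`, `{170, 30}`, `{10000, 200}` and a 
hole limit of 64 bytes, the first three requests collapse into one cached 
range that serves three reads. In this model, that buffer could only be 
released after all three slices were consumed, so the cache would need 
per-buffer bookkeeping rather than per-read bookkeeping.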
   
   Perhaps there should just be a big warning in the documentation of the 
pre-buffer parameter that enabling it keeps all previously read file data 
in memory. I'm a little surprised this is the default behaviour, but maybe 
most users are reading small Parquet files from S3 rather than large Parquet 
files on fast file systems.


