adamreeve commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026355421

   After looking at this more closely, I see that read ranges are [coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178) before being stored in the `ReadRangeCache`, and reads usually only use a [slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227) of a cached range.
   
   Because a single cached range can back several unrelated reads, this 
probably makes it very difficult to track when a cached buffer is no longer 
needed.
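
To make the problem concrete, here is a minimal Python sketch of the coalesce-then-slice pattern. It is a toy model, not Arrow's implementation: `ReadRange`, `coalesce`, and `ToyRangeCache` are hypothetical names, and the merge rule only considers a maximum gap between ranges (it ignores any limit on the merged range size).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadRange:
    """A byte range in a file (hypothetical stand-in, not an Arrow type)."""
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length


def coalesce(ranges, hole_size_limit):
    """Merge ranges whose gap is at most hole_size_limit bytes."""
    merged = []
    for r in sorted(ranges, key=lambda x: x.offset):
        if merged and r.offset - merged[-1].end <= hole_size_limit:
            prev = merged[-1]
            new_end = max(prev.end, r.end)
            merged[-1] = ReadRange(prev.offset, new_end - prev.offset)
        else:
            merged.append(r)
    return merged


class ToyRangeCache:
    """Stores coalesced ranges and serves reads as slices of them."""

    def __init__(self, data: bytes, hole_size_limit: int):
        self._data = data
        self._hole_size_limit = hole_size_limit
        self._entries = []  # list of (ReadRange, bytes) pairs

    def cache(self, ranges):
        # Only the merged ranges are stored; the requested ranges are lost.
        for r in coalesce(ranges, self._hole_size_limit):
            self._entries.append((r, self._data[r.offset:r.end]))

    def read(self, r):
        # A read returns a view into a larger entry and leaves it in place.
        for cached, buf in self._entries:
            if cached.offset <= r.offset and r.end <= cached.end:
                start = r.offset - cached.offset
                return memoryview(buf)[start:start + r.length]
        raise KeyError(f"range {r} was not cached")

    def cached_bytes(self) -> int:
        return sum(cached.length for cached, _ in self._entries)


if __name__ == "__main__":
    cache = ToyRangeCache(bytes(range(100)), hole_size_limit=4)
    cache.cache([ReadRange(0, 10), ReadRange(12, 10), ReadRange(50, 10)])
    print(cache.cached_bytes())
    print(bytes(cache.read(ReadRange(12, 10))))
    print(cache.cached_bytes())
```

In this model, the entries hold the same number of bytes before and after a read, and an entry records nothing about which requested ranges it covers. Freeing an entry safely would need extra bookkeeping, such as a count of outstanding reads per entry.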
   
   Perhaps there should just be a big warning in the documentation of the 
pre-buffer parameter that enabling it keeps all previously read file data in 
memory. I'm a little surprised this is the default behaviour, but maybe most 
users are reading small Parquet files from S3 rather than large Parquet files 
on fast file systems.


