adamreeve commented on issue #46935: URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026355421
After looking at this more closely, I see that read ranges are [coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178) before being stored in the ReadRangeCache, and reads usually only use a [slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227) of a cached range. This probably makes it very difficult to track when a read buffer is no longer needed. Perhaps there should just be a big warning in the documentation of the pre-buffer parameter that this will require storing all previously read file data in memory. I'm a little surprised this is the default behaviour, but maybe most users are reading small Parquet files from S3 rather than large Parquet files on fast file systems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org