Re: [I] [C++][Parquet] Iterating over Parquet RecordBatchReader uses memory equivalent to whole file size [arrow]
adamreeve commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3044187850

:+1: From my understanding, the ref count approach should work, as the ranges that are coalesced should match the ranges requested later. This isn't an urgent priority for me since we've just disabled pre-buffering for now, but I think this would be a nice improvement and I'll try to look at doing this soon, unless somebody else wants to 😄

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
pitrou commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026843605

> After looking at this more closely, I see that read ranges are [coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178) before being stored in the ReadRangeCache, and reads usually only use a [slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227) of a cached range.
>
> This probably makes it very difficult to track when a read buffer is no longer needed.

I wouldn't call it "very difficult", but it would require either:

1. use a simple ref count mechanism on each physical range (initialize it to the number of associated logical ranges, and decrement it each time a logical range is requested)
2. if that's too simplistic, track the association between logical ranges and physical ranges
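Option 1 above could look roughly like the following. This is only a sketch of the idea, not Arrow's actual `ReadRangeCache`: the `PhysicalRange` and `RangeCache` names are hypothetical, and real cached entries would hold `arrow::Buffer` slices rather than plain vectors. The point is just that each coalesced (physical) range starts with a ref count equal to the number of logical ranges it serves, and its buffer is dropped once the last logical read has consumed it.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// One coalesced ("physical") range held in the cache.
struct PhysicalRange {
  int64_t offset;
  int64_t length;
  std::shared_ptr<std::vector<uint8_t>> buffer;  // cached file data
  int refs;  // logical ranges not yet consumed from this buffer
};

// Hypothetical cache keyed by range start offset.
class RangeCache {
 public:
  // Register a coalesced range that serves `logical_count` logical reads.
  void Add(int64_t offset, int64_t length, int logical_count) {
    entries_[offset] = PhysicalRange{
        offset, length,
        std::make_shared<std::vector<uint8_t>>(static_cast<size_t>(length)),
        logical_count};
  }

  // Serve one logical read at `offset`; release the cached buffer
  // once every associated logical range has been requested.
  std::shared_ptr<std::vector<uint8_t>> Read(int64_t offset) {
    // Find the physical range whose start is at or before `offset`.
    auto it = entries_.upper_bound(offset);
    assert(it != entries_.begin());
    --it;
    auto buf = it->second.buffer;  // caller would slice this buffer
    if (--it->second.refs == 0) {
      entries_.erase(it);  // last logical read: drop the cached data
    }
    return buf;
  }

  size_t CachedRanges() const { return entries_.size(); }

 private:
  std::map<int64_t, PhysicalRange> entries_;
};
```

With this scheme, a physical range coalesced from two column-chunk reads is freed as soon as both reads have happened, instead of living until the whole file has been read.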
adamreeve commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026355421

After looking at this more closely, I see that read ranges are [coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178) before being stored in the ReadRangeCache, and reads usually only use a [slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227) of a cached range. This probably makes it very difficult to track when a read buffer is no longer needed.

Perhaps there should just be a big warning in the documentation of the pre-buffer parameter that it requires storing all previously read file data in memory. I'm a little surprised this is the default behaviour, but maybe most users are reading small Parquet files from S3 rather than large Parquet files on fast file systems.
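For readers unfamiliar with the coalescing step the comment links to: nearby requested ranges are merged into one larger physical read when the gap between them is small enough, which is why later logical reads only see a slice of a cached buffer. The sketch below is an illustrative simplification, not Arrow's implementation; `CoalesceRanges` is a hypothetical name, though Arrow's `CacheOptions` does expose a similar `hole_size_limit` knob.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge sorted neighbouring ranges whose gap ("hole") is at most
// hole_size_limit bytes, trading some over-read for fewer I/O calls.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> out;
  for (const auto& r : ranges) {
    if (!out.empty() &&
        r.offset - (out.back().offset + out.back().length) <= hole_size_limit) {
      // Extend the previous physical range to cover this one too.
      int64_t end = std::max(out.back().offset + out.back().length,
                             r.offset + r.length);
      out.back().length = end - out.back().offset;
    } else {
      out.push_back(r);
    }
  }
  return out;
}
```

Because one physical range can cover several logical ranges, freeing memory eagerly requires knowing when all of its logical consumers are done, which is exactly what the ref-counting proposal addresses.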
