Re: [I] [C++][Parquet] Iterating over Parquet RecordBatchReader uses memory equivalent to whole file size [arrow]

2025-07-07 Thread via GitHub


adamreeve commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3044187850

   :+1: From my understanding, the ref count approach should work, as the 
ranges that are coalesced should match the ranges requested later. This 
isn't an urgent priority for me since we've just disabled pre-buffering for 
now, but I think this would be a nice improvement and I'll try to look at 
doing it soon, unless somebody else wants to 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] [C++][Parquet] Iterating over Parquet RecordBatchReader uses memory equivalent to whole file size [arrow]

2025-07-02 Thread via GitHub


pitrou commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026843605

   > After looking at this more closely, I see that read ranges are [coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178) before being stored in the ReadRangeCache, and reads usually only use a [slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227) of a cached range.
   > 
   > This probably makes it very difficult to track when a read buffer is no longer needed.
   
   I wouldn't call it "very difficult", but it would require either:
   1. using a simple ref count mechanism on each physical range (initialize it 
to the number of associated logical ranges, decrement it each time a logical 
range is requested, and release the buffer once the count reaches zero), or
   2. if that's too simplistic, tracking the full association between logical 
ranges and physical ranges
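
   Option 1 could be sketched roughly as follows. This is a hypothetical, 
simplified illustration of the ref count idea, not Arrow's actual 
ReadRangeCache API; the class and member names are made up for the example:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <map>
   #include <memory>
   #include <vector>

   // Each cached physical (coalesced) buffer carries a count of the logical
   // ranges that were merged into it. A read for a logical range decrements
   // the count; once every logical range has been consumed, the entry is
   // dropped and the buffer's memory can be reclaimed.
   struct CachedEntry {
     std::shared_ptr<std::vector<uint8_t>> buffer;  // the coalesced read
     int remaining_logical_ranges;                  // the ref count
   };

   class RefCountedCache {
    public:
     void Add(int64_t physical_offset,
              std::shared_ptr<std::vector<uint8_t>> buf,
              int num_logical_ranges) {
       entries_[physical_offset] =
           CachedEntry{std::move(buf), num_logical_ranges};
     }

     // Return the cached buffer containing a logical range and decrement
     // the count; erase the entry (freeing the buffer, assuming no other
     // holders) when the count reaches zero.
     std::shared_ptr<std::vector<uint8_t>> Read(int64_t physical_offset) {
       auto it = entries_.find(physical_offset);
       if (it == entries_.end()) return nullptr;
       auto buf = it->second.buffer;
       if (--it->second.remaining_logical_ranges == 0) {
         entries_.erase(it);
       }
       return buf;
     }

     size_t num_entries() const { return entries_.size(); }

    private:
     std::map<int64_t, CachedEntry> entries_;  // keyed by physical offset
   };
   ```

   The main caveat is the one noted below: this only frees memory promptly if 
every logical range is eventually requested exactly once, e.g. it would need 
care when column ranges are skipped.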
   





Re: [I] [C++][Parquet] Iterating over Parquet RecordBatchReader uses memory equivalent to whole file size [arrow]

2025-07-01 Thread via GitHub


adamreeve commented on issue #46935:
URL: https://github.com/apache/arrow/issues/46935#issuecomment-3026355421

   After looking at this more closely, I see that read ranges are [coalesced](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L178) before being stored in the ReadRangeCache, and reads usually only use a [slice](https://github.com/apache/arrow/blob/3b3684bb7d400b1f93d9aa17ff8f6c98641abea4/cpp/src/arrow/io/caching.cc#L227) of a cached range.
   
   This probably makes it very difficult to track when a read buffer is no 
longer needed.
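   
   The coalescing step can be sketched as follows. This is a simplified 
illustration, not the linked Arrow code: nearby logical ranges are merged 
into one physical read when the gap between them is small enough (Arrow's 
real logic also caps the coalesced size with a separate range size limit):
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <vector>
   
   // A byte range in the file: [offset, offset + length).
   struct ReadRange {
     int64_t offset;
     int64_t length;
   };
   
   // Merge sorted, non-overlapping logical ranges into fewer physical reads
   // when the hole between consecutive ranges is at most hole_size_limit
   // bytes. Each logical range then maps to a slice of one physical range.
   std::vector<ReadRange> CoalesceRanges(const std::vector<ReadRange>& ranges,
                                         int64_t hole_size_limit) {
     std::vector<ReadRange> out;
     for (const auto& r : ranges) {
       if (!out.empty() &&
           r.offset - (out.back().offset + out.back().length) <=
               hole_size_limit) {
         // Extend the previous physical range to cover this logical range.
         out.back().length = r.offset + r.length - out.back().offset;
       } else {
         out.push_back(r);
       }
     }
     return out;
   }
   ```
   
   For example, logical ranges [0, 100) and [150, 250) with a 64-byte hole 
limit become one physical read of [0, 250), so the cache holds a single 
buffer and both logical reads return slices of it.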
   
   Perhaps there should just be a big warning in the documentation of the 
pre-buffer parameter that enabling it requires keeping all previously read 
file data in memory. I'm a little surprised this is the default behaviour, 
but maybe most users are reading small Parquet files from S3 rather than 
large Parquet files on fast file systems.

