steveloughran commented on issue #8000: URL: https://github.com/apache/arrow-rs/issues/8000#issuecomment-3150137812
I'm watching this. In Java, the big speedups we've got in reading data from cloud storage come from parallel GETs of rowgroups, either explicitly or based on inference of read patterns and knowledge of file structure. We've seen speedups of about 30% in read-heavy TPC queries; the more data read, the better the improvement. These parallel reads are critical to compensating for the latency of cloud storage reads, so whatever is being designed here has to be able to support them.

Ideally the different rowgroups should be specified up front, and whichever range comes in first is pushed to the processing queue, even if it is not the first in the list. We don't have that in the parquet-java stack, which still has to wait for all the ranges.

What I'd recommend then is:

1. Determine the ranges to retrieve, with as much filtering as possible; ignore LIMITs on result sizes.
2. Request the initial set of rowgroups.
3. Process results out of order, as they arrive.
4. Keep scheduling more ranges, unless the processing unit cancels the read because the results are satisfied (LIMIT/SAMPLE etc.) or some error occurs.

Oh, and collect lots of metrics: data read, data discarded, and in particular pipeline stalls waiting for new data.
