steveloughran commented on issue #8000: URL: https://github.com/apache/arrow-rs/issues/8000#issuecomment-3150137812
I'm watching this. In Java, the big speedups we've got in reading data from cloud storage come from parallel GETs of rowgroups, either explicitly or based on inference of read patterns and knowledge of file structure. We've seen speedups of about 30% in read-heavy TPC queries; the more data read, the better the improvement. These parallel reads are critical to compensating for the latency of cloud storage reads, so whatever is being designed here has to be able to support them.

Ideally the different rowgroups should be specified up front, and whichever range comes in first is pushed to the processing queue, even if it is not the first in the list. We don't have that in the parquet-java stack, which still has to wait for all the ranges.

What I'd recommend then is:

1. Determine the ranges to retrieve, with as much filtering as possible; ignore LIMITs on result sizes.
2. Request the initial set of rowgroups.
3. Process results out of order, as they arrive.
4. Keep scheduling more ranges, unless the processing unit cancels the read because the results are satisfied (LIMIT/SAMPLE etc.) or some error occurs.

Oh, and collect lots of metrics: data read, data discarded, and in particular pipeline stalls waiting for new data.
