Re: [I] [DISCUSS] Decouple IO and CPU operations in the Parquet Reader (push decoder?) [arrow-rs]

via GitHub Thu, 07 Aug 2025 11:51:54 -0700


alamb commented on issue #7983:
URL: https://github.com/apache/arrow-rs/issues/7983#issuecomment-3165363332


   > I'm not sure I understand why this model isn't possible with the 
pull-based reader? I could implement an 
[AsyncFileReader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html)
 ...
   
   The thing you can't do with the current parquet pull decoder is know what IO 
requests will be coming *next*. When the pull decoder asks you 
for more data, it needs those bytes to make any more progress, and thus your 
decoding stalls until you feed the bytes in
   
   To have effective pre-fetching, you need to know what ranges are going to be 
needed *before* the reader needs them
   
   So in the arrow-rs parquet case, for example, this might mean as you are 
reading one row group, calculate the ranges to fetch from object store for the 
*next* row group. Right now, the decoder won't tell you this information until 
it actually tries to read the next row group
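
   To make the idea concrete, here is a rough sketch of "plan the next row group's IO while decoding the current one". It is not the arrow-rs API: `RowGroupPlan`, `ranges_for_next`, `coalesce`, and `prefetch_ranges` are hypothetical names, and the sketch assumes the per-row-group byte ranges have already been worked out from the footer metadata.

```rust
use std::ops::Range;

/// Hypothetical: the byte ranges one row group needs from the file.
/// Assumed to be derived ahead of time from the footer metadata.
#[derive(Debug, Clone)]
struct RowGroupPlan {
    ranges: Vec<Range<u64>>,
}

/// Hypothetical: while row group `current` is being decoded, look up the
/// ranges the *next* row group will need (None if `current` is the last).
fn ranges_for_next(plans: &[RowGroupPlan], current: usize) -> Option<&[Range<u64>]> {
    plans.get(current + 1).map(|p| p.ranges.as_slice())
}

/// Merge ranges that overlap or sit within `max_gap` bytes of each other,
/// so the object store sees fewer, larger requests.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                // Close enough to the previous range: extend it.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        out.push(r);
    }
    out
}

/// Hypothetical: the requests to issue in the background while row group
/// `current` is still decoding. Empty when there is no next row group.
fn prefetch_ranges(plans: &[RowGroupPlan], current: usize, max_gap: u64) -> Vec<Range<u64>> {
    ranges_for_next(plans, current)
        .map(|r| coalesce(r.to_vec(), max_gap))
        .unwrap_or_default()
}
```

   A caller would invoke something like `prefetch_ranges(&plans, i, 8)` just before it starts decoding row group `i` and hand the result to its IO layer. The point is only that the ranges are available *before* the decoder asks for them.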
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

