[GitHub] [arrow] westonpace edited a comment on pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option

GitBox Thu, 18 Feb 2021 13:01:37 -0800


westonpace edited a comment on pull request #9482:
URL: https://github.com/apache/arrow/pull/9482#issuecomment-781627186



   Ok, I think I get it now.  Let's pretend there is 1 file with 3 row groups 
and 3 columns (whether or not it lives on S3 is not relevant).  Three scan 
tasks will be generated.  Task 1 needs RG1C1, RG1C2, RG1C3.  Task 2 needs 
RG2C1, RG2C2, RG2C3.  Task 3 needs RG3C1, RG3C2, RG3C3.
   
   Prebuffer will be called asking for all 9 blocks, and it will then issue 
three reads in parallel, one each for RG1, RG2, and RG3 (instead of the 9 
reads that would otherwise be issued).
   
   If there are only 3 columns in the file, is it possible that Prebuffer 
would coalesce all of this into a single read?  In that case, wouldn't all 
three tasks be blocked until the entire file has been read, so that task 1 
could no longer start running as soon as the RG1 read completes?
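
To make the question concrete, here is a minimal sketch of range coalescing in Python. It is a toy model and not Arrow's actual implementation: the `coalesce` function, the 100-byte column chunk size, the back-to-back file layout, and the two limit parameters are all assumptions chosen for illustration. The parameter names only loosely mirror the hole-size and range-size limits of a read cache.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce(ranges: List[Range], hole_size_limit: int,
             range_size_limit: int) -> List[Range]:
    """Merge byte ranges whose gap is small enough, as long as the merged
    range stays within range_size_limit. Hypothetical helper, not Arrow API."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            hole = offset - (prev_offset + prev_length)
            new_length = offset + length - prev_offset
            if hole <= hole_size_limit and new_length <= range_size_limit:
                # Extend the previous read to cover this block as well.
                merged[-1] = (prev_offset, new_length)
                continue
        merged.append((offset, length))
    return merged


# Assumed layout: 3 row groups x 3 columns, each column chunk 100 bytes,
# stored back to back (RG1C1, RG1C2, RG1C3, RG2C1, ...).
CHUNK = 100
chunks = [((rg * 3 + col) * CHUNK, CHUNK) for rg in range(3) for col in range(3)]

# A size limit that fits exactly one row group gives one read per row group.
per_row_group = coalesce(chunks, hole_size_limit=0, range_size_limit=300)

# A size limit that fits the whole file collapses everything into one read,
# so no task's data is available until the entire file has been fetched.
whole_file = coalesce(chunks, hole_size_limit=0, range_size_limit=900)

print(len(chunks), len(per_row_group), len(whole_file))
```

Under these assumptions, whether the 9 blocks end up as three reads or one depends only on how the size limit compares to the size of a row group, which is the case the question above is asking about.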
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
