lidavidm commented on pull request #9482:
URL: https://github.com/apache/arrow/pull/9482#issuecomment-781635820


   > Ok, I think I get it now. Let's pretend ~~we are outside of S3,~~ (nvm, S3 not relevant) there is 1 file, 3 row groups, and 3 columns. Three scan tasks will be generated. Task 1 needs RG1C1, RG1C2, RG1C3. Task 2 needs RG2C1, RG2C2, RG2C3. Task 3 needs RG3C1, RG3C2, RG3C3.
   > 
   > Prebuffer will be called asking for all 9 blocks and it will then issue three reads in parallel (instead of the 9 reads that would otherwise be issued): RG1, RG2, and RG3.
   
   The number of reads that actually gets issued depends on the parameters, but it could be anywhere from 1 to 9. (If you had a filesystem with high bandwidth but a very large time-to-first-byte, you'd issue one; if you had a filesystem with very low latency and high bandwidth, you'd issue all 9.)
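   To make those knobs concrete, here's a toy sketch of the coalescing decision. This is not Arrow's actual algorithm; the `hole_size_limit`/`range_size_limit` names are modeled on `arrow::io::CacheOptions`, and the byte sizes are invented:

```python
def coalesce(ranges, hole_size_limit, range_size_limit):
    """Merge sorted (offset, length) ranges, closing holes of up to
    hole_size_limit bytes, without growing a read past range_size_limit."""
    ranges = sorted(ranges)
    merged = []
    cur_off, cur_len = ranges[0]
    for off, length in ranges[1:]:
        hole = off - (cur_off + cur_len)
        new_len = off + length - cur_off
        if hole <= hole_size_limit and new_len <= range_size_limit:
            cur_len = new_len  # coalesce, paying `hole` wasted bytes
        else:
            merged.append((cur_off, cur_len))
            cur_off, cur_len = off, length
    merged.append((cur_off, cur_len))
    return merged

# 3 row groups x 3 columns: 9 column chunks of 100 bytes, back to back.
chunks = [(i * 100, 100) for i in range(9)]

# High time-to-first-byte: generous limits -> 1 big read.
print(len(coalesce(chunks, hole_size_limit=1 << 20, range_size_limit=1 << 30)))
# Low latency: cap each read at one chunk -> 9 reads.
print(len(coalesce(chunks, hole_size_limit=1 << 20, range_size_limit=100)))
# Cap each read at one row group -> 3 reads.
print(len(coalesce(chunks, hole_size_limit=1 << 20, range_size_limit=300)))
```

   Same nine chunks, three different parameter settings, and you get 1, 9, or 3 reads.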
   
   > If there are only 3 columns in the file, is it possible Prebuffer would coalesce this all into one read? In that case wouldn't all three tasks be blocked until the entire file is read, preventing the ability for task 1 to start running as soon as RG1 is issued?
   
   Yes, it's possible they'd all be coalesced into one read. You're trading latency (for the first task) for overall throughput. The point is that for some filesystems, this can be faster than issuing separate reads. Picking numbers out of thin air: if it takes 10ms to establish a connection and 5ms of bandwidth to read one column, then one request costs 10ms + 3 * 5ms = 25ms, while three sequential requests cost 3 * (10ms + 5ms) = 45ms. Even three fully parallel requests each take 10ms + 15ms = 25ms (15ms since you're splitting the available bandwidth 3 ways), so the single request is never slower and pays the connection setup only once.
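   Spelling out that back-of-the-envelope arithmetic (same made-up numbers; the sequential case is added here for comparison):

```python
# Made-up numbers from above: 10 ms to establish a connection,
# 5 ms of bandwidth to read one column, 3 columns.
setup_ms, column_ms, ncols = 10, 5, 3

# One coalesced request: pay setup once, then stream all three columns.
one_coalesced_read = setup_ms + ncols * column_ms  # 25 ms

# Three requests issued one after another (what a naive reader does):
three_sequential = ncols * (setup_ms + column_ms)  # 45 ms

# Three requests in parallel: the bandwidth is split 3 ways, so each
# column now takes 15 ms to arrive; every request finishes at 25 ms.
three_parallel_each = setup_ms + ncols * column_ms  # 25 ms

print(one_coalesced_read, three_sequential, three_parallel_each)  # 25 45 25
```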
   
   I guess one thing I should mention is that just because two ranges are adjacent doesn't mean that PreBuffer will always coalesce them (and conversely, just because two ranges are *not* adjacent doesn't mean that PreBuffer won't coalesce them anyway and pay the penalty of reading the extra data between them).
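   For example (invented offsets), merging two ranges separated by a 64-byte hole reads 64 bytes nobody asked for, in exchange for one request instead of two:

```python
# Invented layout: two 100-byte column chunks separated by a 64-byte hole.
a = (0, 100)    # (offset, length)
b = (164, 100)

hole = b[0] - (a[0] + a[1])          # 64 bytes nobody asked for
merged = (a[0], b[0] + b[1] - a[0])  # one read of 264 bytes covering both

print(hole, merged)  # 64 (0, 264)
```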
   
   Another thing worth mentioning is that the Parquet reader decodes one column at a time, and each column decoder will block until its chunk is read. So you're moving the blocking up front and hopefully consolidating it, instead of blocking sequentially several times.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

