[ 
https://issues.apache.org/jira/browse/ARROW-14524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436206#comment-17436206
 ] 

Weston Pace edited comment on ARROW-14524 at 10/29/21, 11:55 PM:
-----------------------------------------------------------------

I agree work will still be needed on the reader side.  We will need to add 
something like ReadRangeCache::Cache to RandomAccessFile (which should be a 
natural extension of WillNeed).

{quote}
else if we just start batching requests, we might unintentionally block 
something (or conversely, we won't ever have anything to batch since the reader 
is just reading one block at a time).
{quote}

Correct, if a reader is using synchronous I/O then they probably wouldn't get 
any benefit: they won't call Read(10, 10) until the call to Read(0, 10) has 
finished, so nothing will ever have a chance to batch up.  Readers would either 
need to do two passes (first, a cache pass that calls WillNeed/Cache on all 
required ranges, then a read pass that actually does the reads) or they would 
need to do asynchronous readahead (similar to the way the CSV reader operates) 
so that they call ReadAsync(0, 10) and then continue on, calling 
ReadAsync(10, 10) before the first call finishes.



> [C++] Create plugging/coalescing filesystem wrapper
> ---------------------------------------------------
>
>                 Key: ARROW-14524
>                 URL: https://issues.apache.org/jira/browse/ARROW-14524
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> We have I/O optimizations scattered across some of our readers.  The most 
> prominent example is prebuffering in the parquet reader.  However, these 
> techniques are rather general purpose and will apply to IPC (see ARROW-14229) 
> as well as other readers (e.g. ORC, maybe even CSV).
> This filesystem wrapper will not generally be necessary for local filesystems 
> as the OS's filesystem schedulers are sufficient.  Most of these optimizations 
> we can accomplish by simply aiming for some configurable degree of parallelism 
> (e.g. if there are already X requests in progress then start batching).
> Goals:
>  * Batch consecutive small requests into fewer large requests
>    * We could plug (configurably) small holes in read ranges as well
>  * Potentially split large requests into concurrent small requests
>  * Support for the RandomAccessFile::WillNeed command by prefetching ranges



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
