[ https://issues.apache.org/jira/browse/ARROW-14524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436202#comment-17436202 ]

David Li commented on ARROW-14524:
----------------------------------

This would be less effort, but not quite as optimal as taking advantage of 
what we know about the file format. For instance, IPC is like Parquet in that 
we should be able to know all the ranges to be read up front. What's proposed 
here could still help for CSV, though; I'm not sure about ORC.

That said, we could extend the interface of RandomAccessFile, or extend the 
contract of WillNeed, to allow for both use cases in the same interface.
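As a rough sketch of what such a combined contract might look like (the names 
below are hypothetical illustrations, not the actual Arrow API): formats that 
know all their ranges up front (Parquet, IPC) would pass an explicit range 
list, while incremental readers (CSV) could declare a sequential cursor and let 
the file read ahead on its own:

```cpp
#include <cstdint>
#include <vector>

// Mirrors the shape of arrow::io::ReadRange, but this sketch is
// self-contained and not the real Arrow type.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical extended contract covering both use cases.
class PrefetchableFile {
 public:
  virtual ~PrefetchableFile() = default;
  // Hint: these exact byte ranges will be read soon (Parquet/IPC style).
  virtual void WillNeed(const std::vector<ReadRange>& ranges) = 0;
  // Hint: reads will proceed sequentially from `offset` (CSV style);
  // the implementation may read ahead at its own pace.
  virtual void WillNeedSequential(int64_t offset) = 0;
};

// Toy implementation that just records the hints it receives, so the
// two call patterns can be exercised side by side.
class RecordingFile : public PrefetchableFile {
 public:
  void WillNeed(const std::vector<ReadRange>& ranges) override {
    hinted_ranges_.insert(hinted_ranges_.end(), ranges.begin(), ranges.end());
  }
  void WillNeedSequential(int64_t offset) override {
    sequential_offset_ = offset;
  }
  std::vector<ReadRange> hinted_ranges_;
  int64_t sequential_offset_ = -1;
};
```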

I also think we would need to do some refactoring in general to take advantage 
of this anyway; otherwise, if we just start batching requests, we might 
unintentionally block something (or, conversely, never have anything to batch, 
since the reader is just reading one block at a time).

One thing we should do regardless, though, is update the ReadRangeCache to be 
able to discard ranges to save memory, particularly for the use case we have 
in Datasets where we're just sequentially scanning (a subset of) a file.
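A minimal sketch of that discard behavior, under the assumption of a 
sequential scan (this is an illustration, not the actual ReadRangeCache API; 
`RangeCache` and `DiscardBefore` are invented names):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Toy cache mapping file offset -> buffered bytes. A real cache would
// hold arrow::Buffer instances; std::string stands in here.
class RangeCache {
 public:
  void Insert(int64_t offset, std::string data) {
    cache_[offset] = std::move(data);
  }

  // Evict every cached range that ends at or before `watermark`. During
  // a sequential scan the reader can advance the watermark as it goes,
  // so already-consumed ranges don't accumulate in memory.
  void DiscardBefore(int64_t watermark) {
    for (auto it = cache_.begin(); it != cache_.end();) {
      if (it->first + static_cast<int64_t>(it->second.size()) <= watermark) {
        it = cache_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t size() const { return cache_.size(); }

 private:
  std::map<int64_t, std::string> cache_;
};
```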

> [C++] Create plugging/coalescing filesystem wrapper
> ---------------------------------------------------
>
>                 Key: ARROW-14524
>                 URL: https://issues.apache.org/jira/browse/ARROW-14524
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> We have I/O optimizations scattered across some of our readers.  The most 
> prominent example is prebuffering in the Parquet reader.  However, these 
> techniques are rather general-purpose and will apply to IPC (see ARROW-14229) 
> as well as other readers (e.g. ORC, maybe even CSV).
> This filesystem wrapper will not generally be necessary for local filesystems, 
> as the OS's filesystem schedulers are sufficient.  Most of these optimizations 
> we can accomplish by simply aiming for some configurable degree of parallelism 
> (e.g. if there are already X requests in progress, then start batching).
> Goals:
>  * Batch consecutive small requests into fewer large requests
>    * We could plug (configurably) small holes in read ranges as well
>  * Potentially split large requests into concurrent small requests
>  * Support for the RandomAccessFile::WillNeed command by prefetching ranges
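The first three goals above boil down to one coalescing pass. A self-contained 
sketch of how it might work, assuming sorted, non-overlapping input 
(`CoalesceRanges` and its parameter names are illustrative, not the actual 
Arrow implementation, though `hole_size_limit`/`range_size_limit` echo the 
knobs Arrow's CacheOptions exposes):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge requests whose gap is at most hole_size_limit bytes (plugging
// small holes), then split any merged range larger than range_size_limit
// into chunks that can be issued concurrently.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  std::vector<ReadRange> out;
  if (ranges.empty()) return out;
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });

  // Pass 1: batch consecutive small requests, plugging small holes.
  std::vector<ReadRange> merged{ranges[0]};
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    ReadRange& last = merged.back();
    int64_t end = last.offset + last.length;
    if (ranges[i].offset - end <= hole_size_limit) {
      int64_t new_end = std::max(end, ranges[i].offset + ranges[i].length);
      last.length = new_end - last.offset;
    } else {
      merged.push_back(ranges[i]);
    }
  }

  // Pass 2: split large requests into concurrent smaller ones.
  for (const ReadRange& r : merged) {
    int64_t off = r.offset;
    int64_t remaining = r.length;
    while (remaining > range_size_limit) {
      out.push_back({off, range_size_limit});
      off += range_size_limit;
      remaining -= range_size_limit;
    }
    if (remaining > 0) out.push_back({off, remaining});
  }
  return out;
}
```

A WillNeed implementation in the wrapper could then run the hinted ranges 
through such a pass and issue one prefetch per coalesced range.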



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
