[
https://issues.apache.org/jira/browse/ARROW-14524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436202#comment-17436202
]
David Li commented on ARROW-14524:
----------------------------------
This would be less effort, but wouldn't be quite as optimal as taking advantage
of what we know about the file format. For instance, IPC is like Parquet in that
we should be able to know all the ranges to be read up front. What's proposed
here could still help for CSV, though, and I'm not sure about ORC.
That said, we could extend the interface of RandomAccessFile or extend the
contract of WillNeed to allow for both such use cases in the same interface.
I also think we would need to do some refactoring in general to take advantage
of this anyway - otherwise, if we just start batching requests, we might
unintentionally block something (or conversely, we will never have anything to
batch since the reader is just reading one block at a time).
One thing we should do regardless is update the ReadRangeCache to be able to
discard ranges to save on memory, particularly for the use case we have in
Datasets where we're just sequentially scanning (a subset of) a file.
> [C++] Create plugging/coalescing filesystem wrapper
> ---------------------------------------------------
>
> Key: ARROW-14524
> URL: https://issues.apache.org/jira/browse/ARROW-14524
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
>
> We have I/O optimizations scattered across some of our readers. The most
> prominent example is prebuffering in the Parquet reader. However, these
> techniques are rather general purpose and will apply to IPC (see ARROW-14229)
> as well as other readers (e.g. ORC, maybe even CSV).
> This filesystem wrapper will not generally be necessary for local filesystems,
> as the OS's filesystem schedulers are sufficient. Most of these goals we can
> accomplish by simply aiming for some configurable degree of parallelism (e.g.
> if there are already X requests in progress, then start batching).
> Goals:
> * Batch consecutive small requests into fewer large requests
> * Plug (configurably) small holes in read ranges
> * Potentially split large requests into concurrent small requests
> * Support the RandomAccessFile::WillNeed call by prefetching ranges
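The first two goals (batching consecutive small requests and plugging small
holes) can be sketched roughly as below. This is a minimal illustration, not
Arrow's implementation; `ReadRange`, `CoalesceRanges`, and the `hole_size` /
`max_request` parameters are assumed names for this example:

```cpp
#include <cstdint>
#include <vector>

// A byte range to be read from a file.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Given sorted, non-overlapping ranges, merge adjacent ones whenever the gap
// ("hole") between them is at most hole_size bytes and the merged request
// stays within max_request bytes. Plugging a small hole wastes a few bytes of
// transfer but turns two round trips into one.
std::vector<ReadRange> CoalesceRanges(const std::vector<ReadRange>& ranges,
                                      int64_t hole_size, int64_t max_request) {
  std::vector<ReadRange> out;
  for (const ReadRange& r : ranges) {
    if (!out.empty()) {
      ReadRange& last = out.back();
      const int64_t gap = r.offset - (last.offset + last.length);
      const int64_t merged = (r.offset + r.length) - last.offset;
      if (gap <= hole_size && merged <= max_request) {
        last.length = merged;  // plug the hole: issue one larger read
        continue;
      }
    }
    out.push_back(r);  // gap too large (or merged request too big): new read
  }
  return out;
}
```

A real wrapper would also need the reverse transform (splitting requests larger
than some threshold into concurrent reads) and would apply this only for
filesystems where per-request latency dominates, e.g. object stores.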