Xiening Dai created ORC-262:
-------------------------------

             Summary: Support async prefetch in Orc reader
                 Key: ORC-262
                 URL: https://issues.apache.org/jira/browse/ORC-262
             Project: ORC
          Issue Type: Improvement
          Components: C++
            Reporter: Xiening Dai


Currently the RowReader::next() method reads a batch of rows and returns them
to be processed by the runtime. The call is synchronous, meaning that the
execution thread is blocked while the reader is loading data from disk. We
could parallelize execution and data loading through async prefetch, using the
logic described below.

In SeekableFileInputStream::Next(), we first check whether the requested data
block has already been prefetched. If so, we simply return the buffer to the
caller; otherwise we issue a sync call to read the data from the file stream.
Regardless of how the requested block was loaded, we always issue another async
call to prefetch the next block within the current stream.
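
The logic above can be sketched roughly as follows. This is only an
illustration, not the actual SeekableFileInputStream code: the names
PrefetchingStream and ReadFn, the std::string buffers, and the use of
std::async are all assumptions made to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <future>
#include <string>
#include <utility>

// Hypothetical blocking read: returns `length` bytes starting at `offset`.
using ReadFn = std::function<std::string(uint64_t offset, uint64_t length)>;

// Hypothetical stand-in for a stream that prefetches one block ahead.
class PrefetchingStream {
 public:
  PrefetchingStream(ReadFn read, uint64_t streamLength, uint64_t blockSize)
      : read_(std::move(read)), length_(streamLength), blockSize_(blockSize) {}

  // Moves the read position; a pending prefetch for another offset goes stale.
  void seek(uint64_t offset) { position_ = std::min(offset, length_); }

  // Fills `out` with the next block; returns false at end of stream.
  bool next(std::string* out) {
    if (position_ >= length_) return false;
    const uint64_t size = blockLength(position_);
    if (pending_.valid() && pendingOffset_ == position_) {
      *out = pending_.get();          // requested block was prefetched
      ++hits_;
    } else {
      *out = read_(position_, size);  // fall back to a sync read
      ++misses_;
    }
    position_ += size;
    // Either way, prefetch the following block. Reassigning pending_ waits
    // for and discards any stale prefetch that was never consumed.
    if (position_ < length_) {
      pendingOffset_ = position_;
      pending_ = std::async(std::launch::async, read_, position_,
                            blockLength(position_));
    }
    return true;
  }

  uint64_t hits() const { return hits_; }
  uint64_t misses() const { return misses_; }

 private:
  uint64_t blockLength(uint64_t offset) const {
    return std::min(blockSize_, length_ - offset);
  }

  ReadFn read_;
  uint64_t length_;
  uint64_t blockSize_;
  uint64_t position_ = 0;
  uint64_t pendingOffset_ = 0;
  std::future<std::string> pending_;
  uint64_t hits_ = 0;
  uint64_t misses_ = 0;
};
```

With purely sequential reads, only the first call to next() misses; a seek()
to an offset other than the prefetched one causes one more sync read.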

Additionally, orc::InputStream will need a new method that performs an async
read for a given offset and length.
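
One possible shape for such a method is sketched below. SketchInputStream is a
simplified, hypothetical stand-in and not the real orc::InputStream; the name
readAsync, its std::future<void> return type, and the default implementation
on top of the blocking read() are assumptions for illustration only.

```cpp
#include <cstdint>
#include <cstring>
#include <future>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified input stream interface.
class SketchInputStream {
 public:
  virtual ~SketchInputStream() = default;

  virtual uint64_t getLength() const = 0;

  // Blocking read of `length` bytes at `offset` into `buf`.
  virtual void read(void* buf, uint64_t length, uint64_t offset) = 0;

  // Assumed async variant: the default runs the blocking read on another
  // thread. `buf` and the stream must stay alive until the future is ready.
  virtual std::future<void> readAsync(void* buf, uint64_t length,
                                      uint64_t offset) {
    return std::async(std::launch::async, [this, buf, length, offset] {
      read(buf, length, offset);
    });
  }
};

// In-memory implementation, used only to exercise the interface.
class MemoryInputStream : public SketchInputStream {
 public:
  explicit MemoryInputStream(std::string data) : data_(std::move(data)) {}

  uint64_t getLength() const override { return data_.size(); }

  void read(void* buf, uint64_t length, uint64_t offset) override {
    if (offset > data_.size() || length > data_.size() - offset) {
      throw std::out_of_range("read past end of stream");
    }
    std::memcpy(buf, data_.data() + offset, length);
  }

 private:
  std::string data_;
};
```

A file system with native async IO could override readAsync() directly
instead of relying on the thread-based default. Errors raised by the read
surface when the caller invokes get() on the returned future.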

According to our experiment, async prefetch can significantly reduce the IO
wait time on a heavily loaded distributed file system. By carefully choosing
the prefetch data block size, we can maximize the overlap between runtime
execution and data loading, and achieve a relatively high cache hit rate (~85%).




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
