westonpace opened a new pull request #11616:
URL: https://github.com/apache/arrow/pull/11616


   **This is still very much a WIP**
   
   This PR attempts to address several issues:
   
    * Memory mapped IPC reads always call WillNeed on the data and the user has no way to avoid this
    * Projection pushdown is only available in the synchronous API
    * Coalescing / readahead is only available via the generators API
    * There is a lot of duplicate code in the generators path
    
    It adds two new methods to RecordBatchFileReader:
    
    ```
     /// \brief Begin loading metadata for the desired batches into memory.
     ///
     /// This method will also begin loading all dictionary messages into memory.
     ///
     /// For a regular file this will immediately begin disk I/O in the background on a
     /// thread on the IOContext's thread pool.  If the file is memory mapped this will
     /// ensure the memory needed for the metadata is paged from disk into memory.
     ///
     /// \param indices Indices of the batches to prefetch
     ///                If empty then all batches will be prefetched.
     virtual Status WillNeedMetadata(const std::vector<int>& indices) = 0;
   
     /// \brief Begin loading metadata for the desired batches into memory and indicate
     ///        that the data itself should be prefetched when it is requested.
     ///
     /// This method should not be called in combination with WillNeedMetadata.  If you
     /// want to prefetch the data then use this method.  If you do not want to prefetch
     /// the data (because you are only accessing a small number of items in the batch's
     /// arrays) then you should use WillNeedMetadata.
     ///
     /// This method will immediately start the I/O for the metadata and dictionaries.
     ///
     /// This method will not immediately start the I/O for the data.  The data I/O will
     /// be started when you call ReadRecordBatch.
     ///
     /// If you want to read multiple batches in parallel then you can make concurrent
     /// calls to ReadRecordBatch or ReadRecordBatchAsync.
     /// \param indices
     /// \return
     virtual Status WillNeedBatches(const std::vector<int>& indices) = 0;
   ```
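
To make the intended call pattern concrete, here is a rough, self-contained sketch that compiles without Arrow. Everything except the two method names and signatures is hypothetical: `Status` is a minimal stand-in for `arrow::Status`, and `PrefetchingReader` and `FakeReader` are illustrative types that only record what was requested instead of doing any I/O. The sketch also assumes that an empty index list means "all batches" for `WillNeedBatches`, which the comments above only state for `WillNeedMetadata`.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for arrow::Status (hypothetical, only for this sketch).
class Status {
 public:
  Status() : ok_(true) {}
  explicit Status(std::string msg) : ok_(false), message_(std::move(msg)) {}
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  bool ok_;
  std::string message_;
};

// Hypothetical interface holding only the two methods proposed in this PR.
class PrefetchingReader {
 public:
  virtual ~PrefetchingReader() = default;
  virtual Status WillNeedMetadata(const std::vector<int>& indices) = 0;
  virtual Status WillNeedBatches(const std::vector<int>& indices) = 0;
};

// Fake reader: records which batches were requested rather than doing I/O.
class FakeReader : public PrefetchingReader {
 public:
  explicit FakeReader(int num_batches) : num_batches_(num_batches) {}

  Status WillNeedMetadata(const std::vector<int>& indices) override {
    return Mark(indices, &metadata_requested_);
  }

  Status WillNeedBatches(const std::vector<int>& indices) override {
    // Metadata I/O would start here; data I/O is deferred until the batch is read.
    Status st = Mark(indices, &metadata_requested_);
    if (!st.ok()) return st;
    return Mark(indices, &data_prefetch_enabled_);
  }

  std::set<int> metadata_requested_;
  std::set<int> data_prefetch_enabled_;

 private:
  Status Mark(const std::vector<int>& indices, std::set<int>* out) {
    // Validate every index before recording anything.
    for (int i : indices) {
      if (i < 0 || i >= num_batches_) return Status::Invalid("index out of range");
    }
    if (indices.empty()) {
      // Assumption: an empty list selects every batch.
      for (int i = 0; i < num_batches_; ++i) out->insert(i);
    } else {
      out->insert(indices.begin(), indices.end());
    }
    return Status::OK();
  }

  int num_batches_;
};

// Sparse access: prefetch only the metadata for batches 0 and 2.
inline void ExampleUsage() {
  FakeReader reader(4);
  Status st = reader.WillNeedMetadata({0, 2});
  assert(st.ok());
  assert(reader.metadata_requested_.size() == 2);
  assert(reader.data_prefetch_enabled_.empty());
}
```

A caller that intends to scan whole batches would call `WillNeedBatches` instead, and would not call both methods on the same reader.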

