In case folks are interested in how some other systems deal with IO management / scheduling, the comments in
https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h
and related files might be interesting.

On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou <solip...@pitrou.net> wrote:
> >
> > On Wed, 5 Feb 2020 15:46:15 -0600
> > Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > I'll comment in more detail on some of the other items in due course,
> > > but I think this should be handled by an implementation of
> > > RandomAccessFile (that wraps a naked RandomAccessFile) with some
> > > additional methods, rather than adding this to the abstract
> > > RandomAccessFile interface, e.g.
> > >
> > > class CachingInputFile : public RandomAccessFile {
> > >  public:
> > >   CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
> > >   Status CacheRanges(...);
> > > };
> > >
> > > etc.
> >
> > IMHO it may be more beneficial to expose it as an asynchronous API on
> > RandomAccessFile, for example:
> >
> > class RandomAccessFile {
> >  public:
> >   struct Range {
> >     int64_t offset;
> >     int64_t length;
> >   };
> >
> >   std::vector<Promise<std::shared_ptr<Buffer>>>
> >   ReadRangesAsync(std::vector<Range> ranges);
> > };
> >
> > The reason is that some APIs such as the C++ AWS S3 API have their own
> > async support, which may be beneficial to use over a generic Arrow
> > thread-pool implementation.
> >
> > Also, by returning a Promise instead of simply caching the results, you
> > make it easier to handle the lifetime of the results.
>
> This seems useful, too. It becomes a question of where you want to
> manage the cached memory segments, however you obtain them. I'm
> arguing that we should not have much custom code in the Parquet
> library to manage the prefetched segments (and provide the correct
> buffer slice to each column reader when it needs one), and instead
> encapsulate this logic so it can be reused.
>
> The API I proposed was just a mockup; I agree it would make sense for
> the prefetching to occur asynchronously, so that a column reader can
> proceed as soon as its coalesced chunk has been prefetched, rather
> than having to wait synchronously for all prefetching to complete.
>
> > (Promise<T> can be something like std::future<Result<T>>, though
> > std::future<> has annoying limitations and we may want to write our
> > own instead)
> >
> > Regards
> >
> > Antoine.