In case folks are interested in how some other systems deal with IO management / scheduling, the comments in
https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h
and related files might be interesting.

On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou <solip...@pitrou.net> wrote:
> >
> > On Wed, 5 Feb 2020 15:46:15 -0600
> > Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > I'll comment in more detail on some of the other items in due course,
> > > but I think this should be handled by an implementation of
> > > RandomAccessFile (that wraps a naked RandomAccessFile) with some
> > > additional methods, rather than adding this to the abstract
> > > RandomAccessFile interface, e.g.
> > >
> > > class CachingInputFile : public RandomAccessFile {
> > >  public:
> > >   CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
> > >   Status CacheRanges(...);
> > > };
> > >
> > > etc.
> >
> > IMHO it may be more beneficial to expose it as an asynchronous API on
> > RandomAccessFile, for example:
> >
> > class RandomAccessFile {
> >  public:
> >   struct Range {
> >     int64_t offset;
> >     int64_t length;
> >   };
> >
> >   std::vector<Promise<std::shared_ptr<Buffer>>>
> >   ReadRangesAsync(std::vector<Range> ranges);
> > };
> >
> > The reason is that some APIs such as the C++ AWS S3 API have their own
> > async support, which may be beneficial to use over a generic Arrow
> > thread-pool implementation.
> >
> > Also, by returning a Promise instead of simply caching the results, you
> > make it easier to handle the lifetime of the results.
>
> This seems useful, too. It becomes a question of where you want to
> manage the cached memory segments, however you obtain them. I'm
> arguing that we should not have much custom code in the Parquet
> library to manage the prefetched segments (and provide the correct
> buffer slice to each column reader when it needs one), and instead
> encapsulate this logic so it can be reused.
>
> The API I proposed was just a mockup; I agree it would make sense for
> the prefetching to occur asynchronously, so that a column reader can
> proceed as soon as its coalesced chunk has been prefetched, rather
> than having to wait synchronously for all prefetching to complete.
>
> > (Promise<T> can be something like std::future<Result<T>>, though
> > std::future<> has annoying limitations and we may want to write our
> > own instead)
> >
> > Regards
> >
> > Antoine.