On Thu, Feb 6, 2020, 12:42 PM Antoine Pitrou <anto...@python.org> wrote:
> On 06/02/2020 19:40, Antoine Pitrou wrote:
> >
> > On 06/02/2020 19:37, Wes McKinney wrote:
> >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>> On 06/02/2020 16:26, Wes McKinney wrote:
> >>>>
> >>>> This seems useful, too. It becomes a question of where do you want to
> >>>> manage the cached memory segments, however you obtain them. I'm
> >>>> arguing that we should not have much custom code in the Parquet
> >>>> library to manage the prefetched segments (and providing the correct
> >>>> buffer slice to each column reader when they need it), and instead
> >>>> encapsulate this logic so it can be reused.
> >>>
> >>> I see, so RandomAccessFile would have some associative caching logic to
> >>> find whether the exact requested range was cached and then return it to
> >>> the caller? That sounds doable. How is lifetime handled then? Are
> >>> cached buffers kept on the RandomAccessFile until they are requested, at
> >>> which point their ownership is transferred to the caller?
> >>
> >> This seems like too much to try to build into RandomAccessFile. I would
> >> suggest a class that wraps a random access file and manages cached
> >> segments and their lifetimes through explicit APIs.
> >
> > So Parquet would expect to receive that class rather than
> > RandomAccessFile? Or would it grow separate paths for it?
>
> Actually, on a more high-level basis, is the goal to prefetch for
> sequential consumption of row groups?

Essentially yes. One "easy" optimization is to prefetch the entire
serialized row group. This is an evolution of that idea: we want to
prefetch only the needed parts of a row group in the minimum number of IO
calls (consider reading the first 10 columns from a file with 1000 columns
-- so we want to do one IO call instead of the 10 we do now).

> Regards
>
> Antoine.
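
To make the idea a bit more concrete, here is a minimal sketch (in Python,
with purely illustrative names -- CoalescingReadCache and its methods are
not an existing Arrow API) of the kind of wrapper class being discussed:
it takes the byte ranges of the column chunks the reader actually needs,
coalesces nearby ranges so they can be fetched in as few IO calls as
possible, and then serves exact sub-ranges back to each column reader from
the cached buffers.

    # Hypothetical sketch only -- names are illustrative, not an Arrow API.
    class CoalescingReadCache:
        """Wraps a seekable binary file and prefetches byte ranges in few IO calls."""

        def __init__(self, file_obj, hole_size_limit=8192):
            # Ranges separated by a gap of at most `hole_size_limit` bytes
            # are merged and fetched with a single read.
            self._file = file_obj
            self._hole_size_limit = hole_size_limit
            self._entries = []  # list of (offset, bytes), one per coalesced read

        def cache(self, ranges):
            """Prefetch the given (offset, length) ranges, coalescing nearby ones."""
            ranges = sorted(ranges)
            i = 0
            while i < len(ranges):
                begin = ranges[i][0]
                end = begin + ranges[i][1]
                j = i + 1
                # Extend the current read while the next range is close enough.
                while j < len(ranges) and ranges[j][0] - end <= self._hole_size_limit:
                    end = max(end, ranges[j][0] + ranges[j][1])
                    j += 1
                self._file.seek(begin)
                self._entries.append((begin, self._file.read(end - begin)))
                i = j

        def read(self, offset, length):
            """Return exactly `length` bytes at `offset`, served from the cache."""
            for begin, data in self._entries:
                if begin <= offset and offset + length <= begin + len(data):
                    start = offset - begin
                    return data[start:start + length]
            raise KeyError("requested range was not prefetched")

With something along these lines, the Parquet reader would compute the
(offset, length) pairs of the selected column chunks from the file
metadata, call cache() once, and each column reader would then call read()
for exactly the slice it needs -- so reading the first 10 columns of a
1000-column file can be satisfied by a single coalesced IO call, and the
prefetched buffers are owned by the cache object rather than by custom
code inside the Parquet library.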