Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

Wes McKinney Thu, 06 Feb 2020 12:03:18 -0800

On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou <[email protected]> wrote:
>
>
> On 06/02/2020 at 20:20, Wes McKinney wrote:
> >> Actually, on a more high-level basis, is the goal to prefetch for
> >> sequential consumption of row groups?
> >>
> >
> > Essentially yes. One "easy" optimization is to prefetch the entire
> > serialized row group. This is an evolution of that idea where we want to
> > prefetch only the needed parts of a row group in a minimum number of IO
> > calls (consider reading the first 10 columns from a file with 1000 columns
> > -- so we want to do one IO call instead of 10 like we do now).
>
> There are no situations where you would want to consume a scattered
> subset of row groups (e.g. predicate pushdown)?


There are. If it can be demonstrated that there are performance gains
resulting from IO optimizations involving multiple row groups then I
see no reason not to implement them.
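
To make the "10 IO calls into one" idea above concrete, here is a rough
sketch of range coalescing. It is only an illustration under assumed
names: coalesce_ranges, hole_size_limit, and the byte offsets are all
hypothetical, not an existing API or a real file layout.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], hole_size_limit: int) -> List[Range]:
    """Merge byte ranges whose gaps are at most hole_size_limit bytes.

    Hypothetical helper: each returned range stands for one IO call.
    """
    if not ranges:
        return []
    ordered = sorted(ranges)
    merged = [ordered[0]]
    for offset, length in ordered[1:]:
        last_offset, last_length = merged[-1]
        last_end = last_offset + last_length
        if offset - last_end <= hole_size_limit:
            # Close enough: extend the previous read to cover this chunk too.
            new_end = max(last_end, offset + length)
            merged[-1] = (last_offset, new_end - last_offset)
        else:
            # Gap too large: start a separate read.
            merged.append((offset, length))
    return merged


# Assumed layout: the first 10 column chunks of a row group sit back to back.
column_chunks = [(4 + i * 1000, 1000) for i in range(10)]
print(coalesce_ranges(column_chunks, hole_size_limit=0))
# [(4, 10000)]
```

The same function covers the scattered case: chunks from row groups that
are far apart stay as separate reads unless hole_size_limit is large
enough to bridge them, so whether merging across row groups pays off is
a matter of tuning and measurement.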

> Regards
>
> Antoine.
