On Wed, Mar 18, 2020 at 11:42 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Le 18/03/2020 à 17:36, David Li a écrit :
> > Hi all,
> >
> > Thanks to Antoine for implementing the core read coalescing logic.
> >
> > We've taken a look at what else needs to be done to get this working,
> > and it sounds like the following changes would be worthwhile,
> > independent of the rest of the optimizations we discussed:
> >
> > - Add benchmarks of the current Parquet reader with the current S3File
> >   (and other file implementations) so we can track
> >   improvements/regressions
>
> Instead of S3, you can use the Slow streams and Slow filesystem
> implementations. It may better protect against varying external conditions.
>
> > - Use the coalescing inside the Parquet reader (even without a column
> >   filter hint - this would subsume PARQUET-1698)
>
> I'm assuming this would be done at the RowGroupReader level, right?
>
> > - In coalescing, split large read ranges into smaller ones (this would
> >   further improve on PARQUET-1698 by taking advantage of parallel reads)
>
> I don't understand what the "advantage" would be. Can you elaborate?
Empirically it is known to S3 users that parallelizing reads improves
throughput. I think it has to do with the way that Amazon's
infrastructure works. That's why it's important that we set ourselves up
to do performance testing in a realistic environment in AWS rather than
simulating it.

> Regards
>
> Antoine.
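To make the "split large ranges" idea concrete, here is a minimal sketch of what it could look like: break one large (offset, length) read into fixed-size pieces and issue them concurrently, then stitch the results back together. This is purely illustrative (the function names, chunk size, and `read_at` callback are assumptions, not Arrow's actual API), but it is the basic shape of how splitting enables parallel reads against a store like S3:

```python
# Hypothetical sketch -- not Arrow's API. Shows splitting one large read
# range into smaller sub-ranges that can be fetched in parallel.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # illustrative: 8 MiB per request


def split_range(offset, length, chunk_size=CHUNK_SIZE):
    """Split a single (offset, length) read into smaller (offset, length) pieces."""
    ranges = []
    end = offset + length
    while offset < end:
        n = min(chunk_size, end - offset)
        ranges.append((offset, n))
        offset += n
    return ranges


def read_split(read_at, offset, length, chunk_size=CHUNK_SIZE, max_workers=8):
    """Issue the sub-reads concurrently via read_at(offset, length) and
    concatenate the results in order."""
    pieces = split_range(offset, length, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda r: read_at(*r), pieces)
    return b"".join(results)
```

Each sub-range becomes an independent request, so a store whose per-connection throughput is capped (as is commonly reported for S3) can serve them in parallel; the trade-off is extra request overhead, which is why the chunk size matters and is worth benchmarking rather than guessing.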