On Wed, Mar 18, 2020 at 11:42 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Le 18/03/2020 à 17:36, David Li a écrit :
> > Hi all,
> >
> > Thanks to Antoine for implementing the core read coalescing logic.
> >
> > We've taken a look at what else needs to be done to get this working,
> > and it sounds like the following changes would be worthwhile,
> > independent of the rest of the optimizations we discussed:
> >
> > - Add benchmarks of the current Parquet reader with the current S3File
> >   (and other file implementations) so we can track
> >   improvements/regressions
>
> Instead of S3, you can use the Slow streams and Slow filesystem
> implementations. It may better protect against varying external conditions.
>
> > - Use the coalescing inside the Parquet reader (even without a column
> >   filter hint - this would subsume PARQUET-1698)
>
> I'm assuming this would be done at the RowGroupReader level, right?
>
> > - In coalescing, split large read ranges into smaller ones (this would
> >   further improve on PARQUET-1698 by taking advantage of parallel reads)
>
> I don't understand what the "advantage" would be. Can you elaborate?
Empirically it is known to S3 users that parallelizing reads improves
throughput. I think it has to do with the way that Amazon's
infrastructure works. That's why it's important that we set ourselves up
to do performance testing in a realistic environment in AWS rather than
simulating it.

> Regards
>
> Antoine.
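To make the "split large ranges" idea concrete, here is a minimal sketch of what it could look like: break one large (offset, length) read into fixed-size pieces and issue them concurrently, then stitch the results back together. This is purely illustrative (the function names, chunk size, and `read_at` callback are assumptions, not Arrow's actual API), but it is the basic shape of how splitting enables parallel reads against a store like S3:

```python
# Hypothetical sketch -- not Arrow's API. Shows splitting one large read
# range into smaller sub-ranges that can be fetched in parallel.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # illustrative: 8 MiB per request


def split_range(offset, length, chunk_size=CHUNK_SIZE):
    """Split a single (offset, length) read into smaller (offset, length) pieces."""
    ranges = []
    end = offset + length
    while offset < end:
        n = min(chunk_size, end - offset)
        ranges.append((offset, n))
        offset += n
    return ranges


def read_split(read_at, offset, length, chunk_size=CHUNK_SIZE, max_workers=8):
    """Issue the sub-reads concurrently via read_at(offset, length) and
    concatenate the results in order."""
    pieces = split_range(offset, length, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda r: read_at(*r), pieces)
    return b"".join(results)
```

Each sub-range becomes an independent request, so a store whose per-connection throughput is capped (as is commonly reported for S3) can serve them in parallel; the trade-off is extra request overhead, which is why the chunk size matters and is worth benchmarking rather than guessing.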