Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

Antoine Pitrou Wed, 18 Mar 2020 09:43:14 -0700


On 18/03/2020 at 17:36, David Li wrote:
> Hi all,
> 
> Thanks to Antoine for implementing the core read coalescing logic.
> 
> We've taken a look at what else needs to be done to get this working,
> and it sounds like the following changes would be worthwhile,
> independent of the rest of the optimizations we discussed:
> 
> - Add benchmarks of the current Parquet reader with the current S3File
> (and other file implementations) so we can track
> improvements/regressions


Instead of S3, you can use the slow stream and slow filesystem
implementations.  Their added latency is under your control, so the
benchmarks should be better protected against varying external conditions.
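
To illustrate the idea (this is only a sketch in Python, not the actual
Arrow code): a wrapper that adds a fixed, configurable latency to every
read gives a benchmark the cost profile of a remote store without
depending on network conditions.  The class name SlowFile and its
parameters are hypothetical and chosen for this example only.

```python
import io
import time


class SlowFile:
    """Hypothetical file wrapper that delays every read by a fixed latency."""

    def __init__(self, raw, latency_s, sleep=time.sleep):
        self._raw = raw              # any seekable binary file-like object
        self._latency_s = latency_s  # simulated per-request latency, in seconds
        self._sleep = sleep          # injectable, so tests need not really wait

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking is local bookkeeping, so no latency is charged here.
        return self._raw.seek(offset, whence)

    def read(self, nbytes=-1):
        # Each read models one round trip to the (simulated) remote store.
        self._sleep(self._latency_s)
        return self._raw.read(nbytes)
```

Because the latency is a constant, two runs of the same benchmark pay the
same simulated cost per read, which makes regressions easier to spot.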

> - Use the coalescing inside the Parquet reader (even without a column
> filter hint - this would subsume PARQUET-1698)

I'm assuming this would be done at the RowGroupReader level, right?

> - In coalescing, split large read ranges into smaller ones (this would
> further improve on PARQUET-1698 by taking advantage of parallel reads)

I don't understand what the "advantage" would be.  Can you elaborate?
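
To make sure we are talking about the same thing, here is a rough Python
sketch of the two operations as I understand them: merging nearby
(offset, length) ranges, then cutting the merged ranges into bounded
pieces that could be issued as separate requests.  The function names and
the parameters hole_size_limit and range_size_limit are assumptions made
for this sketch, not a description of the existing implementation.

```python
def coalesce(ranges, hole_size_limit):
    """Merge (offset, length) ranges separated by at most hole_size_limit bytes."""
    merged = []
    for offset, length in sorted(ranges):
        if length == 0:
            continue  # empty ranges need no read
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= hole_size_limit:
                # Close enough (or overlapping): extend the previous range.
                end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


def split(ranges, range_size_limit):
    """Cut each range into pieces of at most range_size_limit bytes."""
    if range_size_limit <= 0:
        raise ValueError("range_size_limit must be positive")
    pieces = []
    for offset, length in ranges:
        while length > range_size_limit:
            pieces.append((offset, range_size_limit))
            offset += range_size_limit
            length -= range_size_limit
        pieces.append((offset, length))
    return pieces
```

For example, coalesce([(0, 10), (12, 8), (100, 50)], 4) yields
[(0, 20), (100, 50)], and splitting that result with a limit of 20 yields
[(0, 20), (100, 20), (120, 20), (140, 10)].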

Regards

Antoine.
