> Instead of S3, you can use the Slow streams and Slow filesystem 
> implementations.  It may better protect against varying external conditions.

I think we'd want several different benchmarks: we want to ensure we
don't regress local filesystem performance, and we also want to
measure in an actual S3 environment. It would also be good to measure
S3-compatible systems such as Google Cloud Storage.
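
For the local case, something like the Google Benchmark sketch below
is what I have in mind: wrap a LocalFileSystem in the SlowFileSystem
so every read pays a simulated latency, which keeps the numbers stable
without depending on network conditions. Untested, and the file path
and latency value here are placeholders:

#include <memory>

#include <benchmark/benchmark.h>

#include "arrow/filesystem/filesystem.h"
#include "arrow/filesystem/localfs.h"
#include "arrow/memory_pool.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"
#include "parquet/exception.h"

static void BM_ReadParquetSlowFS(benchmark::State& state) {
  auto local = std::make_shared<arrow::fs::LocalFileSystem>();
  // Wrap the local filesystem so each read pays ~5 ms of simulated
  // latency, approximating a remote store deterministically.
  auto slow = std::make_shared<arrow::fs::SlowFileSystem>(
      local, /*average_latency=*/0.005);
  for (auto _ : state) {
    auto infile = slow->OpenInputFile("/tmp/bench.parquet").ValueOrDie();
    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
        infile, arrow::default_memory_pool(), &reader));
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
    benchmark::DoNotOptimize(table);
  }
}
BENCHMARK(BM_ReadParquetSlowFS);
BENCHMARK_MAIN();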

>> - Use the coalescing inside the Parquet reader (even without a column
>> filter hint - this would subsume PARQUET-1698)
>
> I'm assuming this would be done at the RowGroupReader level, right?

Ideally we'd be able to coalesce across row groups as well, though it
may be easier to start with coalescing within a single row group (I
need to familiarize myself more with the reader).
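
To make the cross-row-group idea concrete, here's roughly what I'm
picturing: collect the byte range of every needed column chunk from
all row groups up front, then hand the whole list to the coalescing
layer at once, so ranges that abut across a row-group boundary can be
merged. This is an untested sketch against the parquet metadata
accessors, not actual reader code; the resulting ranges would go to
the new coalescing reader (e.g. ReadRangeCache) rather than being read
one chunk at a time:

#include <cstdint>
#include <vector>

#include "arrow/io/interfaces.h"  // arrow::io::ReadRange
#include "parquet/metadata.h"

std::vector<arrow::io::ReadRange> ColumnRangesAcrossRowGroups(
    const parquet::FileMetaData& metadata,
    const std::vector<int>& column_indices) {
  std::vector<arrow::io::ReadRange> ranges;
  for (int rg = 0; rg < metadata.num_row_groups(); ++rg) {
    auto row_group = metadata.RowGroup(rg);
    for (int col : column_indices) {
      auto chunk = row_group->ColumnChunk(col);
      // A chunk starts at its dictionary page when it has one,
      // otherwise at its first data page.
      const int64_t start = chunk->has_dictionary_page()
                                ? chunk->dictionary_page_offset()
                                : chunk->data_page_offset();
      ranges.push_back({start, chunk->total_compressed_size()});
    }
  }
  return ranges;
}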

> I don't understand what the "advantage" would be.  Can you elaborate?

As Wes said, empirically you can get more bandwidth out of S3 with
multiple concurrent HTTP requests. There is a cost to doing so
(establishing a new connection takes time), which is why the
coalescing groups small reads (to fully utilize one connection) and
splits large reads (to take advantage of multiple connections).
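
To illustrate the trade-off, here's the merge/split logic in
miniature - a simplified standalone sketch, not Antoine's actual
implementation, and the two limits are hypothetical tuning knobs:

#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges separated by small holes (one connection serves several
// small reads), then split anything oversized (several connections can
// serve one large read). Assumes range_size_limit > hole_size_limit > 0.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const auto& r : ranges) {
    const int64_t prev_end =
        merged.empty() ? 0 : merged.back().offset + merged.back().length;
    if (!merged.empty() && r.offset - prev_end <= hole_size_limit) {
      // The gap is small enough that one request covering both reads
      // is cheaper than establishing a second connection.
      merged.back().length =
          std::max(prev_end, r.offset + r.length) - merged.back().offset;
    } else {
      merged.push_back(r);
    }
  }
  std::vector<ReadRange> out;
  for (const auto& r : merged) {
    for (int64_t pos = 0; pos < r.length; pos += range_size_limit) {
      out.push_back(
          {r.offset + pos, std::min(range_size_limit, r.length - pos)});
    }
  }
  return out;
}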

I will file issues and link them to ARROW-7995. Since there was
interest in PARQUET-1698, hopefully breaking up the tasks will make it
easier for everyone involved to collaborate.

Thanks,
David

On 3/18/20, Wes McKinney <wesmck...@gmail.com> wrote:
> On Wed, Mar 18, 2020 at 11:42 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>>
>> Le 18/03/2020 à 17:36, David Li a écrit :
>> > Hi all,
>> >
>> > Thanks to Antoine for implementing the core read coalescing logic.
>> >
>> > We've taken a look at what else needs to be done to get this working,
>> > and it sounds like the following changes would be worthwhile,
>> > independent of the rest of the optimizations we discussed:
>> >
>> > - Add benchmarks of the current Parquet reader with the current S3File
>> > (and other file implementations) so we can track
>> > improvements/regressions
>>
>> Instead of S3, you can use the Slow streams and Slow filesystem
>> implementations.  It may better protect against varying external
>> conditions.
>>
>> > - Use the coalescing inside the Parquet reader (even without a column
>> > filter hint - this would subsume PARQUET-1698)
>>
>> I'm assuming this would be done at the RowGroupReader level, right?
>>
>> > - In coalescing, split large read ranges into smaller ones (this would
>> > further improve on PARQUET-1698 by taking advantage of parallel reads)
>>
>> I don't understand what the "advantage" would be.  Can you elaborate?
>
> Empirically it is known to S3 users that parallelizing reads improves
> throughput. I think it has to do with the way that Amazon's
> infrastructure works. That's why it's important that we set ourselves
> up to do performance testing in a realistic environment in AWS rather
> than simulating it.
>
>> Regards
>>
>> Antoine.
>
