> That's why it's important that we set ourselves up to do performance
> testing in a realistic environment in AWS rather than simulating it.
For my clarification, what are the plans for this (if any)? I couldn't
find any prior discussion, though it sounds like the discussion around
cloud CI capacity would be one step towards this.

In the short term, we could make the tests/benchmarks configurable to
point at a Minio instance so that individual developers can at least try
things.

Best,
David

On 3/18/20, David Li <li.david...@gmail.com> wrote:
> For us it applies to S3-like systems, not only S3 itself, at least.
>
> It does make sense to limit it to some filesystems. The behavior would
> be opt-in at the Parquet reader level, so at the Datasets or
> Filesystem layer we can take care of enabling the flag for filesystems
> where it actually helps.
>
> I've filed these issues:
> - ARROW-8151 to benchmark S3File+Parquet
>   (https://issues.apache.org/jira/browse/ARROW-8151)
> - ARROW-8152 to split large reads
>   (https://issues.apache.org/jira/browse/ARROW-8152)
> - PARQUET-1820 to use a column filter hint with coalescing
>   (https://issues.apache.org/jira/browse/PARQUET-1820)
>
> in addition to PARQUET-1698, which is just about pre-buffering the
> entire row group (which we can now do with ARROW-7995).
>
> Best,
> David
>
> On 3/18/20, Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 18/03/2020 at 18:30, David Li wrote:
>>>> Instead of S3, you can use the Slow streams and Slow filesystem
>>>> implementations. It may better protect against varying external
>>>> conditions.
>>>
>>> I think we'd want several different benchmarks - we want to ensure we
>>> don't regress local filesystem performance, and we also want to
>>> measure in an actual S3 environment. It would also be good to measure
>>> S3-compatible systems like Google's.
>>>
>>>>> - Use the coalescing inside the Parquet reader (even without a
>>>>>   column filter hint - this would subsume PARQUET-1698)
>>>>
>>>> I'm assuming this would be done at the RowGroupReader level, right?
>>>
>>> Ideally we'd be able to coalesce across row groups as well, though
>>> maybe it'd be easier to start with within-row-group-only (I need to
>>> familiarize myself with the reader more).
>>>
>>>> I don't understand what the "advantage" would be. Can you elaborate?
>>>
>>> As Wes said, empirically you can get more bandwidth out of S3 with
>>> multiple concurrent HTTP requests. There is a cost to doing so
>>> (establishing a new connection takes time), which is why the
>>> coalescing tries to group small reads (to fully utilize one
>>> connection) and split large reads (to be able to take advantage of
>>> multiple connections).
>>
>> If that's S3-specific (or even AWS-specific), it might be better done
>> inside the S3 filesystem. For other filesystems, I don't think it
>> makes sense to split reads.
>>
>> Regards
>>
>> Antoine.
>>
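The coalescing strategy discussed in the thread - group small reads that are close together so one connection is fully utilized, and split reads that are too large so multiple connections can fetch them in parallel - can be sketched roughly as below. This is an illustrative sketch, not Arrow's actual implementation; the function name and the `hole_size_limit`/`range_size_limit` parameters are assumptions chosen for clarity.

```python
def coalesce_ranges(ranges, hole_size_limit=8192, range_size_limit=16 * 2**20):
    """Merge nearby (offset, length) reads and split oversized ones.

    Reads separated by a gap of at most hole_size_limit bytes are merged
    into one request (reading a small hole is cheaper than a new
    connection); merged reads larger than range_size_limit are split so
    they can be issued as concurrent requests.
    """
    # First pass: merge reads whose gap is small enough to read through.
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= hole_size_limit:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    # Second pass: split any merged range larger than range_size_limit.
    result = []
    for offset, length in merged:
        while length > range_size_limit:
            result.append((offset, range_size_limit))
            offset += range_size_limit
            length -= range_size_limit
        result.append((offset, length))
    return result
```

For example, two 100-byte column-chunk reads separated by a 50-byte gap would coalesce into a single 250-byte request, while a large contiguous read would be chopped into `range_size_limit`-sized pieces to be fetched concurrently.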