> That's why it's important that we set ourselves up to do performance
> testing in a realistic environment in AWS rather than simulating it.
For my clarification, what are the plans for this (if any)? I couldn't
find any prior discussion, though it sounds like the discussion around
cloud CI capacity would be one step towards this.

In the short term, we could make the tests/benchmarks configurable to
point at a Minio instance so that individual developers can at least try
things.

Best,
David

On 3/18/20, David Li <li.david...@gmail.com> wrote:
> For us it applies to S3-like systems, not only S3 itself, at least.
>
> It does make sense to limit it to some filesystems. The behavior would
> be opt-in at the Parquet reader level, so at the Datasets or
> Filesystem layer we can take care of enabling the flag for filesystems
> where it actually helps.
>
> I've filed these issues:
> - ARROW-8151 to benchmark S3File+Parquet
>   (https://issues.apache.org/jira/browse/ARROW-8151)
> - ARROW-8152 to split large reads
>   (https://issues.apache.org/jira/browse/ARROW-8152)
> - PARQUET-1820 to use a column filter hint with coalescing
>   (https://issues.apache.org/jira/browse/PARQUET-1820)
>
> in addition to PARQUET-1698, which is just about pre-buffering the
> entire row group (which we can now do with ARROW-7995).
>
> Best,
> David
>
> On 3/18/20, Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 18/03/2020 at 18:30, David Li wrote:
>>>> Instead of S3, you can use the Slow streams and Slow filesystem
>>>> implementations. It may better protect against varying external
>>>> conditions.
>>>
>>> I think we'd want several different benchmarks - we want to ensure we
>>> don't regress local filesystem performance, and we also want to
>>> measure in an actual S3 environment. It would also be good to measure
>>> S3-compatible systems like Google's.
>>>
>>>>> - Use the coalescing inside the Parquet reader (even without a
>>>>>   column filter hint - this would subsume PARQUET-1698)
>>>>
>>>> I'm assuming this would be done at the RowGroupReader level, right?
>>>
>>> Ideally we'd be able to coalesce across row groups as well, though
>>> maybe it'd be easier to start with within-row-group-only (I need to
>>> familiarize myself with the reader more).
>>>
>>>> I don't understand what the "advantage" would be. Can you elaborate?
>>>
>>> As Wes said, empirically you can get more bandwidth out of S3 with
>>> multiple concurrent HTTP requests. There is a cost to doing so
>>> (establishing a new connection takes time), which is why the
>>> coalescing tries to group small reads (to fully utilize one
>>> connection) and split large reads (to be able to take advantage of
>>> multiple connections).
>>
>> If that's S3-specific (or even AWS-specific), it might be better done
>> inside the S3 filesystem. For other filesystems, I don't think it
>> makes sense to split reads.
>>
>> Regards
>>
>> Antoine.
>>
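The coalescing strategy discussed in the thread - group small reads that are close together so one connection is fully utilized, and split reads that are too large so multiple connections can fetch them in parallel - can be sketched roughly as below. This is an illustrative sketch, not Arrow's actual implementation; the function name and the `hole_size_limit`/`range_size_limit` parameters are assumptions chosen for clarity.

```python
def coalesce_ranges(ranges, hole_size_limit=8192, range_size_limit=16 * 2**20):
    """Merge nearby (offset, length) reads and split oversized ones.

    Reads separated by a gap of at most hole_size_limit bytes are merged
    into one request (reading a small hole is cheaper than a new
    connection); merged reads larger than range_size_limit are split so
    they can be issued as concurrent requests.
    """
    # First pass: merge reads whose gap is small enough to read through.
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= hole_size_limit:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    # Second pass: split any merged range larger than range_size_limit.
    result = []
    for offset, length in merged:
        while length > range_size_limit:
            result.append((offset, range_size_limit))
            offset += range_size_limit
            length -= range_size_limit
        result.append((offset, length))
    return result
```

For example, two 100-byte column-chunk reads separated by a 50-byte gap would coalesce into a single 250-byte request, while a large contiguous read would be chopped into `range_size_limit`-sized pieces to be fetched concurrently.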