BlakeOrth opened a new pull request, #18361:
URL: https://github.com/apache/datafusion/pull/18361
## Which issue does this PR close?
N/A -- This PR is a POC meant for discussion to inform decisions related to
- https://github.com/apache/datafusion/pull/18146
## Rationale for this change
This PR shares the code for a set of benchmarks so that the results can be reproduced and discussed.
## What changes are included in this PR?
This PR includes a set of benchmarks that exercise the
`pruned_partition_list` method in both its `original` form (current main) and
the `list_all` form (from the PR above), to allow us to make a more informed
decision on the potential path(s) forward related to #18146. A rough sketch of
the two strategies follows.
**The code in this PR is not intended for merge**.
Please generally avert your eyes from the benchmark code, because it comes
with an :rotating_light: :robot: AI-generated code :robot: :rotating_light:
warning. The robot made a bunch of _really_ silly decisions, and if we
actually wanted to introduce permanent benchmarks we'd likely want to pare down
the cases and rewrite them. At the moment I'm more interested in exploring the
results than nit-picking benchmark code; however, I did make sure the actual
timing loops were as tight as possible so the results are trustworthy
representations of both implementations.
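For anyone curious what "tight" means here, the measured region is essentially just the awaited listing call, along the lines of this sketch (criterion with the `async_tokio` feature; `list_files_under_test` is a stand-in name, not a real function in this PR):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use tokio::runtime::Runtime;

// Stand-in for "call pruned_partition_list and drain the resulting stream".
async fn list_files_under_test() -> usize {
    0
}

fn bench_listing(c: &mut Criterion) {
    let rt = Runtime::new().unwrap();
    // Setup (object store, test data, filters) lives outside the timing loop;
    // only the awaited future is measured.
    c.bench_function("pruned_partition_list/original", |b| {
        b.to_async(&rt).iter(list_files_under_test);
    });
}

criterion_group!(benches, bench_listing);
criterion_main!(benches);
```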
The benchmarks themselves include both an in-memory benchmark and an S3
benchmark that uses a local MinIO via `testcontainers`. The in-memory benches
are more-or-less what I started with, and at this point are mostly there
because they're academically interesting. The S3 benchmarks are necessary to
truly understand end-user performance, because commercial object stores page
list results at 1000 entries per page. To add an additional dose of realism,
the results included with this PR were run with a simulated round-trip latency
of 120ms applied to my localhost interface using `tc` on Linux. Each underlying
partition structure benchmark is run twice for
each implementation, once to collect all the results from the list operation,
and again to collect the time-to-first-result (TTFR) from the file stream.
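Concretely, the TTFR variant stops the clock at the first item off the stream, while the other variant drains the whole stream; roughly like this (the stream here is just a stand-in for the `PartitionedFile` stream the listing code returns):

```rust
use futures::StreamExt;
use std::time::Instant;

#[tokio::main]
async fn main() {
    // Stand-in for the file stream produced by the listing implementation.
    let mut files = futures::stream::iter(0..1_000);

    let start = Instant::now();
    let _first = files.next().await;           // time-to-first-result
    println!("TTFR:  {:?}", start.elapsed());

    let _rest: Vec<_> = files.collect().await; // "collect everything" variant
    println!("Total: {:?}", start.elapsed());
}
```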
In order to better facilitate discussion of the results, I have included both
the "raw" criterion results as text and a formatted table of the results as a
markdown doc that's a bit easier to read. The "raw" criterion results are
edited to remove some of the text noise (improve/regression output that was not
useful/accurate, warmup text, etc.) and have had separators added just to make
them a bit easier to navigate/digest. I think using in-line comments on the
various table entries in `s3_results_formatted.md` is probably the easiest way
to thread the discussion around the results, but I'm happy to facilitate other
approaches.
## Are these changes tested?
They are tests.
## Are there any user-facing changes?
No.
##
cc @alamb
I was initially planning on adding some comments with my own interpretation
of the benchmark results right after submitting this PR to start the
discussion, but in some sense I don't want to "poison the well" for additional
perspectives. If you'd like me to start the discussion/interpretation I'd be
happy to do so; just let me know and I'll add my current thoughts.
## Additional Notes:
If you want to try these locally with the simulated latency, you can use this
command (run as root):
```console
tc qdisc add dev lo root handle 1:0 netem delay 60msec
```
This adds 60ms of delay to each packet crossing the `lo` device, so a
request/response pair sees a 120ms round-trip latency.
Since you're unlikely to want latency on localhost forever, you can reset it:
```console
tc qdisc del dev lo root
```
I'm not sure whether this or similar functionality exists on macOS, and I don't
believe there is a Windows equivalent.