BlakeOrth opened a new pull request, #18361:
URL: https://github.com/apache/datafusion/pull/18361

   ## Which issue does this PR close?
   
   N/A -- This PR is a POC meant for discussion to inform decisions related to
    - https://github.com/apache/datafusion/pull/18146
   
   ## Rationale for this change
   
   This PR shares the code for a set of benchmarks so that the results can be reproduced and discussed.
   
   ## What changes are included in this PR?
   
   This PR includes a set of benchmarks that exercise the `pruned_partition_list` method in both the `original` (current main) and the `list_all` (PR) implementations, to allow us to make a more informed decision on the potential path(s) forward related to #18146.
   
   **This code is not intended for merge**. 
   
   Please generally avert your eyes from the benchmark code, because it comes with an :rotating_light: :robot: AI generated code :robot: :rotating_light: warning. There are a bunch of _really_ silly decisions the robot made, and if we actually wanted to introduce permanent benchmarks we'd likely want to pare down the cases and re-write them. At the moment I am more interested in exploring the results than nit-picking benchmark code; however, I did make sure the actual timing loops are as tight as possible so that the results are trustworthy representations of both implementations.
   
   The benchmarks include both an in-memory benchmark and an S3 benchmark that uses a local MinIO instance via `testcontainers`. The in-memory benches are more-or-less what I started with, and at this point they are mostly there because they're academically interesting. The S3 benchmarks are necessary to truly understand end-user performance for list operations, because list operations against commercial object stores are paged at 1000 results per page. To add an additional dose of realism, the results included with this PR were collected with a simulated latency of 120ms applied to my localhost interface using `tc` on Linux. Each underlying partition-structure benchmark is run twice for each implementation: once to collect all the results from the list operation, and again to collect the time-to-first-result (TTFR) from the file stream.
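
   For reference, here is a minimal sketch of what those two timing modes look like with criterion's async API. This is not the benchmark code in this PR: `list_partition_files` is a hypothetical stand-in for the `pruned_partition_list` call, and the sketch assumes `criterion` with the `async_tokio` feature plus the `tokio` and `futures` crates.
   ```rust
   use criterion::{criterion_group, criterion_main, Criterion};
   use futures::StreamExt;

   /// Hypothetical stand-in for `pruned_partition_list`; the real benchmarks
   /// list files from an in-memory object store or a local MinIO instance.
   async fn list_partition_files() -> impl futures::Stream<Item = String> + Unpin {
       futures::stream::iter(vec![
           "year=2024/month=01/data.parquet".to_string(),
           "year=2024/month=02/data.parquet".to_string(),
       ])
   }

   fn bench_listing(c: &mut Criterion) {
       let rt = tokio::runtime::Runtime::new().unwrap();

       // Mode 1: drain the entire stream (total time to list every file).
       c.bench_function("list_all_results", |b| {
           b.to_async(&rt).iter(|| async {
               let files: Vec<String> = list_partition_files().await.collect().await;
               files
           })
       });

       // Mode 2: stop after the first item (time-to-first-result, TTFR).
       c.bench_function("time_to_first_result", |b| {
           b.to_async(&rt)
               .iter(|| async { list_partition_files().await.next().await })
       });
   }

   criterion_group!(benches, bench_listing);
   criterion_main!(benches);
   ```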
   
   To better facilitate discussion of the results, I have included both the "raw" criterion results as text and a formatted table of the results as a markdown doc that's a bit easier to read. The "raw" criterion results have been edited to remove some text noise (improvement/regression output that was not useful or accurate, warmup text, etc.) and have had separators added to make them a bit easier to navigate and digest. I think using in-line comments on the various table entries in `s3_results_formatted.md` is probably the easiest way to thread the discussion around the results, but I'm happy to facilitate other options.
   
   ## Are these changes tested?
   
   The changes are benchmarks, so they are effectively tests themselves.
   
   ## Are there any user-facing changes?
   No.
   
   cc @alamb 
   I was initially planning on adding some comments with my own interpretations of the benchmark results right after submitting this PR to start the discussion, but in some sense I don't want to "poison the well" for additional perspectives. If you'd like me to start the discussion/interpretation I'd be happy to do so; just let me know and I can add my current thoughts.
   
   ## Additional Notes:
   If anyone wants to try these locally using the simulated latency, you can 
use this command (run as root):
   ```console
   tc qdisc add dev lo root handle 1:0 netem delay 60msec
   ```
   This adds a 60ms delay to every packet traversing the `lo` device; since both the request and the response cross `lo`, the result is roughly 120ms of round-trip latency.
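
   As a quick sanity check (my suggestion, not part of this PR), you can confirm the delay is in effect by pinging localhost; the reported round-trip times should be roughly 120ms:
   ```console
   ping -c 3 127.0.0.1
   ```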
   
   Since you're unlikely to want latency on localhost forever, you can reset it:
   ```console
   tc qdisc del dev lo root
   ```
   
   I'm not sure if this functionality (or something similar) exists on macOS, and I don't believe there is a Windows equivalent.

