alamb opened a new issue, #18115:
URL: https://github.com/apache/datafusion/issues/18115

   ### Is your feature request related to a problem or challenge?
   
   As @rluvaton says in 
https://github.com/apache/datafusion/pull/17979#issuecomment-3387458032, the 
current benchmarking situation in DataFusion is not ideal:
   
   > There is too much noise, making the benchmark results unreliable. Can 
your script also print the machine you are working on (like a1.metal or 
c5a.large), so it is easier to understand the properties of the machine it 
is running on?
   > 
   > Also, is your machine dedicated bare metal?
   
   We have many 
[benchmarks](https://github.com/apache/datafusion/blob/main/benchmarks/README.md)
 (`bench.sh`, `cargo bench`, etc.) that contributors can run. However, they 
suffer from two problems:
   1. **Not widely accessible**: each contributor must run the scripts on 
their own machine
   2. **Results are noisy**: the machines used are typically laptops or cloud 
VMs, which are noisy environments due to thermal throttling and/or other 
workloads on the same machine (see the tuning sketch after this list)
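   
   As a point of reference on the noise problem, run-to-run variance on a 
dedicated Linux host can be reduced with some standard tuning. Here is a 
minimal sketch, assuming an Intel CPU with the `intel_pstate` driver (the 
sysfs path and core range are assumptions, not part of any DataFusion 
script):
   
   ```shell
   # Pin the CPU frequency governor to "performance" so clock scaling
   # does not vary between benchmark runs
   sudo cpupower frequency-set --governor performance
   
   # Disable turbo boost so thermal throttling cannot skew long runs
   # (this path assumes the intel_pstate driver; other drivers differ)
   echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
   
   # Pin the benchmark to a fixed set of cores (0-7 is illustrative)
   # to reduce interference from the OS scheduler
   taskset -c 0-7 cargo bench
   ```
   
   Even with tuning like this, a shared cloud VM still competes with 
co-tenant workloads, which is why dedicated hardware is the more reliable 
fix.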
   
   For example, I have a Google Cloud VM that I use for benchmarks (scripts 
are in my 
[`datafusion-benchmarking`](https://github.com/alamb/datafusion-benchmarking?tab=readme-ov-file)
 repo), which I run on PRs at the author's request or as part of my own 
review. This setup suffers from the same problems of inaccessibility and 
noisy results.
   
   ### Describe the solution you'd like
   
   I would like some way for community members to run benchmarks that is:
   1. Automated (not manual) to request (see the sketch below)
   2. Run on relatively stable machines (e.g. bare metal or other similarly 
predictable hardware)
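   
   One possible shape for the "automated to request" part is a workflow that 
a maintainer (or bot) can trigger on demand and that is routed to a single 
dedicated runner. Here is a minimal sketch using the GitHub CLI; the 
workflow name `pr_benchmarks.yml` and its inputs are hypothetical, not an 
existing DataFusion workflow:
   
   ```shell
   # Hypothetical: ask CI to run the TPC-H benchmarks for a PR on a
   # dedicated, stable machine (workflow name and inputs are made up)
   gh workflow run pr_benchmarks.yml \
     --repo apache/datafusion \
     -f pr_number=17979 \
     -f bench=tpch
   ```
   
   Because every request would land on the same machine with the same 
configuration, results would be comparable across PRs.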
   
   ### Describe alternatives you've considered
   
   @rluvaton offers the following suggestion in 
https://github.com/apache/datafusion/pull/17979#issuecomment-3387467255:
   
   > it would be great if we could have something like Node.js has and just 
allow running specific benchmarks on dedicated machines
   >
   > it will make sure that all benchmarks across PRs run on the same 
machine with the same config, making the benchmark results more reliable
   >
   > Example of how Node.js does it:
   
   <img width="2546" height="1213" alt="image" 
src="https://github.com/user-attachments/assets/714c54bb-85cb-4b47-b85b-9aadf5dd2153" />
   
   
   ### Additional context
   
   Related to 
   - https://github.com/apache/datafusion/issues/5504

