On Wed, 28 Sep 2011 08:19:16 -0400, Christophe <trav...@phare.normalesup.org> 
wrote:
"Robert Jacques" , dans le message (digitalmars.D:145443), a écrit :
*sigh* By multiple runs, I mean multiple runs of the benchmark function, 
_including_ the loops. From my post last Tuesday:

     import std.algorithm : min;
     import std.datetime  : benchmark;

     double min_f1 = int.max;
     double min_f2 = int.max;
     foreach (i; 0 .. 10_000) {
         auto b = benchmark!(f1, f2)(3);  // 3 inner iterations per run
         min_f1 = min(min_f1, b[0].to!("seconds", double));
         min_f2 = min(min_f2, b[1].to!("seconds", double));
     }

All the looping that benchmark currently does serves only to lengthen the
total time being measured, which avoids sampling errors with respect to
the resolution of the performance counter itself. But you are still making
only a single, noisy measurement of a (meta)function. In order to get a
robust result, you need to make multiple measurements and take the
minimum.

Sorry to question you, but what makes you say that the minimum is more
interesting than, say, the mean plus the standard deviation? Is there
any paper supporting this?

I am not a benchmark specialist, but an average statistician who knows
that the minimum is not always very informative (a very low value can be
a stroke of luck). I am sure you may have good reasons to choose the
minimum, but I just want to make sure we make the right choice by
choosing to use the minimum of consecutive benchmarks.


The reason that the mean + std is so pervasive in science is that it's perfect 
for a Gaussian noise model. In most physical measurements, you expect some 
readings to fall both above and below the actual value, and for most processes 
we assume the error of these measurements is normally distributed. Importantly, 
when we know a process does not produce normally distributed error, we look to 
other statistics for meaning.

Now, for computer timing measurements, you have access to the physical number 
of clock cycles taken. There is some noise in this, but it's on the order of 
1-2 cycles (i.e. ns). Since a benchmark is trying to measure the number of 
cycles a task takes to complete, and it is physically impossible to complete 
that task in fewer than some fixed number of cycles, 'noise' in the measurement 
above 1-2 ns is by definition purely additive, due to OS or hardware effects 
(a context switch occurred, a page fault happened, you put your computer to 
sleep, etc.). So the best way to get to the true number of cycles a task takes 
to complete is to take the minimum recorded value, on the premise that on at 
least one run you got lucky and the effects you are not trying to measure had 
little to no impact.
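
If it helps to see why, here is a toy simulation of that additive-noise
model (all the numbers are made up for illustration): the mean gets
dragged upward by the rare, large outliers, while the minimum homes in
on the fixed cost.

     import std.algorithm : min;
     import std.random    : uniform;
     import std.stdio     : writeln;

     void main() {
         enum trueCycles = 1000.0;  // the fixed cost we want to recover
         enum runs = 10_000;
         double mean = 0, minimum = double.max;
         foreach (i; 0 .. runs) {
             // Noise only ever adds to the true cost: small counter
             // jitter most of the time, plus a rare large hit
             // (e.g. a context switch).
             double noise = uniform(0.0, 1.0) < 0.01
                 ? uniform(1_000.0, 100_000.0)
                 : uniform(0.0, 2.0);
             double t = trueCycles + noise;
             mean += t / runs;
             minimum = min(minimum, t);
         }
         writeln("mean:    ", mean);     // biased upward by the outliers
         writeln("minimum: ", minimum);  // ~= trueCycles
     }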

So that's the rationale. As for academic support, I've published peer-reviewed 
papers using this methodology, but I can't remember where I originally got the 
advice. Also, there are certain types of benchmarks for which the mean + std is 
the better statistic (networking comes to mind).

P.S. A little Google-fu turned up The Science of Computer Benchmarking by 
Roger W. Hockney, which does recommend using the minimum, particularly to get 
around the problem of time-slicing (i.e. context switching) by the OS.
