On Wed, 28 Sep 2011 08:19:16 -0400, Christophe <trav...@phare.normalesup.org>
wrote:
"Robert Jacques" , dans le message (digitalmars.D:145443), a écrit :
*sigh* By multiple runs, I mean multiple runs of the benchmark function,
_including_ the loops. From my post last Tuesday:
    double min_f1 = int.max;
    double min_f2 = int.max;
    foreach (i; 0 .. 10_000) {
        auto b = benchmark!(f1, f2)(3);
        min_f1 = min(min_f1, b[0].to!("seconds", double));
        min_f2 = min(min_f2, b[1].to!("seconds", double));
    }
All the looping that benchmark currently does is lengthen the total
time a function takes, which makes it possible to avoid sampling errors
in the performance counter itself. But you are still only making a
single, noisy measurement of a (meta)function. To get a robust result,
you need to make multiple measurements and take the minimum.
Sorry to question you, but what makes you say that the minimum is more
interesting than, say, the mean plus the standard deviation? Is there
any paper supporting this?
I am not a benchmark specialist, but an average statistician who knows
that the minimum is not always very informative (a very low value can be
a piece of luck). I am sure you may have good reasons to choose the
minimum, but I just want to make sure we make the right choice by
choosing to use the minimum of consecutive benchmarks.
The reason that the mean + std is so pervasive in science is that it's perfect
for a Gaussian noise model. In most physical measurements, you expect some
readings to be both above and below the actual value and for most processes we
assume the error of these measurements is normally distributed. Importantly,
when we know the process does not produce normally distributed error, we look
to other statistics for meaning. Now, for computer timing measurements, you have
access to the physical number of clock cycles taken. There is some noise in
this, but it's on the order of 1-2 cycles (i.e. ns). As a benchmark is trying
to measure the number of cycles something takes to complete, and it is
physically impossible to complete that task in less than a fixed number of
cycles, 'noise' in the measurement above 1-2 ns is by definition purely
additive due to OS or hardware effects (i.e. a context switch occurred, a page
fault happened, you put your computer to sleep, etc.). So the best way to get
at the true number of cycles a task takes to complete is to take the minimum
recorded value, on the premise that on that run you did get lucky and those
other effects, which you are not trying to measure, had little to no impact.
So that's the rationale. As for academic support, I've published peer-reviewed
papers using this methodology, but I can't remember where I originally got the
advice. Also, there are certain types of benchmarks for which the mean + std is
better (networking comes to mind), etc.
P.S. A little Google-fu turned up "The Science of Computer Benchmarking" by
Roger W. Hockney, which does recommend using the minimum, particularly to get
around the problem of time-slicing (i.e. context switching) by the OS.