On Mon, 26 Sep 2011 09:39:28 -0400, Andrei Alexandrescu 
<seewebsiteforem...@erdani.org> wrote:

On 9/25/11 11:56 PM, Robert Jacques wrote:
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> wrote:
So, std.benchmark should
[ ] Set the affinity to a single core at start (i.e. use
SetProcessAffinityMask, etc.)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache, which, at
least in my eyes, invalidates that particular timing run.

Well, it will be one of many runs. The questions are (a) whether core jumps
are likely to happen during a benchmark run, and (b) whether enough of them
will happen to influence a 500ms time window.

I don't know exactly how likely core jumps are, but context switches
definitely happen. And yes, in my experience, they influence a 50 ms time
window. Remember, context switches have two outcomes: your thread continues
running or your thread is paused. But the timer doesn't pause when your thread
pauses. So all you need is one bad switch to see an effect.
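
As a minimal illustration (a sketch only, written against the std.datetime
API of the time; the sleep merely stands in for an involuntary context
switch):

    import core.thread : Thread;
    import core.time : dur;
    import std.datetime : StopWatch;

    void main()
    {
        StopWatch sw;
        sw.start();
        // Stand-in for an involuntary context switch: the thread does no
        // work, but the wall-clock timer keeps running.
        Thread.sleep(dur!"msecs"(10));
        sw.stop();
        // Reports roughly 10 ms even though no useful work was done.
        assert(sw.peek().to!("msecs", long) >= 10);
    }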

Second, timing generally relies on the CPU's Time Stamp Counter, which is not
multi-thread safe; a core switch invalidates all previous TSC values, and
hence the time measurement itself. Furthermore, on some CPUs the TSC is not
even guaranteed to have a fixed frequency. Now there are ways around the
problems of the TSC, but even so:

(From Wikipedia:)
"Under Windows platforms, Microsoft strongly discourages using the TSC
for high-resolution timing for exactly these reasons, providing instead
the Windows APIs QueryPerformanceCounter and
QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and
clock_gettime/gettimeofday on Unix.

Great, but MS still recommends benchmarking be done on a single core. And if MS 
thinks that is how benchmarking should be done, I think that's how we should do 
it.

From MSDN documentation on QueryPerformanceCounter:
"On a multiprocessor computer, it should not matter which processor is called. 
However, you can get different results on different processors due to bugs in the basic 
input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor 
affinity for a thread, use the SetThreadAffinityMask function."
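
In D that might look something like the following (a hedged sketch, assuming
the usual Win32 bindings in core.sys.windows.windows; pinToFirstCore is just
a hypothetical helper name):

    version (Windows)
    {
        import core.sys.windows.windows;

        void pinToFirstCore()
        {
            // Affinity mask with only bit 0 set: the current thread may run
            // on CPU 0 only.
            if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0)
                throw new Exception("SetThreadAffinityMask failed");
        }
    }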

[ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the time spent in
the iteration proper, and divides the total time by the number of iterations
to figure the time per iteration. This has the advantage that it works even
with very fast functions, without letting the counter itself affect the
timing.
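
A rough sketch of that scheme (not std.benchmark's actual code;
timePerIteration is a hypothetical helper using std.datetime's StopWatch, and
a real implementation would also have to keep the compiler from optimizing
away the empty calibration loop):

    import std.datetime : StopWatch;

    double timePerIteration(alias fun)(uint n)
    {
        StopWatch sw;
        sw.start();
        foreach (i; 0 .. n) fun();
        sw.stop();
        auto total = sw.peek().to!("seconds", double);

        // Time an empty loop of the same length to estimate the iteration
        // overhead, then discount it.
        sw.reset();
        sw.start();
        foreach (i; 0 .. n) { }
        sw.stop();
        auto overhead = sw.peek().to!("seconds", double);

        return (total - overhead) / n;
    }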

Which is necessary but not sufficient for proper benchmarking. The timers
themselves are noisy, to say nothing of the effects of context switches,
warm-up, etc., on a run. During the ptr+length vs ptr+ptr discussion on
Tuesday, the naive use of benchmark led to some very wrong conclusions simply
because any single run had up to ~5% error. The only way to get rid of this
error is to make multiple runs.

You may want to take a look at the code. This is a long argument ending with
"...make multiple runs", as if std.benchmark were not doing them and you were
suggesting it should. Again, std.benchmark runs the tested function in a loop
in powers of 10 until the entire run lasts at least 500ms.

*sigh* By multiple runs, I mean multiple runs of the benchmark function, 
_including_ the loops. From my post last Tuesday:

    import std.algorithm : min;
    import std.datetime : benchmark;

    // f1 and f2 are the two functions under comparison.
    double min_f1 = int.max;
    double min_f2 = int.max;
    // Repeat the whole measurement many times and keep only the fastest run,
    // so that runs inflated by context switches are discarded.
    foreach (i; 0 .. 10_000) {
        auto b = benchmark!(f1, f2)(3);
        min_f1 = min(min_f1, b[0].to!("seconds", double));
        min_f2 = min(min_f2, b[1].to!("seconds", double));
    }

All the looping that benchmark currently does is to lengthen the total time a 
function takes. This makes it possible to avoid sampling errors with regard to 
the performance counter itself. But you are still only making a single, noisy 
measurement of a (meta)function. In order to get a robust result, you need to 
make multiple measurements and take the minimum.
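
For example, something along these lines (a hypothetical benchmarkMin helper,
sketched against the same std.datetime.benchmark API as above, that repeats
the whole measurement and keeps each function's minimum):

    import std.algorithm : min;
    import std.datetime : benchmark;

    // Run the whole benchmark `trials` times and keep each function's best
    // (minimum) time, discarding runs inflated by context switches and the
    // like.
    double[fun.length] benchmarkMin(fun...)(uint iterations, uint trials)
    {
        double[fun.length] best = double.max;
        foreach (t; 0 .. trials)
        {
            auto r = benchmark!fun(iterations);
            foreach (i, rd; r)
                best[i] = min(best[i], rd.to!("seconds", double));
        }
        return best;
    }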

[X] Adaptively increase the number of loop iterations to ensure a valid
reading.

The current system tries 10, 100, 1000 iterations until it gets to a total
run time of at least 0.5 seconds. That's quite a generous margin; I plan to
tighten it.
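
A rough sketch of that adaptive scheme (the idea only, not the actual
std.benchmark source; findIterationCount is a hypothetical name and
std.datetime's StopWatch is assumed):

    import std.datetime : StopWatch;

    // Grow the iteration count by powers of 10 until a whole run takes at
    // least targetSeconds of wall-clock time.
    uint findIterationCount(alias fun)(double targetSeconds = 0.5)
    {
        for (uint n = 10; ; n *= 10)
        {
            StopWatch sw;
            sw.start();
            foreach (i; 0 .. n) fun();
            sw.stop();
            if (sw.peek().to!("seconds", double) >= targetSeconds)
                return n;
        }
    }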

[ ] Adaptively decrease the number of loop iterations to ensure minimal
context switching.

OK.

In fact, upon further thinking, the iterations should not be too few. Context
switches and other activity are inevitable, but the effects are negligible
over long periods (as half a second would be).

To conclude, I'm unsure what steps you suggest to take to improve
std.benchmark, and I'd be grateful if you could clarify them after
perusing its code and documentation.


Thanks,

Andrei
