Hi,

I was going to try to optimize vhash-assoc, but I wanted a good benchmark first, so I started to look at our benchmark suite. We have some issues to deal with.
For those of you who are not familiar with the benchmark suite, we have a bunch of benchmarks in benchmark-suite/benchmarks/: the files that end in ".bm". The format of a .bm file is like that of our .test files, except that instead of `pass-if' and the like, we have `benchmark'. You run the benchmarks via ./benchmark-guile in $top_builddir. The framework tries to be appropriate for microbenchmarks, in that the `benchmark' form includes a suggested number of iterations. Ideally, when you create a benchmark, you give it an iteration count that makes it run for approximately as long as the other benchmarks. (Sketch 1, appended after my signature, shows the shape of a .bm file.)

When the benchmarking suite was first made, 10 years ago, there was an empty "reference" benchmark that was calibrated to run for approximately 1 second. Currently it runs in 0.012 seconds. This is the first problem: the suite as a whole has stale iteration counts. There is a facility for scaling the iteration counts of the suite as a whole, but it is unused. Another problem is that the actual runtime of the various benchmarks varies quite a lot, from 3.3 seconds for assoc (srfi-1) down to 0.012 seconds for if.bm. Short runtimes magnify measurement imprecision. It used to be that the measurement function was "times", but I just changed that to the higher-precision get-internal-real-time / get-internal-run-time. Still, though, there is nothing you can do for a benchmark that runs in a few milliseconds or less.

Another big problem is that some effect-free microbenchmarks are optimized away entirely. For example, the computations in arithmetic.bm fold completely at compile time, and the same goes for if.bm. These benchmarks do not measure anything useful. (Sketch 2 illustrates the failure mode.)

The benchmarking suite attempts to compensate for harness overhead by reporting "core time": the time taken to run a benchmark, minus the time taken to run an empty benchmark with the same number of iterations. The benchmark body is compiled as a thunk, and the framework calls that thunk repeatedly (sketch 3). In theory this sounds good. In practice, however, for high-iteration microbenchmarks the overhead of the thunk call outweighs whatever micro-operation is being measured. For what it's worth, the current overhead appears to be about 35 nanoseconds per iteration, on my laptop. If we inlined the iteration loop into the benchmark body itself, rather than calling a thunk repeatedly, we could bring that down to around 13 nanoseconds (sketch 4). However, it's probably best to leave things as they are, because an inlined loop is liable to be optimized out.

So, those are the problems: benchmarks running for inappropriately short and inconsistent durations; benchmarks that do not measure anything useful; and benchmark bodies being optimized away.

My proposal is to rebase the iteration count in 0-reference.bm so that it runs for 0.5 seconds on some modern machine, to adjust all the other benchmarks to match, and to remove those benchmarks that do not measure anything useful. Finally, we should perhaps enable automatic scaling of the iteration counts. What do folks think about that? On the positive side, every benchmark result is clearly reported as a time for a given number of iterations, so this change should not affect users who measure time per iteration.

Regards,

Andy
--
http://wingolog.org/
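Appendix: the sketches referenced above. Caveat lector: these are illustrative sketches written for this mail, not excerpts from the suite; the module names, benchmark names, variable names, and iteration counts are all invented.

Sketch 1: the shape of a .bm file, assuming the `benchmark' and `with-benchmark-prefix' forms exported by (benchmark-suite lib):

    ;;; my-example.bm --- hypothetical example benchmark  -*- Scheme -*-

    (define-module (benchmarks my-example)
      #:use-module (benchmark-suite lib))

    ;; Shared setup runs once, outside the timed bodies.
    (define alist (map (lambda (x) (cons x x)) (iota 10)))

    (with-benchmark-prefix "my-example"

      ;; The second argument to `benchmark' is the suggested iteration
      ;; count; ideally it is tuned so that this benchmark runs about
      ;; as long as the others in the suite.
      (benchmark "assq on a ten-element alist" 500000
        (assq 9 alist)))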
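Sketch 2: the folding problem. A body built only from constants folds at compile time, so the compiled thunk does no work at run time. Going through top-level variables (names invented here) is one way to keep the computation live, because top-level bindings are mutable and the compiler cannot fold through them:

    ;; Folds to the constant 3 at compile time: the benchmark measures
    ;; nothing but harness overhead.
    (benchmark "fixnum add (folds away)" 1000000
      (+ 1 2))

    ;; The compiler cannot fold through mutable top-level bindings, so
    ;; this version really does perform an addition per iteration.
    (define one 1)
    (define two 2)
    (benchmark "fixnum add (not folded)" 1000000
      (+ one two))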
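Sketch 3: the measurement strategy, reduced to its essentials. This is a hand-written approximation of what the framework does, not the actual code in benchmark-suite/lib.scm; `measure' and `core-time' are names I made up:

    ;; Call THUNK ITERATIONS times and return the elapsed run time, in
    ;; internal time units.
    (define (measure thunk iterations)
      (let ((start (get-internal-run-time)))
        (let lp ((i 0))
          (if (< i iterations)
              (begin (thunk) (lp (1+ i)))))
        (- (get-internal-run-time) start)))

    ;; "Core time": subtract the cost of the same loop running an empty
    ;; thunk, to compensate for the per-iteration call overhead.
    (define (core-time thunk iterations)
      (- (measure thunk iterations)
         (measure noop iterations)))

(Divide by internal-time-units-per-second to get seconds.)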
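Sketch 4: what inlining the loop would look like. A hypothetical macro, again invented for this mail; it expands the loop around the body instead of calling a thunk per iteration, which is where the ~13ns figure above comes from. The catch is visible in the expansion: if the body is effect-free, the whole loop becomes a candidate for deletion by the optimizer.

    ;; Expand the iteration loop directly around BODY rather than
    ;; calling a thunk per iteration.
    (define-syntax measure-inline
      (syntax-rules ()
        ((_ iterations body ...)
         (let ((start (get-internal-run-time)))
           (let lp ((i 0))
             (if (< i iterations)
                 (begin
                   body ...
                   (lp (1+ i)))))
           ;; Elapsed run time, in internal time units.
           (- (get-internal-run-time) start)))))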