On 9/21/12 2:49 PM, David Piepgrass wrote:
After extensive tests with a variety of aggregate functions, I can say
firmly that taking the minimum time is by far the best when it comes
to assessing the speed of a function.

Like others, I must also disagree in principle. The minimum sounds like a
useful metric for functions that (1) do the same amount of work in every
test and (2) are microbenchmarks, i.e. they measure a small and simple
task.

That is correct.

If the benchmark being measured either (1) varies the amount of
work each time (e.g. according to some approximation of real-world
input, which obviously may vary)* or (2) measures a large system, then
the average, standard deviation, and even a histogram may be useful
(or perhaps some indicator of whether the runtimes are consistent with
a normal distribution or not). If the running time is long, then the
max might be useful (because things like task-switching overhead
probably do not contribute that much to the total).

* I anticipate that you might respond "so, only test a single input per
benchmark", but if I've got 1000 inputs that I want to try, I really
don't want to write 1000 functions, nor do I want 1000 lines of output
from the benchmark. An average, standard deviation, min and max may be
all I need, and if I need more detail, then I might break it up into 10
groups of 100 inputs. In any case, the minimum runtime is not the
desired output when the input varies.
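For instance, a summary along these lines would be all I need (a rough
sketch; reportStats is a hypothetical helper, not part of std.benchmark):

    import std.math, std.stdio;

    // Summarize per-input runtimes (in, say, microseconds). The point is
    // that the minimum alone is not the desired output when the input
    // varies from run to run.
    void reportStats(in double[] runtimes)
    {
        double lo = runtimes[0], hi = runtimes[0], sum = 0;
        foreach (t; runtimes)
        {
            if (t < lo) lo = t;
            if (t > hi) hi = t;
            sum += t;
        }
        immutable mean = sum / runtimes.length;
        double devSum = 0;
        foreach (t; runtimes) devSum += (t - mean) ^^ 2;
        writefln("min=%s  max=%s  mean=%s  stddev=%s",
                 lo, hi, mean, sqrt(devSum / runtimes.length));
    }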

I understand. What we currently do at Facebook is support benchmark functions with two parameters (see https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md). One is the number of iterations, the second is "problem size", akin to what you're discussing.

I chose not to support that in this version of std.benchmark because it can easily be tackled later, but I probably need to add it now, sigh.
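To sketch the shape of it (the names are illustrative and the exact signature is still open), a benchmark function would take the iterations count plus a problem size:

    // Sketch only. The name prefix follows the proposal's convention; the
    // second parameter ("problem size") is the part this version of
    // std.benchmark does not support yet.
    void benchmark_appendToArray(uint n, uint problemSize)
    {
        foreach (i; 0 .. n)                 // n: iterations chosen by the framework
        {
            int[] a;
            foreach (j; 0 .. problemSize)   // problemSize: scale of the input
                a ~= 1;
        }
    }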

It's a little surprising to hear "The purpose of std.benchmark is not to
estimate real-world time. (That is the purpose of profiling)"...
Firstly, of COURSE I would want to estimate real-world time with some of
my benchmarks. For some benchmarks I just want to know which of two or
three approaches is faster, or to get a coarse ball-park sense of
performance, but for others I really want to know the wall-clock time
used for realistic inputs.

I would contend that a benchmark without a baseline is very often misguided. I've seen tons and tons and TONS of nonsensical benchmarks lacking a baseline. "I created one million smart pointers, it took me only one millisecond!" Well, how long did it take you to create one million dumb pointers?

Choosing good baselines and committing to good comparisons instead of baseless absolutes is what makes the difference between a professional and a well-intentioned dilettante.
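With the proposed API such a comparison reads roughly like this (a sketch; the bodies are only illustrative, and the relative function is reported against the plain benchmark that precedes it):

    import std.typecons : RefCounted;

    // Baseline: the "dumb pointer" case.
    void benchmark_rawPointer(uint n)
    {
        foreach (i; 0 .. n)
        {
            auto p = new int;
            *p = cast(int) i;   // touch the memory so the work isn't elided
        }
    }

    // Reported relative to the baseline above via the benchmark_relative_
    // prefix.
    void benchmark_relative_smartPointer(uint n)
    {
        foreach (i; 0 .. n)
        {
            auto p = RefCounted!int(cast(int) i);
        }
    }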

Secondly, what D profiler actually helps you answer the question "where
does the time go in the real world?" The D -profile switch creates an
instrumented executable, which in my experience (admittedly not
experience with DMD) severely distorts running times. I usually prefer
sampling-based profiling, where the executable is left unchanged and a
sampling program interrupts the program at random and grabs the call
stack, to avoid the distortion effect of instrumentation. Of course,
instrumentation is useful to find out what functions are called the most
and whether call frequencies are in line with expectations, but I
wouldn't trust the time measurements that much.

As far as I know, D doesn't offer a sampling profiler, so one might
indeed use a benchmarking library as a (poor) substitute. So I'd want to
be able to set up some benchmarks that operate on realistic data, with
perhaps different data in different runs in order to learn about how the
speed varies with different inputs (if it varies a lot, then I might
create more benchmarks to investigate which inputs are processed
quickly and which slowly).

I understand there's a good case to be made for profiling. If this turns out to be an acceptance condition for std.benchmark (which I think it shouldn't), I'll define one.

Some random comments about std.benchmark based on its documentation:

- It is very strange that the documentation of printBenchmarks uses
neither the word "average" nor "minimum", and doesn't say how many
trials are done...

Because all of those are irrelevant and confusing. We had an older framework at Facebook that reported those numbers, and they were utterly and completely meaningless. Besides, the trials column contained numbers that were not even comparable. Everybody was happy when I removed them in favor of today's simple and elegant numbers.
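The gist of what does get reported is simply this (schematic only, not the actual implementation):

    import std.datetime;

    // Run the benchmark function several times and keep the smallest
    // per-iteration time. Noise (task switches, cache misses) only ever
    // adds time, so for fixed-work microbenchmarks the minimum is the
    // cleanest number to report.
    double bestTimePerIteration(void function(uint) fun, uint n, uint trials)
    {
        double best = double.max;
        foreach (t; 0 .. trials)
        {
            StopWatch sw;
            sw.start();
            fun(n);
            sw.stop();
            immutable perIteration = cast(double) sw.peek().usecs / n;
            if (perIteration < best) best = perIteration;
        }
        return best;
    }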

I suppose the obvious interpretation is that it only
does one trial, but then we wouldn't be having this discussion about
averages and minimums, right? Øivind says tests are run 1000 times... but
it needs to be configurable per test (my idea: support a _x1000 suffix
in function names, or _for1000ms to run the test for at least 1000
milliseconds; and allow a multiplier when running a group of
benchmarks, e.g. a multiplier argument of 0.5 means to only run half as
many trials as usual).

I don't think that's a good idea.

Also, it is not clear from the documentation what
the single parameter to each benchmark is (define "iterations count").

The documentation could include that, but I don't want to overspecify.

- The "benchmark_relative_" feature looks quite useful. I'm also happy
to see benchmarkSuspend() and benchmarkResume(), though
benchmarkSuspend() seems redundant in most cases: I'd like to just call
one function, say, benchmarkStart() to indicate "setup complete, please
start measuring time now."

Good point. Still, I think this is a minor encumbrance, so it's worth keeping the generality.
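For the record, the intended pattern is short, assuming benchmarkSuspend and benchmarkResume are used as paired calls (the body below is only illustrative):

    import std.algorithm : sort;
    import std.random : uniform;

    // Everything between benchmarkSuspend() and benchmarkResume() is
    // excluded from the measurement.
    void benchmark_sortRandomArray(uint n)
    {
        foreach (i; 0 .. n)
        {
            benchmarkSuspend();
            auto a = new int[1024];
            foreach (ref x; a) x = uniform(0, 1024);   // setup: not timed
            benchmarkResume();

            sort(a);                                   // the measured work
        }
    }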

- I'm glad that StopWatch can auto-start, but the documentation should
be clearer: does reset() stop the timer or just reset the time to zero?
Does stop() followed by start() resume from zero, or does it keep the
time on the clock? I also think there should be a method that returns
the value of peek() and restarts the timer at the same time (perhaps
stop() and reset() should just return peek()?).
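Concretely, the "lap" operation I keep wanting is something like this (a
hypothetical helper built on peek() and reset(), not an existing API):

    import std.datetime;

    // Return the elapsed time and restart the timer in one step. Whether
    // reset() also stops the watch is exactly what the docs should spell
    // out.
    TickDuration lap(ref StopWatch sw)
    {
        auto elapsed = sw.peek();
        sw.reset();
        return elapsed;
    }

    // Usage sketch:
    //     auto sw = StopWatch(AutoStart.yes);
    //     ... phase 1 ...
    //     auto t1 = lap(sw);
    //     ... phase 2 ...
    //     auto t2 = lap(sw);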

- After reading the documentation of comparingBenchmark and measureTime,
I have almost no idea what they do.

Yah, these are moved over from std.datetime. I'll need to make a couple more passes through the dox.


Andrei
