Re: nofib comparisons between 7.0.4, 7.4.2, 7.6.1, and 7.6.2

Simon Marlow Wed, 06 Feb 2013 12:50:32 -0800

On 06/02/13 16:04, Johan Tibell wrote:

On Wed, Feb 6, 2013 at 2:09 AM, Simon Marlow <[email protected]> wrote:


    This is slightly off topic, but I wanted to plant this thought in
    people's brains: we shouldn't place much significance in the average
    of a bunch of benchmarks (even the geometric mean), because it
    assumes that the benchmarks have a sensible distribution, and we
    have no reason to expect that to be the case.  For example, in the
    results above, we wouldn't expect a 14.7% reduction in runtime to be
    seen in a typical program.

    Using the median might be slightly more useful, which here would be
    something around 0% for runtime, though still technically dodgy.
    When I get around to it I'll modify nofib-analyse to report
    medians instead of GMs.


Using the geometric mean as a way to summarize the results isn't that
bad. See "How not to lie with statistics: the correct way to summarize
benchmark results"
(http://ece.uprm.edu/~nayda/Courses/Icom6115F06/Papers/paper4.pdf).

Yes - our current usage of GM is because we read that paper :) I've
reported GMs of nofib programs in several papers. I'm not saying the
paper is wrong - the GM is definitely more correct than the AM for
averaging normalised results.
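
A small illustration of why the GM is the safer choice for normalised
results. The runtimes below are made up for the example (they are not
nofib data), and the helper names are hypothetical. The AM of the ratios
reports a slowdown whichever compiler is used as the baseline, while the
GM gives a consistent answer in both directions:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean of the logs; assumes every x is strictly positive
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical runtimes in seconds for two programs under compilers A and B.
runtime_a = [1.0, 2.0]
runtime_b = [2.0, 1.0]

# Normalise each way round.
b_over_a = [b / a for a, b in zip(runtime_a, runtime_b)]  # [2.0, 0.5]
a_over_b = [a / b for a, b in zip(runtime_a, runtime_b)]  # [0.5, 2.0]

am_ba = arithmetic_mean(b_over_a)  # claims B is 25% slower than A
am_ab = arithmetic_mean(a_over_b)  # also claims A is 25% slower than B
gm_ba = geometric_mean(b_over_a)   # no overall change
gm_ab = geometric_mean(a_over_b)   # no overall change
```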

The problem is that we're attributing equal weight to all of our
benchmarks, without any reason to expect that they are representative.
We collect as many benchmarks as we can and hope they are
representative, but in fact it's rarely the case: often a particular
optimisation or regression will hit just one or two benchmarks. So all
I'm saying is that we shouldn't expect the GM to be representative.
Often there's no sensible mean at all - saying "some programs get a lot
better but most don't change" is far more informative than "on average
programs got faster by 1.2%".
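
To make that concrete, here is a sketch with invented ratios (new runtime
divided by old runtime; again not real nofib numbers). One benchmark out
of ten benefits hugely from an optimisation and the rest are unchanged.
The GM suggests a double-digit improvement, while the median says that a
typical program sees no change:

```python
import math
import statistics

# Hypothetical normalised runtimes: nine programs unchanged, one 5x faster.
ratios = [1.0] * 9 + [0.2]

gm = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
median = statistics.median(ratios)

gm_change_pct = (gm - 1.0) * 100.0          # roughly -15%
median_change_pct = (median - 1.0) * 100.0  # 0%

print(f"GM: {gm_change_pct:+.1f}%  median: {median_change_pct:+.1f}%")
```

Neither number on its own tells you that exactly one program changed,
which is the point about there often being no sensible mean at all.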

That being said, I think the most useful thing to do is to look at the
big losers, as they're often regressions. Making some class of programs
much worse but improving the geometric mean overall is often worse
than changing nothing at all.


Absolutely.
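
A minimal sketch of what "look at the big losers" could mean in practice.
This is not how nofib-analyse works; the function name, the 5% threshold
and the program names are all assumptions made for the example:

```python
def biggest_losers(ratios, threshold=1.05):
    """Return (program, ratio) pairs that got slower by more than the
    threshold, worst first. `ratios` maps a program name to
    new_runtime / old_runtime."""
    losers = [(name, r) for name, r in ratios.items() if r > threshold]
    return sorted(losers, key=lambda pair: pair[1], reverse=True)

# Hypothetical results: most programs unchanged, two regressions.
example = {"prog_a": 1.1, "prog_b": 0.7, "prog_c": 1.4, "prog_d": 1.0}
losers = biggest_losers(example)
```

Here the GM of the four ratios is just above 1.0, which hides a 40%
regression in one program.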

Cheers,
        Simon


_______________________________________________
ghc-devs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/ghc-devs
