Joachim Breitner wrote:
This runs on a dedicated physical machine, and still the run-time
numbers were varying too widely and gave us many false warnings (and
probably reported many false improvements which we of course were happy
to believe). I have since switched to measuring only dynamic
instruction counts with valgrind. This means that we cannot detect
improvements or regressions due to certain low-level stuff, but we gain
the ability to reliably measure *something* that we expect to change
when we improve (or accidentally worsen) the high-level
transformations.
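
(For reference, that instruction-counting approach comes down to running each benchmark under valgrind's cachegrind tool and reading the total instruction count from its summary; a minimal sketch, where ./SomeBenchmark and its arguments stand in for whatever program is being measured:

    valgrind --tool=cachegrind ./SomeBenchmark <args>
    # the "I refs:" line of the summary is the dynamic instruction count

Unlike wall-clock time, that count is largely insensitive to frequency scaling and background load.)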
While the run-time noise you describe matches my experience with the default settings, I had good results by tuning the number of measurements nofib takes. With a high NoFibRuns value (30+), frequency scaling disabled, background tasks stopped, and walking away from the computer until it was done, I got the noise down to about +/-0.2% between subsequent runs (rough sketch below).

This doesn't eliminate alignment bias and the like but at least it gives fairly reproducible results.
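
Roughly, the setup I used looks like this; the cpupower call assumes a Linux machine with the cpupower tool installed, and nofib-log-old is a log kept from an earlier baseline run:

    # pin the frequency governor so frequency scaling doesn't add noise
    sudo cpupower frequency-set -g performance
    # run the suite with more iterations per benchmark than the default
    make clean && make boot && make NoFibRuns=30 2>&1 | tee nofib-log-new
    # compare against the baseline log
    nofib-analyse nofib-log-old nofib-log-new

With 30 iterations per benchmark, the occasional outlier caused by background activity matters far less in the reported numbers.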

Sven Panne wrote:
4% is far from being "big"; look e.g. at https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues, where changing just the alignment of the code led to a 10% difference. :-/ Neither the code itself nor its layout was changed at all. The "Producing Wrong Data Without Doing Anything Obviously Wrong!" paper gives more fun examples of this kind.

I'm not saying that code layout has no impact, quite the opposite. The main point is: do we really have benchmarking machinery in place which can tell you whether you've improved the real run time or made it worse? I doubt that, at least at the scale of a few percent. To reach even that simple yes/no conclusion, you would need quite heavy machinery involving randomized linking order, varying environments (in the sense of "number and contents of environment variables"), various CPU models, etc. If you don't do that, modern hardware will leave you with a lot of "WTF?!" moments and wrong conclusions.
You raise good points. The example in the blog seems a bit contrived, with the whole loop fitting in a cache line, but the principle is a real concern. I've hit alignment issues and WTF moments plenty of times in the past when looking at microbenchmarks.
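
To make the environment-variable point concrete: that particular bias can be poked at with nothing fancier than re-running the same binary with environments of different sizes, along these lines (PAD and ./bench are just placeholders for an arbitrary variable and the benchmark binary):

    # a larger environment shifts the initial stack layout the program starts with
    for n in 0 256 1024 4096; do
        env PAD=$(printf '%*s' "$n" '' | tr ' ' 'x') ./bench
    done

If the measured times move noticeably across those runs, the harness is picking up layout effects rather than the optimisation under test.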

However, on the scale of nofib I haven't really seen this happen so far. It's good to be aware that a whole suite can give wrong results, though.
I wonder if this effect is limited by GHC's tendency to use 8-byte alignment for all code (at least with tables-next-to-code)? If only 16-byte (DSB buffer) and 32-byte (cache line) boundaries are relevant, that reduces the possibilities by a lot after all: with 8-byte alignment there are only two possible offsets within a 16-byte window and four within a 32-byte one.

In the particular example I've hit, however, it's pretty obvious that alignment is not the issue (and I verified that anyway). In the end, how big the impact of a better layout would be in general is hard to quantify. Hence the question whether anyone has pointers to good literature that looks into this.

Cheers
Andreas

