On Mon, 28 Nov 2005, Joe Buck wrote:
On Mon, 28 Nov 2005, Mark Mitchell wrote:
We're collectively putting a lot of energy into performance
improvements in GCC. Sometimes, a performance gain from one patch gets
undone by another patch -- which is itself often doing something else
beneficial. People have mentioned to me that we require people to run
regression tests for correctness, but that we don't really have
anything equivalent for performance.
It would be possible to detect performance regressions after the fact, but
soon enough to look at reverting patches. For example, given multiple
machines doing SPEC benchmark runs every night, the alarm could be raised
if a significant performance regression is detected. To guard against
noise from machine hiccups, two different machines would have to report
a regression to raise the alarm. But the big problem is the non-freeness
of SPEC; ideally there would be a benchmark that ...
... everyone can download and run
... is reasonably fast
... is non-trivial
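
To make the two-machine rule above concrete, here is a rough sketch in Python. The function names, the 5% threshold, and the shape of the input (machine name mapped to the previous and latest run time) are all assumptions for illustration; nothing like this exists in the GCC infrastructure as far as this thread says.

```python
def machines_reporting_regression(runs, threshold=0.05):
    """Return the machines whose latest run is significantly slower.

    runs maps a machine name to (previous_seconds, latest_seconds).
    threshold is the relative slowdown counted as significant
    (5% here is an assumed value, not one from the thread).
    """
    slow = []
    for machine, (previous, latest) in runs.items():
        if previous > 0 and (latest - previous) / previous > threshold:
            slow.append(machine)
    return sorted(slow)


def should_raise_alarm(runs, threshold=0.05, min_machines=2):
    """Raise the alarm only if at least min_machines agree on a regression."""
    return len(machines_reporting_regression(runs, threshold)) >= min_machines


# Hypothetical nightly results: two of three machines slowed down by ~9%.
nightly = {
    "alpha": (100.0, 109.0),
    "beta": (98.0, 99.0),
    "gamma": (101.0, 110.0),
}
print(machines_reporting_regression(nightly))
print(should_raise_alarm(nightly))
```

Requiring agreement means a single machine having a bad night is ignored, at the cost of needing at least two machines running the same benchmark.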
Yes! This would be very useful for other free software projects.
Another possible requirement is that the tests are not too large; it
would be nice to include them in the source code of one's project for
easier integration.
As a strawman, perhaps we could add a small integer program (bzip?) and
a small floating-point program to the testsuite, and have DejaGNU print
out the number of iterations of each that run in 10 seconds.
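
For illustration, the "iterations in a fixed time" measurement could look roughly like the sketch below. It is written in Python rather than as a DejaGNU test, the two workloads are made-up stand-ins (not bzip), and the example uses a 0.1 second budget instead of 10 seconds only so that it finishes quickly.

```python
import time


def iterations_in(workload, budget_seconds=10.0):
    """Count the calls to workload() that complete within the time budget."""
    deadline = time.perf_counter() + budget_seconds
    count = 0
    while True:
        workload()
        if time.perf_counter() > deadline:
            break  # this call finished after the deadline, so do not count it
        count += 1
    return count


def integer_workload():
    # Stand-in for a small integer program.
    total = 0
    for i in range(1000):
        total += (i * i) % 7
    return total


def float_workload():
    # Stand-in for a small floating-point program.
    x = 1.0
    for _ in range(1000):
        x = x * 1.0000001 + 0.5
    return x


print("integer iterations:", iterations_in(integer_workload, 0.1))
print("float iterations:", iterations_in(float_workload, 0.1))
```

A higher count means faster code. The raw numbers are only comparable between runs on the same machine under similar load.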
Would that really catch much?
I've been thinking about this kind of thing recently for Valgrind. I was
thinking that a combination of real programs and artificial
microbenchmarks would be good. The microbenchmarks would be like the GCC
(correctness) torture tests -- a collection of programs, added to over
time, each one demonstrating a prior performance bug. You could start it
off with a few tests containing things like key inner loops extracted from
programs such as bzip2.
Measuring the programs and categorizing regressions is tricky. It's
possible that the artificial tests would be small enough that any
regression would be obvious (e.g. failing to remove that extra instruction
would cause a 10% slowdown). And CSiBE-style graphing is very effective
for seeing trends.
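
As a rough sketch of the categorizing step, something like the following could label each microbenchmark against a stored baseline. The names, the 3% tolerance, and the timings are hypothetical, chosen only to show the idea.

```python
def categorize(baseline, current, tolerance=0.03):
    """Label each test 'regression', 'improvement', or 'unchanged'.

    baseline and current map test names to run times in seconds.
    Tests missing from either side, or with a non-positive baseline,
    are skipped. tolerance is the relative change treated as noise
    (3% is an assumed value).
    """
    labels = {}
    for name in sorted(baseline.keys() & current.keys()):
        old, new = baseline[name], current[name]
        if old <= 0:
            continue
        change = (new - old) / old
        if change > tolerance:
            labels[name] = "regression"
        elif change < -tolerance:
            labels[name] = "improvement"
        else:
            labels[name] = "unchanged"
    return labels


# Hypothetical timings: one inner loop got 10% slower.
before = {"inner_loop": 1.00, "block_copy": 2.00, "hash": 1.00}
after = {"inner_loop": 1.10, "block_copy": 1.80, "hash": 1.01}
for test, label in categorize(before, after).items():
    print(test, label)
```

With tests this small, the tolerance has to sit above the timing noise on the machine in question, which is something that would need to be measured rather than guessed.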
Nick