Hi Simon et al,

On Jun 25, 2010, at 14:39, Simon Marlow wrote:

> On 25/06/2010 00:24, Andy Georges wrote:
> 
>> <snip> 
>> Are there any inputs available that allow the real part of the suite
>> to run for a sufficiently long time? We're going to use criterion in
>> any case given our own expertise with rigorous benchmarking [3,4],
>> but since we've made a case in the past against short running apps on
>> managed runtime systems [5], we'd love to have stuff that runs at
>> least in the order of seconds, while doing useful things. All
>> pointers are much appreciated.
> 
> The short answer is no, although some of the benchmarks have tunable input 
> sizes (mainly the spectral ones) and you can 'make mode=slow' to run those 
> with larger inputs.
> 
> More generally, the nofib suite really needs an overhaul or replacement.  
> Unfortunately it's a tiresome job and nobody really wants to do it. There 
> have been various abortive efforts, including nobench and HaBench.  Meanwhile 
> we in the GHC camp continue to use nofib, mainly because we have some tool 
> infrastructure set up to digest the results (nofib-analyse).  Unfortunately 
> nofib has steadily degraded in usefulness over time due to both faster 
> processors and improvements in GHC, such that most of the programs now run 
> for less than 0.1s and are ignored by the tools when calculating averages 
> over the suite.

Right. I have the distinct feeling that this is a major gap in the Haskell world. 
SPEC evolved over time to include larger benchmarks that still exercise the 
various parts of the hardware, so that a benchmark does not suddenly show a 
large improvement on a new architecture/implementation just because, e.g., a 
larger cache lets the working set stay cache-resident for the entire execution. 
The Haskell community has nothing that remotely resembles a decent suite. You 
could run experiments and show that over 10K iterations the average execution 
time per iteration drops from 500ms to 450ms, but what does that really mean? 
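
To make that concrete, one way to report such numbers more honestly is a mean 
with a confidence interval rather than a bare average. A minimal, self-contained 
sketch of that kind of summary (the run times are placeholder data, and with 
only 20-30 runs a Student-t quantile would be safer than z = 1.96):

    -- Hypothetical example: summarise measured run times as a mean plus
    -- an approximate 95% confidence interval, assuming the runs are
    -- independent and roughly normally distributed.
    module Main where

    import Data.List (genericLength)

    mean :: [Double] -> Double
    mean xs = sum xs / genericLength xs

    -- Sample standard deviation.
    stdDev :: [Double] -> Double
    stdDev xs = sqrt (sum [(x - m) ^ (2 :: Int) | x <- xs] / (genericLength xs - 1))
      where m = mean xs

    -- Approximate 95% confidence interval for the mean (z = 1.96).
    confInterval95 :: [Double] -> (Double, Double)
    confInterval95 xs = (m - h, m + h)
      where
        m = mean xs
        h = 1.96 * stdDev xs / sqrt (genericLength xs)

    main :: IO ()
    main = do
      let runTimes = [0.52, 0.49, 0.51, 0.48, 0.50]  -- seconds, placeholder data
      print (mean runTimes)
      print (confInterval95 runTimes)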

> We have a need not just for plain Haskell benchmarks, but benchmarks that test
> 
> - GHC extensions, so we can catch regressions
> - parallelism (see nofib/parallel)
> - concurrency (see nofib/smp)
> - the garbage collector (see nofib/gc)
> 
> I tend to like quantity over quality: it's very common to get just one 
> benchmark in the whole suite that shows a regression or exercises a 
> particular corner of the compiler or runtime.  We should only keep benchmarks 
> that have a tunable input size, however.

I would suggest that the first category could consist of microbenchmarks, as I 
do not think it is really about performance per se. The other categories, 
however, really need long-running benchmarks that (preferably) use heaps of 
RAM, even when they are well tuned.
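
To make "tunable input size" concrete, here is a minimal sketch of the shape I 
have in mind: the size comes from the command line, so the same binary can run 
for a fraction of a second or for minutes and can be made to hold an 
arbitrarily large working set. The workload (building and summing a large 
Data.Map) is only a placeholder, not a proposal for an actual benchmark.

    module Main where

    import qualified Data.Map as M
    import System.Environment (getArgs)

    -- Build a map with n entries so the heap holds n live nodes.
    buildMap :: Int -> M.Map Int Int
    buildMap n = M.fromList [(i, i * i) | i <- [1 .. n]]

    main :: IO ()
    main = do
      args <- getArgs
      let n = case args of
                (s:_) -> read s
                _     -> 1000000          -- default input size
          m = buildMap n
      -- Forcing the size and the sum keeps the whole structure live and
      -- traverses it, so both the GC and the mutator get some work.
      print (M.size m, sum (M.elems m))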

> Criterion works best on programs that run for short periods of time, because 
> it runs the benchmark at least 100 times, whereas for exercising the GC we 
> really need programs that run for several seconds.  I'm not sure how best to 
> resolve this conflict.

I'm not sure about this. Given that there is quite a bit of non-determinism in 
modern CPUs and that computer systems seem to behave chaotically [1], I 
definitely see the need to employ Criterion for longer-running applications as 
well. It may not need 100 executions, or multiple iterations per execution 
(incidentally, can those iterations really be considered independent?), but 
somewhere around 20-30 seems to be a minimum. 
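
Something along these lines is roughly what I would try: cap the measurement 
time instead of letting criterion take its default number of samples. This is 
only a sketch; the exact Config fields (timeLimit here) and the defaultMainWith 
signature differ between criterion versions, and the workload is just a 
stand-in for a real long-running benchmark.

    module Main where

    import Criterion.Main
    import Criterion.Types (Config (..))

    -- Placeholder standing in for a real, long-running benchmark.
    workload :: Int -> Integer
    workload n = sum [fromIntegral i * fromIntegral i | i <- [1 .. n]]

    main :: IO ()
    main = defaultMainWith
             defaultConfig { timeLimit = 60 }  -- allow up to ~60s of measurements
             [ bench "workload/10M" (whnf workload 10000000) ]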

> 
> Meanwhile, I've been collecting pointers to interesting programs that cross 
> my radar, in anticipation of waking up with an unexpectedly free week in 
> which to pull together a benchmark suite... clearly overoptimistic!  But I'll 
> happily pass these pointers on to anyone with the inclination to do it.


I'm definitely interested. If I want to make a strong case for my current 
research, I really need benchmarks that can be used. Additionally, coming up 
with a good suite and characterising it can easily result in a decent paper 
that is certain to be cited numerous times. I think it would have to be a 
group/community effort, though. I've looked through the apps on the Haskell 
wiki pages, but there's not much usable there, imho. I'd like to illustrate 
this with the example of the DaCapo benchmark suite [2,3]. It took a while, but 
now everybody in the Java camp is (or should be) using these benchmarks. Saying 
that we simply do not want to do this is not a position we can maintain. 


-- Andy


[1] Computer systems are dynamical systems, Todd Mytkowicz, Amer Diwan, and 
Elizabeth Bradley, Chaos 19, 033124 (2009); doi:10.1063/1.3187791.
[2] The DaCapo benchmarks: Java benchmarking development and analysis, Stephen 
Blackburn et al., OOPSLA 2006.
[3] Wake up and smell the coffee: evaluation methodology for the 21st century, 
Stephen Blackburn et al., CACM 2008.

