On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak <m...@apple.com> wrote:
> On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:
>> For example, the framework could compute both sums _and_ geomeans, if
>> people thought both were valuable.
>
> That's a plausible thing to do, but I think there's a downside: if you make
> a change that moves the two scores in opposite directions, the benchmark
> doesn't help you decide if it's good or not. Avoiding paralysis in the face
> of tradeoffs is part of the reason we look primarily at the total score, not
> the individual subtest scores. The whole point of a meta-benchmark like this
> is to force ourselves to simplemindedly look at only one number.

Yes, I originally had more text like "deciding how to use these scores would
be the hard part", and this is precisely why.  I suppose that if different
vendors wanted to use different criteria to decide what to do in the face of
a tradeoff, the benchmark could simply be a data source rather than a strong
guide.  But this would make it difficult to use the benchmark to compare
engines, which is currently a key use of SunSpider (and a key failing, IMO,
of frameworks like Dromaeo that don't run identical code on every engine
[IIRC]).

> I think there's one way in which sampling the Web is not quite right. To
> some extent, what matters is not average density of an operation but peak
> density. An operation that's used a *lot* by a few sites and hardly used by
> most sites may deserve a weighting above its average proportion of Web use.

If I understand you right, the effect you're noting is that speeding up
every web page by 1 ms might be a larger net win, but a smaller perceived
win, than speeding up, say, Gmail alone by 100 ms.  I think this is true.
One way to capture this would be to say that at least part of the benchmark
should concentrate on operations used in the inner loops of any of n popular
websites, without regard to their overall frequency on the web.
(Although perhaps the two correlate well and there aren't a lot of "rare but
peaky" operations?  I don't know.)

> - GC load

I second this.  As people use more tabs and larger, more complex apps, the
performance of an engine under heavier GC load becomes more relevant.

> It would be good to know what other things should be tested that are not
> sufficiently covered.

I think DOM bindings are hard to test and would benefit from benchmarking.
No public benchmarks seem to test these well today.

> * For example, Mozilla's TraceMonkey effort showed relatively little
> improvement on the V8 benchmark, even though it showed significant
> improvement on SunSpider and other benchmarks. I think TraceMonkey speedups
> are real and significant, so this would tend to undermine my confidence in
> the V8 benchmark's coverage.

I agree that the V8 benchmark's coverage is inadequate and that the example
you mention illuminates that, because TraceMonkey definitely performs better
than SpiderMonkey in my own usage.  I wonder if there may have been an
opposite effect in a few cases, where benchmarks with very simple tight
loops improved _more_ under TM than "real-world code" did, but I think the
answer to that is simply that benchmarks should test both kinds of code.

PK
_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev