Re: [webkit-dev] Iterating SunSpider
On Jul 7, 2009, at 8:50 PM, Geoffrey Garen wrote: I also don't buy your conclusion -- that if regular expressions account for 1% of JavaScript time on the Internet overall, they need not be optimized. I never said that. You said the regular expression test was most likely... the least relevant test in SunSpider. You said implementors' choice to optimize regular expressions because they were hot on SunSpider was not what we want to encourage. But maybe I misunderstood you. Do you think it was a good thing that SunSpider encouraged optimization of regular expressions? If so, do you think the same thing would have happened had SunSpider not used summation in calculating its scores?

I suspect this line of questioning will not result in effective persuasion or useful information transfer. It comes off as kind of a gotcha question. My understanding of Mike's position is this:
- The slowest test on the benchmark will become a focus of optimization regardless of scoring method (thus, I assume he does not really think regexp optimization efforts are an utter waste).
- During the period when JS engines had made most things much faster than they were when SunSpider first came out, but hadn't yet extensively optimized regexps, the test gave a misleading and potentially unfair picture of overall performance. And this is a condition that could happen again in the future.

I think this is a plausible position, but I don't entirely buy these arguments, and I don't think they outweigh the reasons we chose to use summation scoring. I think it's ultimately a judgment call, and unless we have new information to present, we don't need to drag out the conversation or call each other to account on details of supporting arguments. Regards, Maciej ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Mon, Jul 6, 2009 at 10:11 AM, Geoffrey Garen gga...@apple.com wrote: So, what you end up with is after a couple of years, the slowest test in the suite is the most significant part of the score. Further, I'll predict that the slowest test will most likely be the least relevant test, because the truly important parts of JS engines were already optimized. This has happened with Sunspider 0.9 - the regex portions of the test became the dominant factor, even though they were not nearly as prominent in the real world as they were in the benchmark. This leads to implementors optimizing for the benchmark - and that is not what we want to encourage. How did you determine that regex performance is not nearly as prominent in the real world?

For a while regex was 20-30% of the benchmark on most browsers even though it didn't consume 20-30% of the time that browsers spent inside javascript. So, I determined this through profiling. If you profile your browser while browsing websites, you won't find that it spends 20-30% of its javascript execution time running regex (even with the old pcre). It's more like 1%. If this is true, then it's a shame to see this consume 20-30% of any benchmark, because it means the benchmark scoring is not indicative of the real world. Maybe I just disagree with the mix ever having been very representative? Or maybe it changed over time? I don't know because I can't go back in time :-) Perhaps one solution is to better document how a mix is chosen. I don't really want to make this a debate about regex and he-says/she-says about how expensive it is. We should talk about the framework. If the framework is subject to this type of skew, where it can disproportionately weight a test, is that something we should avoid? Keep in mind I'm not recommending any change to existing SunSpider 0.9 - just changes to future versions. Maciej pointed out a case where he thought the geometric mean was worse; I think that's a fair consideration if you have the perfect benchmark with an exactly representative workload. But we don't have the ability to make a perfectly representative benchmark workload, and even if we did it would change over time - eventually making the benchmark useless... Mike ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
As I said, we can argue the mix of tests forever, but it is not useful. Yes, I would test using top-100 sites. In the future, if a benchmark claims to have a representative mix, it should document why. Right? Are you saying that you did see Regex as being such a high percentage of javascript code? If so, we're using very different mixes of content for our tests. Mike On Tue, Jul 7, 2009 at 3:08 PM, Geoffrey Garen gga...@apple.com wrote: So, I determined this through profiling. If you profile your browser while browsing websites, you won't find that it spends 20-30% of its javascript execution time running regex (even with the old pcre). What websites did you browse, and how did you choose them? Do you think your browsing is representative of all JavaScript applications? Geoff ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak m...@apple.com wrote: On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote: I'd like to understand what's going to happen with SunSpider in the future. Here is a set of questions and criticisms. I'm interested in how these can be addressed. There are 3 areas I'd like to see improved in SunSpider, some of which we've discussed before: #1: SunSpider is currently version 0.9. Will SunSpider ever change? Or is it static? I believe that benchmarks need to be able to move with the times. As JS Engines change and improve, and as new areas are needed to be benchmarked, we need to be able to roll the version, fix bugs, and benchmark new features. The SunSpider version has not changed for ~2yrs. How can we change this situation? Are there plans for a new version already underway? I've been thinking about updating SunSpider for some time. There are two categories of changes I've thought about: 1) Quality-of-implementation changes to the harness. Among these might be ability to use the harness with multiple test sets. That would be 1.0. 2) An updated set of tests - the current tests are too short, and don't adequately cover some areas of the language. I'd like to make the tests take at least 100ms each on modern browsers on recent hardware. I'd also be interested in incorporating some of the tests from the v8 benchmark suite, if the v8 developers were ok with this. That would be SunSpider 2.0. The reason I've been hesitant to make any changes is that the press and independent analysts latched on to SunSpider as a way of comparing JavaScript implementations. Originally, it was primarily intended to be a tool for the WebKit team to help us make our JavaScript faster. However, now that third parties are relying it, there are two things I want to be really careful about: a) I don't want to invalidate people's published data, so significant changes to the test content would need to be published as a clearly separate version. b) I want to avoid accidentally or intentionally making changes that are biased in favor of Safari or WebKit-based browsers in general, or that even give that impression. That would hurt the test's credibility. When we first made SunSpider, Safari actually didn't do that great on it, which I think helped people believe that the test wasn't designed to make us look good, it was designed to be a relatively unbiased comparison. Thus, any change to the content would need to be scrutinized in some way. I'm not sure what it would take to get widespread agreement that a 2.0 content set is fair, but I agree it's time to make one soonish (before the end of the year probably). Thoughts on this are welcome. #2: Use of summing as a scoring mechanism is problematic Unfortunately, the sum-based scoring techniques do not withstand the test of time as browsers improve. When the benchmark was first introduced, each test was equally weighted and reasonably large. Over time, however, the test becomes dominated by the slowest tests - basically the weighting of the individual tests is variable based on the performance of the JS engine under test. Today's engines spend ~50% of their time on just string and date tests. The other tests are largely irrelevant at this point, and becoming less relevant every day. Eventually many of the tests will take near-zero time, and the benchmark will have to be scrapped unless we figure out a better way to score it. 
Benchmarking research which long pre-dates SunSpider confirms that geometric means provide a better basis for comparison: http://portal.acm.org/citation.cfm?id=5673 Can future versions of the SunSpider driver be made so that they won't become irrelevant over time?

Use of summation instead of geometric mean was a considered choice. The intent is that engines should focus on whatever is slowest. A simplified example: let's say it's estimated that the likely workload in the field will consist of 50% Operation A and 50% Operation B, and I can benchmark them in isolation. Now let's say that in implementation Foo these operations are equally fast, while in implementation Bar, Operation A is 4x as fast as in Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric means would imply that Foo and Bar are equally good, but Bar would actually be twice as slow on the intended workload.

BTW - the way to work around this is to have enough sub-benchmarks such that this just doesn't happen. If we have the right test coverage, it seems unlikely to me that a code change would dramatically improve exactly one test at an exponential expense of exactly one other test. I'm not saying it is impossible - just that code changes don't generally cause that behavior. To combat this we can implement a broader base of benchmarks as well as longer-running tests that are not too micro. This brings up another problem with summation. The only case
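For concreteness, here is the arithmetic behind the Foo/Bar example above as a small JavaScript sketch. Only the 50/50 split and the 4x factors come from the example itself; the 100ms baseline is a hypothetical number chosen for illustration:

    // Hypothetical per-operation times (ms) for the 50/50 A/B workload above.
    // Foo: both operations equally fast. Bar: A is 4x faster, B is 4x slower.
    var foo = { opA: 100, opB: 100 };
    var bar = { opA: 25,  opB: 400 };

    function workloadTime(t) { return t.opA + t.opB; }           // summation-style total
    function geomean(t)      { return Math.sqrt(t.opA * t.opB); } // geometric mean of the two tests

    var sumFoo = workloadTime(foo), sumBar = workloadTime(bar); // 200 vs 425: Bar roughly 2x slower on the workload
    var gmFoo  = geomean(foo),      gmBar  = geomean(bar);      // 100 vs 100: the geometric mean calls them equal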
Re: [webkit-dev] Iterating SunSpider
I'm more verbose than Mike, but it seems like people are talking past each other. On Tue, Jul 7, 2009 at 3:25 PM, Oliver Hunt oli...@apple.com wrote: If we see one section of the test taking dramatically longer than another then we can assume that we have not been paying enough attention to performance in that area, It depends on what your goal with perf is. If the goal is to balance optimizations such that operation A always consumes the same time as operation B, you are correct. But is this always best? The current design says yes. The open question is whether that is the best possible design. On Tue, Jul 7, 2009 at 3:58 PM, Geoffrey Garen gga...@apple.com wrote: I also don't buy your conclusion -- that if regular expressions account for 1% of JavaScript time on the Internet overall, they need not be optimized. I didn't see Mike say that regexes did not need to be optimized. If given an operation that occurs 20% of the time and another that occurs 1% of the time, I certainly think it _might_ be appropriate to spend more engineering effort on optimizing the first operation. Knowing for sure depends on how much you value the rarer cases, for reasons such as you give next: Second, it's important for all web apps to be fast in WebKit -- not just the ones that do what's common overall. Third, we want to enable not only the web applications of today, but also the web applications of tomorrow. I strongly agree with these principles, but I don't see why the current design necessarily does a better job of preserving them than all other designs. For example, let's say at the time SunSpider was created (and everything was roughly equal-weighted) that one of the subtests tested a horribly slow operation that would greatly benefit future web apps if it improved substantially. Unfortunately, the original equal-weighting enshrines the slowness of this operation, relative to the others being tested, such that if you begin to make it faster, the subtests become unbalanced and you conclude that no further work on it is needed for the time being. This is a suboptimal outcome. So in general, the question is: when some operation is slower than others, what criteria can we use to make the best decisions about where to spend developer effort? Surely our greatest cost here is opportunity cost. I accept Maciej's statement that the current design was intentional. I also accept that sums and geomeans each have drawbacks in guiding decision-making. I simply want to focus on finding the best possible design for the framework. For example, the framework could compute both sums _and_ geomeans, if people thought both were valuable. We could agree on a way of benchmarking a representative sample of current sites to get an idea of how widespread certain operations currently are. We could talk with the maintainers of jQuery, Dojo, etc. to see what sorts of operations they think would be helpful to future apps to make faster. We could instrument browsers to have some sort of (opt-in) sampling of real-world workloads. etc. Surely together we can come up with ways to make Sunspider even better, while keeping its current strengths in mind. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Jul 7, 2009, at 4:01 PM, Mike Belshe wrote: I'd like benchmarks to: a) have meaning even as browsers change over time b) evolve. as new areas of JS (or whatever) become important, the benchmark should have facilities to include that. Fair? Good? Bad?

I think we can't rule out the possibility of a benchmark becoming less meaningful over time. I do think that we should eventually produce a new and rebalanced set of test content. I think it's fair to say that time is approaching for SunSpider. In particular, I don't think geometric means are a magic bullet. When SunSpider was first created, regexps were a small proportion of the total execution time in what were the fastest publicly available engines at the time. Eventually, everything else got much faster. So at some point, SunSpider said it might be a good idea to quadruple the speed of regexp matching now. But if it used a geometric mean, it would always say it's a good idea to quadruple the speed of regexp matching, unless it omitted regexp tests entirely. From any starting point, and regardless of the speed of other facilities, speeding up regexps by a factor of N would always show the same improvement in your overall score. SunSpider, on the other hand, was deliberately designed to highlight the area where an engine most needs improvement. I think the only real way to deal with this is to periodically revise and rebalance the benchmark. Regards, Maciej ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
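A small sketch of the scaling property described above, with made-up subtest times (these are not SunSpider numbers): under a geometric mean, dividing one subtest's time by N scales the overall score by N^(-1/k) no matter how small that subtest's share already is, while under summation the payoff shrinks along with the share.

    // Four hypothetical subtest times in ms; regexp is already a small slice.
    var times = [40, 60, 80, 20]; // math, string, date, regexp

    function sum(t) {
      var s = 0;
      for (var i = 0; i < t.length; i++) s += t[i];
      return s;
    }
    function geomean(t) {
      var p = 1;
      for (var i = 0; i < t.length; i++) p *= t[i];
      return Math.pow(p, 1 / t.length);
    }

    var faster = times.slice();
    faster[3] /= 4; // quadruple regexp speed

    var sumRatio = sum(faster) / sum(times);         // 185/200 = 0.925: the gain depends on regexp's share of the total
    var gmRatio  = geomean(faster) / geomean(times); // (1/4)^(1/4) ~= 0.71: the same gain regardless of regexp's share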
Re: [webkit-dev] Iterating SunSpider
What you seem to think is better would be to repeatedly update sunspider every time that something gets faster, ignoring entirely that the value in sunspider is precisely that it has not changed.

Not quite what I'm saying :-) I'd like benchmarks to: a) have meaning even as browsers change over time b) evolve. as new areas of JS (or whatever) become important, the benchmark should have facilities to include that. Fair? Good? Bad?

It's not unreasonable, but it can't be done on a whim, and changes cannot be made trivially. Both re-weighting sunspider and adding new tests as things are made faster are incredibly hard to do soundly, because it becomes easy to end up obscuring meaningful data. In the context of regex, for example, say sunspider had been reweighted for the current generation of js engines before anyone had looked at regex. Regex would not have stood out as being substantially slower, and would likely not have been investigated, resulting in everyone having regex an order of magnitude slower than current engines. That's why sunspider has not been updated: after what, a year and a half (?), it can still show areas where performance can be improved, and while it does that it's still useful. So determining when it is sensible to update sunspider is difficult. You may be right, and find that rebalancing shows new areas where performance can be improved, but if you're wrong you run the risk of changing the benchmark from an actually useful development tool into something that is only useful for producing a number at the end. If we see one section of the test taking dramatically longer than another then we can assume that we have not been paying enough attention to performance in that area; this is how we originally noticed just how slow the regex engine was. If we had been continually rebalancing the test over and over again we would not have noticed this or other areas where performance could be (and has been) improved. It would also break sunspider as a means for tracking and/or preventing performance regressions.

Of course, using old versions of the benchmark for regression testing is not prohibited by iterating a benchmark.

But what happens when the benchmarks disagree as to what the improvement is? You can't improve performance with one benchmark while testing for regressions with another. --Oliver ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Tue, Jul 7, 2009 at 4:20 PM, Maciej Stachowiak m...@apple.com wrote: On Jul 7, 2009, at 4:01 PM, Mike Belshe wrote: I'd like benchmarks to: a) have meaning even as browsers change over time b) evolve. as new areas of JS (or whatever) become important, the benchmark should have facilities to include that. Fair? Good? Bad? I think we can't rule out the possibility of a benchmark becoming less meaningful over time. I do think that we should eventually produce a new and rebalanced set of test content. I think it's fair to say that time is approaching for SunSpider.

I certainly agree that updating the benchmark over time is necessary :-)

In particular, I don't think geometric means are a magic bullet.

Yes, using a geometric mean does not mean that you never need to update the test suite. But it does give you a lot of mileage :-) And I think it's closer to an industry standard than anything else (spec.org).

When SunSpider was first created, regexps were a small proportion of the total execution time in what were the fastest publicly available engines at the time. Eventually, everything else got much faster. So at some point, SunSpider said it might be a good idea to quadruple the speed of regexp matching now. But if it used a geometric mean, it would always say it's a good idea to quadruple the speed of regexp matching, unless it omitted regexp tests entirely. From any starting point, and regardless of the speed of other facilities, speeding up regexps by a factor of N would always show the same improvement in your overall score. SunSpider, on the other hand, was deliberately designed to highlight the area where an engine most needs improvement.

I don't think the optimization of regex would have been affected by using a different scoring mechanism. In both scoring methods, the score of the slowest test is the best pick for improving your overall score. So vendors would still need to optimize it to keep up. Mike

I think the only real way to deal with this is to periodically revise and rebalance the benchmark. Regards, Maciej ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote: For example, the framework could compute both sums _and_ geomeans, if people thought both were valuable.

That's a plausible thing to do, but I think there's a downside: if you make a change that moves the two scores in opposite directions, the benchmark doesn't help you decide if it's good or not. Avoiding paralysis in the face of tradeoffs is part of the reason we look primarily at the total score, not the individual subtest scores. The whole point of a meta-benchmark like this is to force ourselves to simplemindedly look at only one number.

We could agree on a way of benchmarking a representative sample of current sites to get an idea of how widespread certain operations currently are. We could talk with the maintainers of jQuery, Dojo, etc. to see what sorts of operations they think would be helpful to future apps to make faster. We could instrument browsers to have some sort of (opt-in) sampling of real-world workloads. etc. Surely together we can come up with ways to make Sunspider even better, while keeping its current strengths in mind.

I think these are all good ideas. I think there's one way in which sampling the Web is not quite right. To some extent, what matters is not average density of an operation but peak density. An operation that's used a *lot* by a few sites and hardly used by most sites may deserve a weighting above its average proportion of Web use.

I would like to hear input on what is inadequately covered. I tend to think there should be more coverage of the following:
- property access, involving at least some polymorphic access patterns
- method calls
- object-oriented programming patterns
- GC load
- programming in a style that makes significant use of closures

I think the V8 benchmark does a much better job of covering the first four of these things. I also think it overweights them, to the exclusion of most other considerations (*). As I mentioned before, I'd like to include some of V8's tests in a future SunSpider 2.0 content set. It would be good to know what other things should be tested that are not sufficiently covered. Regards, Maciej

* - For example, Mozilla's TraceMonkey effort showed relatively little improvement on the V8 benchmark, even though it showed significant improvement on SunSpider and other benchmarks. I think TraceMonkey speedups are real and significant, so this would tend to undermine my confidence in the V8 benchmark's coverage. Note: I don't mean to start a side thread about whether the V8 benchmark is good or not, I just wanted to justify my remarks above. ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
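As a rough illustration only (not a proposed SunSpider 2.0 test; the shapes, names, and iteration count are arbitrary), a subtest aimed at the polymorphic-access, method-call, closure, and GC-load items on that list might look something like this:

    // Sketch of a subtest mixing polymorphic property access, method calls,
    // closure-heavy code, and short-lived allocation (GC load).
    function makeCounter(start) {
      var n = start;
      return function (delta) { n += delta; return n; }; // closure over n
    }

    function Point2(x, y) { this.x = x; this.y = y; }
    Point2.prototype.norm = function () { return Math.sqrt(this.x * this.x + this.y * this.y); };

    function Point3(x, y, z) { this.x = x; this.y = y; this.z = z; }
    Point3.prototype.norm = function () { return Math.sqrt(this.x * this.x + this.y * this.y + this.z * this.z); };

    function run(iterations) {
      var bump = makeCounter(0);
      var acc = 0;
      for (var i = 0; i < iterations; i++) {
        // Alternate receiver shapes so the .norm() and .x sites stay polymorphic.
        var p = (i & 1) ? new Point2(i, i + 1) : new Point3(i, i + 1, i + 2);
        acc += p.norm() + p.x; // polymorphic method call + property access
        acc += bump(1);        // call through a closure
        // Allocate short-lived garbage each iteration to exercise the collector.
        var tmp = [i, i * 2, [i, i + 1]];
        acc += tmp.length;
      }
      return acc;
    }

    run(100000);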
Re: [webkit-dev] Iterating SunSpider
On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak m...@apple.com wrote: On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote: For example, the framework could compute both sums _and_ geomeans, if people thought both were valuable. That's a plausible thing to do, but I think there's a downside: if you make a change that moves the two scores in opposite directions, the benchmark doesn't help you decide if it's good or not. Avoiding paralysis in the face of tradeoffs is part of the reason we look primarily at the total score, not the individual subtest scores. The whole point of a meta-benchmark like this is to force ourselves to simplemindedly look at only one number. Yes, I originally had more text like deciding how to use these scores would be the hard part, and this is precisely why. I suppose that if different vendors wanted to use different criteria to determine what to do in the face of a tradeoff, the benchmark could simply be a data source, rather than a strong guide. But this would make it difficult to use the benchmark to compare engines, which is currently a key use of SunSpider (and is a key failing, IMO, of frameworks like Dromaeo that don't run identical code on every engine [IIRC]). I think there's one way in which sampling the Web is not quite right. To some extent, what matters is not average density of an operation but peak density. An operation that's used a *lot* by a few sites and hardly used by most sites, may deserve a weighting above its average proportion of Web use. If I understand you right, the effect you're noting is that speeding up every web page by 1 ms might be a larger net win but a smaller perceived win than speeding up, say, Gmail alone by 100 ms. I think this is true. One way to capture this would be to say that at least part of the benchmark should concentrate on operations that are used in the inner loops of any of n popular websites, without regard to their overall frequency on the web. (Although perhaps the two correlate well and there aren't a lot of rare but peaky operations? I don't know.) - GC load I second this. As people use more tabs and larger, more complex apps, the performance of an engine under heavier GC load becomes more relevant. It would be good to know what other things should be tested that are not sufficiently covered. I think DOM bindings are hard to test and would benefit from benchmarking. No public benchmarks seem to test these well today. * - For example, Mozilla's TraceMonkey effort showed relatively little improvement on the V8 benchmark, even though it showed significant improvement on SunSpider and other benchmarks. I think TraceMonkey speedups are real and significant, so this would tend to undermine my confidence in the V8 benchmark's coverage. I agree that the V8 benchmark's coverage is inadequate and that the example you mention illuminates that, because TraceMonkey definitely performs better than SpiderMonkey in my own usage. I wonder if there may have been an opposite effect in a few cases where benchmarks with very simple tight loops improved _more_ under TM than real-world code did, but I think the answer to that is simply that benchmarks should be testing both kinds of code. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
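To make the DOM-bindings point concrete, a micro-benchmark in that vein might look something like the sketch below. It assumes a browser environment; the specific calls and counts are arbitrary, and how such a test should be weighted is exactly the open question in this thread.

    // Sketch of a DOM-bindings micro-benchmark: most of the work is crossing
    // the JS/DOM boundary rather than JS computation itself.
    function benchDomBindings(iterations) {
      var container = document.createElement("div");
      for (var i = 0; i < 100; i++) {
        var child = document.createElement("span");
        child.className = "item";
        container.appendChild(child);
      }
      document.body.appendChild(container);

      var start = new Date().getTime();
      var hits = 0;
      for (var j = 0; j < iterations; j++) {
        // Each of these calls goes through the bindings layer.
        hits += container.getElementsByTagName("span").length;
        hits += container.childNodes.length;
        container.firstChild.setAttribute("data-i", String(j));
      }
      var elapsed = new Date().getTime() - start;

      document.body.removeChild(container);
      return { hits: hits, elapsed: elapsed };
    }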
Re: [webkit-dev] Iterating SunSpider
On Tue, Jul 7, 2009 at 7:01 PM, Maciej Stachowiak m...@apple.com wrote: On Jul 7, 2009, at 6:43 PM, Mike Belshe wrote: (There are other benchmarks that use summation, for example iBench, though I am not sure these are examples of excellent benchmarks. Any benchmark that consists of a single test also implicitly uses summation. I'm not sure what other benchmarks do is as relevant as the technical merits.) Hehe - I don't think anyone has iBench except apple :-) This is now extremely tangential to the original point, but iBench is available to the general public here: http://www.lionbridge.com/lionbridge/en-US/services/software-product-engineering/testing-veritest/benchmark-software.htm

Thanks! A lot of research has been put into benchmarking over the years; there is good reason for these choices, and they aren't arbitrary. I have not seen research indicating that summing of scores is statistically useful, but there are plenty that have chosen geometric means.

I think we're starting to repeat our positions at this point, without adding new information or really persuading each other. If you have research that shows statistical benefits to geometric mean scoring, or other new information to add, I would welcome it.

Only what is already on this thread, or google for geometric mean benchmark. Mike

Regards, Maciej ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
Hi,

Can future versions of the SunSpider driver be made so that they won't become irrelevant over time?

I feel the weighting is more of an issue here than the total runtime. Eventually some tests become dominant, and the gain (or loss) on them almost determines the final results. Besides, there was a discussion about SunSpider enhancements a year ago. We collected some new JS benchmarks and put them into a WindScorpion (another name for SunSpider) extension package. However, the topic died away after a short time. Zoltan ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Jul 6, 2009, at 10:11 AM, Geoffrey Garen wrote: So, what you end up with is after a couple of years, the slowest test in the suite is the most significant part of the score. Further, I'll predict that the slowest test will most likely be the least relevant test, because the truly important parts of JS engines were already optimized. This has happened with Sunspider 0.9 - the regex portions of the test became the dominant factor, even though they were not nearly as prominent in the real world as they were in the benchmark. This leads to implementors optimizing for the benchmark - and that is not what we want to encourage. How did you determine that regex performance is not nearly as prominent in the real world? For reference: in current JavaScriptCore, the one regexp-centric test is about 4.6% of the score by time. 3 of the string tests also spend their time in regexps, however, I think those are among the tests that most closely resemble what Web sites do. I believe the situation is roughly similar in other competitive JavaScript engines. This is probably not exactly proportionate but it doesn't dominate the test. I don't think any of this is a problem, unless one thinks the regexp improvements in Nitro, V8 and TraceMonkey were a waste of resources. What I have seen happen is that numeric processing and especially integer math became a smaller and smaller proportion of the test, looking at the best publicly available engines over time. I think that turned out to be the case because math had much more room for optimization in naive implementations than, say, string processing. Regards, Maciej ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak m...@apple.com wrote: On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote: I'd like to understand what's going to happen with SunSpider in the future. Here is a set of questions and criticisms. I'm interested in how these can be addressed. There are 3 areas I'd like to see improved in SunSpider, some of which we've discussed before: #1: SunSpider is currently version 0.9. Will SunSpider ever change? Or is it static? I believe that benchmarks need to be able to move with the times. As JS Engines change and improve, and as new areas are needed to be benchmarked, we need to be able to roll the version, fix bugs, and benchmark new features. The SunSpider version has not changed for ~2yrs. How can we change this situation? Are there plans for a new version already underway? I've been thinking about updating SunSpider for some time. There are two categories of changes I've thought about: 1) Quality-of-implementation changes to the harness. Among these might be ability to use the harness with multiple test sets. That would be 1.0. Cool 2) An updated set of tests - the current tests are too short, and don't adequately cover some areas of the language. I'd like to make the tests take at least 100ms each on modern browsers on recent hardware. I'd also be interested in incorporating some of the tests from the v8 benchmark suite, if the v8 developers were ok with this. That would be SunSpider 2.0. Cool. Use of v8 tests is just fine; they're all open source. The reason I've been hesitant to make any changes is that the press and independent analysts latched on to SunSpider as a way of comparing JavaScript implementations. Originally, it was primarily intended to be a tool for the WebKit team to help us make our JavaScript faster. However, now that third parties are relying it, there are two things I want to be really careful about: a) I don't want to invalidate people's published data, so significant changes to the test content would need to be published as a clearly separate version. Of course. Small UI nit - the current SunSpider benchmark doesn't make the version very prominent at all. It would be nice to make it more salient. b) I want to avoid accidentally or intentionally making changes that are biased in favor of Safari or WebKit-based browsers in general, or that even give that impression. That would hurt the test's credibility. When we first made SunSpider, Safari actually didn't do that great on it, which I think helped people believe that the test wasn't designed to make us look good, it was designed to be a relatively unbiased comparison. Of course. Thus, any change to the content would need to be scrutinized in some way. I'm not sure what it would take to get widespread agreement that a 2.0 content set is fair, but I agree it's time to make one soonish (before the end of the year probably). Thoughts on this are welcome. #2: Use of summing as a scoring mechanism is problematic Unfortunately, the sum-based scoring techniques do not withstand the test of time as browsers improve. When the benchmark was first introduced, each test was equally weighted and reasonably large. Over time, however, the test becomes dominated by the slowest tests - basically the weighting of the individual tests is variable based on the performance of the JS engine under test. Today's engines spend ~50% of their time on just string and date tests. The other tests are largely irrelevant at this point, and becoming less relevant every day. 
Eventually many of the tests will take near-zero time, and the benchmark will have to be scrapped unless we figure out a better way to score it. Benchmarking research which long pre-dates SunSpider confirms that geometric means provide a better basis for comparison: http://portal.acm.org/citation.cfm?id=5673 Can future versions of the SunSpider driver be made so that they won't become irrelevant over time?

Use of summation instead of geometric mean was a considered choice. The intent is that engines should focus on whatever is slowest. A simplified example: let's say it's estimated that the likely workload in the field will consist of 50% Operation A and 50% Operation B, and I can benchmark them in isolation. Now let's say that in implementation Foo these operations are equally fast, while in implementation Bar, Operation A is 4x as fast as in Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric means would imply that Foo and Bar are equally good, but Bar would actually be twice as slow on the intended workload.

I could almost buy this if: a) we had a really really representative workload of what web pages do, broken down into the exactly correct proportions. b) the representative workload remains representative over time. I'll argue that we'll never be very good at (a), and that (b) is impossible. So, what
Re: [webkit-dev] Iterating SunSpider
Maciej Stachowiak wrote: I think the pauses were large in an attempt to get stable, repeatable results, but are probably longer than necessary to achieve this. I agree with you that the artifacts in balanced power mode are a problem. Do you know what timer thresholds avoid the effect? I think this would be a reasonable 1.0 kind of change. Just a gut feeling, but I suspect the exact throttling algorithm would vary too much from machine to machine and OS version to OS version to ever find a good threshold to avoid it. The best thing to do would be to have the harness turn off CPU throttling when it starts. (This is possible from the commandline under Linux, and I assume in Mac, but Windows might be a problem.) Joe ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On 4-Jul-09, at 2:47 PM, Mike Belshe wrote: #2: Use of summing as a scoring mechanism is problematic Unfortunately, the sum-based scoring techniques do not withstand the test of time as browsers improve. When the benchmark was first introduced, each test was equally weighted and reasonably large. Over time, however, the test becomes dominated by the slowest tests - basically the weighting of the individual tests is variable based on the performance of the JS engine under test. Today's engines spend ~50% of their time on just string and date tests. The other tests are largely irrelevant at this point, and becoming less relevant every day. Eventually many of the tests will take near-zero time, and the benchmark will have to be scrapped unless we figure out a better way to score it. Benchmarking research which long pre-dates SunSpider confirms that geometric means provide a better basis for comparison: http://portal.acm.org/citation.cfm?id=5673 Can future versions of the SunSpider driver be made so that they won't become irrelevant over time?

Actually this doesn't happen on all CPUs. For example, CPUs without an FPU have very different results. Memory performance is also a big factor.

#3: The SunSpider harness has a variance problem due to CPU power savings modes. Because the test runs a tiny amount of Javascript (often under 10ms) followed by a 500ms sleep, CPUs will go into power savings modes between test runs. This radically changes the performance measurements and makes it so that comparison between two runs is dependent on the user's power savings mode. To demonstrate this, run SunSpider on two machines - one with the Windows balanced (default) setting for power, and then again with high performance. It's easy to see skews of 30% between these two modes. I think we should change the test harness to avoid such accidental effects.

I've noticed this issue too. -- George Staikos Torch Mobile Inc. http://www.torchmobile.com/ ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Sat, Jul 4, 2009 at 11:47 AM, Mike Belshe m...@belshe.com wrote: #3: The SunSpider harness has a variance problem due to CPU power savings modes. This one worries me because it decreases the consistency/reproducibility of test scores and makes it harder to compare engines or to track one engine's scores over time. For example, doing a bunch of CPU work just before running the benchmark can affect whether and when the CPU throttles down during the benchmark run. Possible solution: The dromaeo test suite already incorporates the SunSpider individual tests under a new benchmark harness which fixes all 3 of the above issues. Thus, one approach would be to retire SunSpider 0.9 in favor of Dromaeo. http://dromaeo.com/?sunspider Dromaeo has also done a lot of good work to ensure statistical significance of the results. Once we have a better benchmarking framework, it would be great to build a new microbenchmark mix which more realistically exercises today's JavaScript. One complaint I have heard about the Dromaeo tests (not the harness) is that the actual JS that gets run differs from browser to browser (e.g. because it is a direct copy of a source library that does UA sniffing). If this is true it means that this suite as-is isn't useful to compare engines to each other. However, the Dromaeo _harness_ is probably a win as-is. Of course, changing anything about Sunspider raises the question of tracking historical performance. Perhaps the harness could support versioning, or perhaps people are simply willing to say Sunspider 1.0 scores cannot be compared to Sunspider 0.9 scores. I believe this is the approach the V8 benchmark takes. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] Iterating SunSpider
On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote: I'd like to understand what's going to happen with SunSpider in the future. Here is a set of questions and criticisms. I'm interested in how these can be addressed. There are 3 areas I'd like to see improved in SunSpider, some of which we've discussed before: #1: SunSpider is currently version 0.9. Will SunSpider ever change? Or is it static? I believe that benchmarks need to be able to move with the times. As JS Engines change and improve, and as new areas are needed to be benchmarked, we need to be able to roll the version, fix bugs, and benchmark new features. The SunSpider version has not changed for ~2yrs. How can we change this situation? Are there plans for a new version already underway?

I've been thinking about updating SunSpider for some time. There are two categories of changes I've thought about: 1) Quality-of-implementation changes to the harness. Among these might be ability to use the harness with multiple test sets. That would be 1.0. 2) An updated set of tests - the current tests are too short, and don't adequately cover some areas of the language. I'd like to make the tests take at least 100ms each on modern browsers on recent hardware. I'd also be interested in incorporating some of the tests from the v8 benchmark suite, if the v8 developers were ok with this. That would be SunSpider 2.0. The reason I've been hesitant to make any changes is that the press and independent analysts latched on to SunSpider as a way of comparing JavaScript implementations. Originally, it was primarily intended to be a tool for the WebKit team to help us make our JavaScript faster. However, now that third parties are relying on it, there are two things I want to be really careful about: a) I don't want to invalidate people's published data, so significant changes to the test content would need to be published as a clearly separate version. b) I want to avoid accidentally or intentionally making changes that are biased in favor of Safari or WebKit-based browsers in general, or that even give that impression. That would hurt the test's credibility. When we first made SunSpider, Safari actually didn't do that great on it, which I think helped people believe that the test wasn't designed to make us look good, it was designed to be a relatively unbiased comparison. Thus, any change to the content would need to be scrutinized in some way. I'm not sure what it would take to get widespread agreement that a 2.0 content set is fair, but I agree it's time to make one soonish (before the end of the year probably). Thoughts on this are welcome.

#2: Use of summing as a scoring mechanism is problematic Unfortunately, the sum-based scoring techniques do not withstand the test of time as browsers improve. When the benchmark was first introduced, each test was equally weighted and reasonably large. Over time, however, the test becomes dominated by the slowest tests - basically the weighting of the individual tests is variable based on the performance of the JS engine under test. Today's engines spend ~50% of their time on just string and date tests. The other tests are largely irrelevant at this point, and becoming less relevant every day. Eventually many of the tests will take near-zero time, and the benchmark will have to be scrapped unless we figure out a better way to score it. 
Benchmarking research which long pre-dates SunSpider confirms that geometric means provide a better basis for comparison: http://portal.acm.org/citation.cfm?id=5673 Can future versions of the SunSpider driver be made so that they won't become irrelevant over time?

Use of summation instead of geometric mean was a considered choice. The intent is that engines should focus on whatever is slowest. A simplified example: let's say it's estimated that the likely workload in the field will consist of 50% Operation A and 50% Operation B, and I can benchmark them in isolation. Now let's say that in implementation Foo these operations are equally fast, while in implementation Bar, Operation A is 4x as fast as in Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric means would imply that Foo and Bar are equally good, but Bar would actually be twice as slow on the intended workload. Of course, doing this requires a judgment call on a reasonable balance of different kinds of code, and that balance needs to be re-evaluated periodically. But tests based on geometric means also make an implied judgment call. The operations comprising each individual test are added linearly. The test then judges that these particular combinations are each equally important.

#3: The SunSpider harness has a variance problem due to CPU power savings modes. Because the test runs a tiny amount of Javascript (often under 10ms) followed by a 500ms sleep, CPUs
Re: [webkit-dev] Iterating SunSpider
On Jul 4, 2009, at 1:06 PM, Peter Kasting wrote: On Sat, Jul 4, 2009 at 11:47 AM, Mike Belshe m...@belshe.com wrote: #3: The SunSpider harness has a variance problem due to CPU power savings modes. This one worries me because it decreases the consistency/reproducibility of test scores and makes it harder to compare engines or to track one engine's scores over time. For example, doing a bunch of CPU work just before running the benchmark can affect whether and when the CPU throttles down during the benchmark run. Possible solution: The dromaeo test suite already incorporates the SunSpider individual tests under a new benchmark harness which fixes all 3 of the above issues. Thus, one approach would be to retire SunSpider 0.9 in favor of Dromaeo. http://dromaeo.com/?sunspider Dromaeo has also done a lot of good work to ensure statistical significance of the results. Once we have a better benchmarking framework, it would be great to build a new microbenchmark mix which more realistically exercises today's JavaScript. One complaint I have heard about the Dromaeo tests (not the harness) is that the actual JS that gets run differs from browser to browser (e.g. because it is a direct copy of a source library that does UA sniffing). If this is true it means that this suite as-is isn't useful to compare engines to each other. However, the Dromaeo _harness_ is probably a win as-is. Of course, changing anything about Sunspider raises the question of tracking historical performance. Perhaps the harness could support versioning, or perhaps people are simply willing to say Sunspider 1.0 scores cannot be compared to Sunspider 0.9 scores. I believe this is the approach the V8 benchmark takes.

I think versioning the test content is right, and I think we should do that over time. I think a harness change to avoid triggering powersaving mode on Windows would be a reasonable thing to do without a version change. I don't think Dromaeo is a good choice of harness - I don't think their results are stable enough and I am not confident in the statistical soundness of their methodology. Regards, Maciej ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
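One possible shape for such a harness change, sketched below. This is only an illustration, not how SunSpider's driver actually works; replacing part of the idle gap with a short busy-wait gives a power-managed CPU time to ramp back up before measurement starts, and the 50ms warm-up figure is a guess that would need measuring.

    // Sketch: spin the CPU briefly before timing a test so that a core which
    // clocked down during the inter-test pause is back at full speed.
    function warmUpCpu(ms) {
      var end = new Date().getTime() + ms;
      var x = 0;
      while (new Date().getTime() < end) {
        x += Math.sin(x) + 1; // arbitrary busy work; the result is returned so it isn't trivially dead code
      }
      return x;
    }

    function timeTest(testFn) {
      warmUpCpu(50); // hypothetical warm-up duration
      var start = new Date().getTime();
      testFn();
      return new Date().getTime() - start;
    }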