Re: [webkit-dev] Iterating SunSpider

2009-07-08 Thread Maciej Stachowiak


On Jul 7, 2009, at 8:50 PM, Geoffrey Garen wrote:

I also don't buy your conclusion -- that if regular expressions  
account for 1% of JavaScript time on the Internet overall, they  
need not be optimized.


I never said that.


You said the regular expression test was "most likely... the least
relevant test" in SunSpider.


You said implementors' choice to optimize regular expressions
because they were hot on SunSpider was "not what we want to
encourage."


But maybe I misunderstood you. Do you think it was a good thing that  
SunSpider encouraged optimization of regular expressions? If so, do  
you think the same thing would have happened had SunSpider not used  
summation in calculating its scores?


I suspect this line of questioning will not result in effective  
persuasion or useful information transfer. It comes off as kind of a  
gotcha question.


My understanding of Mike's position is this:

- The slowest test on the benchmark will become a focus of  
optimization regardless of scoring method (thus, I assume he does not  
really think regexp optimization efforts are an utter waste.)


- During the period when JS engines had made most things much faster than
they were when SunSpider first came out, but hadn't yet extensively
optimized regexps, the test gave a misleading and potentially unfair
picture of overall performance. And this is a condition that could happen
again in the future.


I think this is a plausible position, but I don't entirely buy these
arguments, and I don't think they outweigh the reasons we chose to use
summation scoring. I think it's ultimately a judgment call, and unless
we have new information to present, we don't need to drag out the
conversation or call each other to account on details of supporting arguments.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Mon, Jul 6, 2009 at 10:11 AM, Geoffrey Garen gga...@apple.com wrote:

  So what you end up with, after a couple of years, is that the slowest test in
 the suite is the most significant part of the score.  Further, I'll predict
 that the slowest test will most likely be the least relevant test, because
 the truly important parts of JS engines were already optimized.  This has
 happened with SunSpider 0.9 - the regex portions of the test became the
 dominant factor, even though they were not nearly as prominent in the real
 world as they were in the benchmark.  This leads to implementors optimizing
 for the benchmark - and that is not what we want to encourage.


 How did you determine that regex performance is not nearly as prominent in
 the real world?


For a while regex was 20-30% of the benchmark on most browsers even though
it didn't consume 20-30% of the time that browsers spent inside JavaScript.

So, I determined this through profiling.  If you profile your browser while
browsing websites, you won't find that it spends 20-30% of its JavaScript
execution time running regex (even with the old PCRE).  It's more like 1%.
If this is true, then it's a shame to see this consume 20-30% of any
benchmark, because it means the benchmark scoring is not indicative of the
real world.  Maybe I just disagree with the mix ever having been very
representative?  Or maybe it changed over time?  I don't know because I
can't go back in time :-)  Perhaps one solution is to better document how a
mix is chosen.

I don't really want to make this a debate about regex and he-says/she-says
how expensive it is.  We should talk about the framework.  If the framework
is subject to this type of skew, where it can disproportionately weight a
test, is that something we should avoid?

Keep in mind I'm not recommending any change to existing SunSpider 0.9 -
just changes to future versions.

Maciej pointed out a case where he thought the geometric mean was worse; I
think that's a fair consideration if you have the perfect benchmark with an
exactly representative workload.  But we don't have the ability to make a
perfectly representative benchmark workload, and even if we did it would
change over time - eventually making the benchmark useless...

Mike


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
As I said, we can argue the mix of tests forever, but it is not useful.
Yes, I would test using the top-100 sites.  In the future, if a benchmark
claims to have a representative mix, it should document why.  Right?
Are you saying that you did see regex as being such a high percentage of
JavaScript code?  If so, we're using very different mixes of content for our
tests.

Mike


On Tue, Jul 7, 2009 at 3:08 PM, Geoffrey Garen gga...@apple.com wrote:

 So, I determined this through profiling.  If you profile your browser while
 browsing websites, you won't find that it spends 20-30% of its javascript
 execution time running regex (even with the old pcre).


 What websites did you browse, and how did you choose them?

 Do you think your browsing is representative of all JavaScript
 applications?

 Geoff



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak m...@apple.com wrote:


 On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:

 I'd like to understand what's going to happen with SunSpider in the future.
  Here is a set of questions and criticisms.  I'm interested in how these can
 be addressed.

 There are 3 areas I'd like to see improved in
 SunSpider, some of which we've discussed before:


 #1: SunSpider is currently version 0.9.  Will SunSpider ever change?  Or is
 it static?
 I believe that benchmarks need to be able to move with the times.  As JS
 engines change and improve, and as new areas need to be benchmarked, we
 need to be able to roll the version, fix bugs, and benchmark new features.
 The SunSpider version has not changed for ~2yrs.  How can we change this
 situation?  Are there plans for a new version already underway?


 I've been thinking about updating SunSpider for some time. There are two
 categories of changes I've thought about:

 1) Quality-of-implementation changes to the harness. Among these might be
 ability to use the harness with multiple test sets. That would be 1.0.

 2) An updated set of tests - the current tests are too short, and don't
 adequately cover some areas of the language. I'd like to make the tests take
 at least 100ms each on modern browsers on recent hardware. I'd also be
 interested in incorporating some of the tests from the v8 benchmark suite,
 if the v8 developers were ok with this. That would be SunSpider 2.0.

 The reason I've been hesitant to make any changes is that the press and
 independent analysts latched on to SunSpider as a way of comparing
 JavaScript implementations. Originally, it was primarily intended to be a
 tool for the WebKit team to help us make our JavaScript faster. However, now
 that third parties are relying on it, there are two things I want to be really
 careful about:

 a) I don't want to invalidate people's published data, so significant
 changes to the test content would need to be published as a clearly separate
 version.

 b) I want to avoid accidentally or intentionally making changes that are
 biased in favor of Safari or WebKit-based browsers in general, or that even
 give that impression. That would hurt the test's credibility. When we first
 made SunSpider, Safari actually didn't do that great on it, which I think
 helped people believe that the test wasn't designed to make us look good, it
 was designed to be a relatively unbiased comparison.

 Thus, any change to the content would need to be scrutinized in some way.
 I'm not sure what it would take to get widespread agreement that a 2.0
 content set is fair, but I agree it's time to make one soonish (before the
 end of the year probably). Thoughts on this are welcome.


 #2: Use of summing as a scoring mechanism is problematic
 Unfortunately, the sum-based scoring techniques do not withstand the test
 of time as browsers improve.  When the benchmark was first introduced, each
 test was equally weighted and reasonably large.  Over time, however, the
 test becomes dominated by the slowest tests - basically the weighting of the
 individual tests is variable based on the performance of the JS engine under
 test.  Today's engines spend ~50% of their time on just string and date
 tests.  The other tests are largely irrelevant at this point, and becoming
 less relevant every day.  Eventually many of the tests will take near-zero
 time, and the benchmark will have to be scrapped unless we figure out a
 better way to score it.  Benchmarking research which long pre-dates
 SunSpider confirms that geometric means provide a better basis for
 comparison:  http://portal.acm.org/citation.cfm?id=5673 Can future
 versions of the SunSpider driver be made so that they won't become
 irrelevant over time?


 Use of summation instead of geometric mean was a considered choice. The
 intent is that engines should focus on whatever is slowest. A simplified
 example: let's say it's estimated that the likely workload in the field will
 consist of 50% Operation A and 50% Operation B, and I can benchmark them
 in isolation. Now let's say that in implementation Foo these operations are
 equally fast, while in implementation Bar, Operation A is 4x as fast as in
 Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric
 means would imply that Foo and Bar are equally good, but Bar would actually
 be twice as slow on the intended workload.
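
A quick sketch of the arithmetic in this example, using hypothetical unit
times (Foo runs each operation in 1 time unit; Bar runs A in 0.25 and B in 4):

    // Hypothetical per-operation times (lower is better), per the example above:
    // Foo runs A and B equally fast; Bar runs A 4x faster and B 4x slower.
    var foo = { a: 1.0, b: 1.0 };
    var bar = { a: 0.25, b: 4.0 };

    function sum(t) { return t.a + t.b; }                // SunSpider-style total
    function geomean(t) { return Math.sqrt(t.a * t.b); } // geometric mean of two tests

    console.log(sum(foo), sum(bar));          // 2 vs 4.25: Bar ~2x slower on the 50/50 workload
    console.log(geomean(foo), geomean(bar));  // 1 vs 1: the 4x tradeoff cancels out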


BTW - the way to work around this is to have enough sub-benchmarks such that
this just doesn't happen.  If we have the right test coverage, it seems
unlikely to me that a code change would dramatically improve exactly one
test at a correspondingly dramatic expense to exactly one other test.  I'm
not saying it is impossible - just that code changes don't generally cause
that behavior.  To combat this we can implement a broader base of
benchmarks as well as longer-running tests that are not too micro.

This brings up another problem with summation.  The only case 

Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Peter Kasting
I'm more verbose than Mike, but it seems like people are talking past each
other.

On Tue, Jul 7, 2009 at 3:25 PM, Oliver Hunt oli...@apple.com wrote:

 If we see one section of the test taking dramatically longer than another
 then we can assume that we have not been paying enough attention to
 performance in that area,


It depends on what your goal with perf is.  If the goal is to balance
optimizations such that operation A always consumes the same time as
operation B, you are correct.  But is this always best?  The current design
says yes.  The open question is whether that is the best possible design.

On Tue, Jul 7, 2009 at 3:58 PM, Geoffrey Garen gga...@apple.com wrote:

 I also don't buy your conclusion -- that if regular expressions account for
 1% of JavaScript time on the Internet overall, they need not be optimized.


I didn't see Mike say that regexes did not need to be optimized.

If given an operation that occurs 20% of the time and another that occurs 1%
of the time, I certainly think it _might_ be appropriate to spend more
engineering effort on optimizing the first operation.  Knowing for sure
depends on how much you value the rarer cases, for reasons such as you give
next:

Second, it's important for all web apps to be fast in WebKit -- not just the
 ones that do what's common overall. Third, we want to enable not only the
 web applications of today, but also the web applications of tomorrow.


I strongly agree with these principles, but I don't see why the current
design necessarily does a better job of preserving them than all other
designs.  For example, let's say at the time SunSpider was created (and
everything was roughly equal-weighted) that one of the subtests tested a
horribly slow operation that would greatly benefit future web apps if it
improved substantially.  Unfortunately, the original equal-weighting
enshrines the slowness of this operation, relative to the others being
tested, such that if you begin to make it faster, the subtests become
unbalanced and you conclude that no further work on it is needed for the
time being.  This is a suboptimal outcome.

So in general, the question is: when some operation is slower than others,
what criteria can we use to make the best decisions about where to spend
developer effort?  Surely our greatest cost here is opportunity cost.

I accept Maciej's statement that the current design was intentional.  I also
accept that sums and geomeans each have drawbacks in guiding
decision-making.  I simply want to focus on finding the best possible design
for the framework.

For example, the framework could compute both sums _and_ geomeans, if people
thought both were valuable.  We could agree on a way of benchmarking a
representative sample of current sites to get an idea of how widespread
certain operations currently are.  We could talk with the maintainers of
jQuery, Dojo, etc. to see what sorts of operations they think would be
helpful to future apps to make faster.  We could instrument browsers to have
some sort of (opt-in) sampling of real-world workloads.  etc.  Surely
together we can come up with ways to make Sunspider even better, while
keeping its current strengths in mind.
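
A minimal sketch of the first idea - a harness report that exposes both
aggregates (the subtest names and times here are purely illustrative):

    // Hypothetical subtest results in milliseconds.
    var results = { 'string-base64': 12.1, 'regexp-dna': 9.4, 'math-cordic': 3.2 };

    function report(times) {
      var total = 0, product = 1, n = 0, name;
      for (name in times) {
        total += times[name];     // SunSpider-style sum
        product *= times[name];   // accumulate for the geometric mean
        n++;
      }
      // Expose both; deciding how to act on them is the hard part.
      return { total: total, geomean: Math.pow(product, 1 / n) };
    }

    console.log(report(results));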

PK


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 4:01 PM, Mike Belshe wrote:



I'd like benchmarks to:
a) have meaning even as browsers change over time
b) evolve.  as new areas of JS (or whatever) become important,  
the benchmark should have facilities to include that.


Fair?  Good? Bad?


I think we can't rule out the possibility of a benchmark becoming less  
meaningful over time. I do think that we should eventually produce a  
new and rebalanced set of test content. I think it's fair to say that  
time is approaching for SunSpider.


In particular, I don't think geometric means are a magic bullet. When
SunSpider was first created, regexps were a small proportion of the
total execution time in what were the fastest publicly available engines
at the time. Eventually, everything else got much faster. So at some
point, SunSpider said it might be a good idea to quadruple the speed of
regexp matching now. But if it used a geometric mean, it would always
say it's a good idea to quadruple the speed of regexp matching, unless
it omitted regexp tests entirely. From any starting point, and
regardless of the speed of other facilities, speeding up regexps by a
factor of N would always show the same improvement in your overall
score. SunSpider, on the other hand, was deliberately designed to
highlight the area where an engine most needs improvement.
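
A sketch of that property, with hypothetical subtest times: under a
geometric mean a 4x regexp speedup moves the score by the same factor no
matter how fast everything else already is, while under summation the
payoff grows as regexp time comes to dominate the total:

    function sum(ts) { var s = 0; for (var i = 0; i < ts.length; i++) s += ts[i]; return s; }
    function geomean(ts) {
      var p = 1;
      for (var i = 0; i < ts.length; i++) p *= ts[i];
      return Math.pow(p, 1 / ts.length);
    }

    var early = [100, 100, 100, 100];  // regexp (last) comparable to the rest
    var late  = [10, 10, 10, 100];     // everything else got 10x faster

    [early, late].forEach(function (ts) {
      var sped = ts.slice(0, -1).concat(ts[ts.length - 1] / 4);  // 4x regexp speedup
      console.log('geomean gain:', geomean(ts) / geomean(sped),  // always 4^(1/4), ~1.41x
                  'sum gain:', sum(ts) / sum(sped));             // ~1.23x early, ~2.36x late
    });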


I think the only real way to deal with this is to periodically revise  
and rebalance the benchmark.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Oliver Hunt
What you seem to think is better would be to repeatedly update
SunSpider every time that something gets faster, ignoring entirely
that the value in SunSpider is precisely that it has not changed.


Not quite what I'm saying :-)

I'd like benchmarks to:
a) have meaning even as browsers change over time
b) evolve.  as new areas of JS (or whatever) become important,  
the benchmark should have facilities to include that.


Fair?  Good? Bad?


It's not unreasonable, but it can't be done on a whim, and changes
cannot be made trivially.  Both re-weighting SunSpider and adding new
tests as things are made faster are incredibly hard to do soundly,
because it becomes easy to end up obscuring meaningful data.


In the context of regex, for example, say SunSpider had been reweighted
for the current generation of JS engines before anyone had looked at
regex.  Regex would not have stood out as being substantially slower,
and would likely not have been investigated, leaving everyone with
regex an order of magnitude slower than current engines.
That's why SunSpider has not been updated: after, what, a year and a
half(?), it can still show areas where performance can be improved, and
while it does that it's still useful.


So determining when it is sensible to update SunSpider is difficult.
You may be right, and find that rebalancing shows new areas where
performance can be improved, but if you're wrong you run the risk of
turning the benchmark from an actually useful development tool into
something that is only useful for producing a number at the end.


If we see one section of the test taking dramatically longer than
another then we can assume that we have not been paying enough
attention to performance in that area; this is how we originally
noticed just how slow the regex engine was.  If we had been
continually rebalancing the test over and over again we would not
have noticed this or other areas where performance could be (and
has been) improved.  It would also break SunSpider as a means for
tracking and/or preventing performance regressions.


Of course, using old versions of the benchmark for regression  
testing is not prohibited by iterating a benchmark.


But what happens when the benchmarks disagree as to what is an
improvement?  You can't improve performance with one benchmark while
testing for regressions with another.


--Oliver


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 4:20 PM, Maciej Stachowiak m...@apple.com wrote:


 On Jul 7, 2009, at 4:01 PM, Mike Belshe wrote:


 I'd like benchmarks to:
a) have meaning even as browsers change over time
b) evolve.  as new areas of JS (or whatever) become important, the
 benchmark should have facilities to include that.

 Fair?  Good? Bad?


 I think we can't rule out the possibility of a benchmark becoming less
 meaningful over time. I do think that we should eventually produce a new and
 rebalanced set of test content. I think it's fair to say that time is
 approaching for SunSpider.


I certainly agree that updating the benchmark over time is necessary :-)




 In particular, I don't think geometric means are a magic bullet.


Yes, using a geometric mean does not mean that you never need to update the
test suite.  But it does give you a lot of mileage :-)  And I think it's
closer to an industry standard than anything else (spec.org).



 When SunSpider was first created, regexps were a small proportion of the
 total execution time in what were the fastest publicly available engines
 at the time.
 Eventually, everything else got much faster. So at some point, SunSpider
 said it might be a good idea to quadruple the speed of regexp matching
 now. But if it used a geometric mean, it would always say it's a good idea
 to quadruple the speed of regexp matching, unless it omitted regexp tests
 entirely. From any starting point, and regardless of speed of other
 facilities, speeding up regexps by a factor of N would always show the same
 improvement in your overall score. SunSpider, on the other hand, was
 deliberately designed to highlight the area where an engine most needs
 improvement.


I don't think the optimization of regex would have been affected by using a
different scoring mechanism.  In both scoring methods, the score of the
slowest test is the best pick for improving your overall score.  So vendors
would still need to optimize it to keep up.


Mike




 I think the only real way to deal with this is to periodically revise and
 rebalance the benchmark.

 Regards,
 Maciej




Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:

For example, the framework could compute both sums _and_ geomeans,  
if people thought both were valuable.


That's a plausible thing to do, but I think there's a downside: if you  
make a change that moves the two scores in opposite directions, the  
benchmark doesn't help you decide if it's good or not. Avoiding  
paralysis in the face of tradeoffs is part of the reason we look  
primarily at the total score, not the individual subtest scores. The  
whole point of a meta-benchmark like this is to force ourselves to  
simplemindedly look at only one number.


We could agree on a way of benchmarking a representative sample of  
current sites to get an idea of how widespread certain operations  
currently are.  We could talk with the maintainers of jQuery, Dojo,  
etc. to see what sorts of operations they think would be helpful to  
future apps to make faster.  We could instrument browsers to have  
some sort of (opt-in) sampling of real-world workloads.  etc.   
Surely together we can come up with ways to make Sunspider even  
better, while keeping its current strengths in mind.


I think these are all good ideas. I think there's one way in which  
sampling the Web is not quite right. To some extent, what matters is  
not average density of an operation but peak density. An operation  
that's used a *lot* by a few sites and hardly used by most sites may
deserve a weighting above its average proportion of Web use. I would
like to hear input on what is inadequately covered. I tend to think  
there should be more coverage of the following:


- property access, involving at least some polymorphic access patterns
- method calls
- object-oriented programming patterns
- GC load
- programming in a style that makes significant use of closures
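
As a rough illustration of test content along those lines (hypothetical
code, not from any existing suite), a kernel mixing polymorphic property
access, method calls, closures, and allocation that generates GC load:

    // Two shapes so the area() call site sees polymorphic receivers.
    function Circle(r) { this.r = r; }
    Circle.prototype.area = function () { return Math.PI * this.r * this.r; };
    function Square(s) { this.s = s; }
    Square.prototype.area = function () { return this.s * this.s; };

    // Closure-based accumulator; allocating many of these also exercises GC.
    function makeCounter() {
      var n = 0;
      return function (x) { n += x; return n; };
    }

    var shapes = [], add = makeCounter(), total = 0, i;
    for (i = 0; i < 100000; i++) {
      shapes.push(i % 2 ? new Circle(i) : new Square(i));  // polymorphic mix
      if (i % 1000 === 0) add = makeCounter();             // churn closures for GC load
    }
    for (i = 0; i < shapes.length; i++) total = add(shapes[i].area());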

I think the V8 benchmark does a much better job of covering the first  
four of these things. I also think it overweights them, to the  
exclusion of most other considerations(*). As I mentioned before, I'd  
like to include some of V8's tests in a future SunSpider 2.0 content  
set.


It would be good to know what other things should be tested that are  
not sufficiently covered.


Regards,
Maciej

* - For example, Mozilla's TraceMonkey effort showed relatively little  
improvement on the V8 benchmark, even though it showed significant  
improvement on SunSpider and other benchmarks. I think TraceMonkey  
speedups are real and significant, so this would tend to undermine my  
confidence in the V8 benchmark's coverage. Note: I don't mean to start  
a side thread about whether the V8 benchmark is good or not, I just  
wanted to justify my remarks above. 


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Peter Kasting
On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak m...@apple.com wrote:

 On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:

 For example, the framework could compute both sums _and_ geomeans, if
 people thought both were valuable.


 That's a plausible thing to do, but I think there's a downside: if you make
 a change that moves the two scores in opposite directions, the benchmark
 doesn't help you decide if it's good or not. Avoiding paralysis in the face
 of tradeoffs is part of the reason we look primarily at the total score, not
 the individual subtest scores. The whole point of a meta-benchmark like this
 is to force ourselves to simplemindedly look at only one number.


Yes, I originally had more text like "deciding how to use these scores
would be the hard part", and this is precisely why.

I suppose that if different vendors wanted to use different criteria to
determine what to do in the face of a tradeoff, the benchmark could simply
be a data source, rather than a strong guide.  But this would make it
difficult to use the benchmark to compare engines, which is currently a key
use of SunSpider (and is a key failing, IMO, of frameworks like Dromaeo that
don't run identical code on every engine [IIRC]).

I think there's one way in which sampling the Web is not quite right. To
 some extent, what matters is not average density of an operation but peak
 density. An operation that's used a *lot* by a few sites and hardly used by
 most sites, may deserve a weighting above its average proportion of Web use.


If I understand you right, the effect you're noting is that speeding up
every web page by 1 ms might be a larger net win but a smaller perceived win
than speeding up, say, Gmail alone by 100 ms.

I think this is true.  One way to capture this would be to say that at least
part of the benchmark should concentrate on operations that are used in the
inner loops of any of n popular websites, without regard to their overall
frequency on the web.  (Although perhaps the two correlate well and there
aren't a lot of rare but peaky operations?  I don't know.)


 - GC load


I second this.  As people use more tabs and larger, more complex apps, the
performance of an engine under heavier GC load becomes more relevant.

It would be good to know what other things should be tested that are not
 sufficiently covered.


I think DOM bindings are hard to test and would benefit from benchmarking.
 No public benchmarks seem to test these well today.

* - For example, Mozilla's TraceMonkey effort showed relatively little
 improvement on the V8 benchmark, even though it showed significant
 improvement on SunSpider and other benchmarks. I think TraceMonkey speedups
 are real and significant, so this would tend to undermine my confidence in
 the V8 benchmark's coverage.


I agree that the V8 benchmark's coverage is inadequate and that the example
you mention illuminates that, because TraceMonkey definitely performs better
than SpiderMonkey in my own usage.  I wonder if there may have been an
opposite effect in a few cases where benchmarks with very simple tight loops
improved _more_ under TM than real-world code did, but I think the answer
to that is simply that benchmarks should be testing both kinds of code.

PK


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 7:01 PM, Maciej Stachowiak m...@apple.com wrote:


 On Jul 7, 2009, at 6:43 PM, Mike Belshe wrote:


 (There are other benchmarks that use summation, for example iBench, though
 I am not sure these are examples of excellent benchmarks. Any benchmark that
 consists of a single test also implicitly uses summation. I'm not sure that
 what other benchmarks do is as relevant as the technical merits.)

 Hehe - I don't think anyone has iBench except Apple :-)


 This is now extremely tangential to the original point, but iBench is
 available to the general public here: 
 http://www.lionbridge.com/lionbridge/en-US/services/software-product-engineering/testing-veritest/benchmark-software.htm
 


Thanks!




  A lot of research has been put into benchmarking over the years; there is
 good reason for these choices, and they aren't arbitrary.  I have not seen
 research indicating that summing of scores is statistically useful, but
 there are plenty of studies that have chosen geometric means.



 I think we're starting to repeat our positions at this point, without
 adding new information or really persuading each other.

 If you have research that shows statistical benefits to geometric mean
 scoring, or other new information to add, I would welcome it.


Only what is already on this thread, or Google for "geometric mean
benchmark".

Mike





 Regards,
 Maciej




Re: [webkit-dev] Iterating SunSpider

2009-07-06 Thread Zoltan Herczeg
Hi,

 Can future versions
 of the SunSpider driver be made so that they won't become irrelevant over
 time?

I feel the weighting is more of an issue here than the total runtime.
Eventually some tests become dominant, and the gain (or loss) on them
almost determines the final results.

Besides, there was a discussion about SunSpider enhancements a year ago.
We collected some new JS benchmarks and put them into a WindScorpion (it is
another name for SunSpider) extension package. However, the topic died away
after a short time.

Zoltan




Re: [webkit-dev] Iterating SunSpider

2009-07-06 Thread Maciej Stachowiak


On Jul 6, 2009, at 10:11 AM, Geoffrey Garen wrote:

So what you end up with, after a couple of years, is that the
slowest test in the suite is the most significant part of the score.
Further, I'll predict that the slowest test will most likely be the
least relevant test, because the truly important parts of JS
engines were already optimized.  This has happened with SunSpider
0.9 - the regex portions of the test became the dominant factor,
even though they were not nearly as prominent in the real world as
they were in the benchmark.  This leads to implementors optimizing
for the benchmark - and that is not what we want to encourage.


How did you determine that regex performance is not nearly as  
prominent in the real world?



For reference: in current JavaScriptCore, the one regexp-centric test
is about 4.6% of the score by time. Three of the string tests also spend
their time in regexps; however, I think those are among the tests that
most closely resemble what Web sites do. I believe the situation is
roughly similar in other competitive JavaScript engines. This is
probably not exactly proportionate, but it doesn't dominate the test. I
don't think any of this is a problem, unless one thinks the regexp
improvements in Nitro, V8 and TraceMonkey were a waste of resources.


What I have seen happen is that numeric processing and especially  
integer math became a smaller and smaller proportion of the test,  
looking at the best publicly available engines over time. I think that  
turned out to be the case because math had much more room for  
optimization in naive implementations than, say, string processing.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread Mike Belshe
On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak m...@apple.com wrote:


 On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:

 I'd like to understand what's going to happen with SunSpider in the future.
  Here is a set of questions and criticisms.  I'm interested in how these can
 be addressed.

 There are 3 areas I'd like to see improved in
 SunSpider, some of which we've discussed before:


 #1: SunSpider is currently version 0.9.  Will SunSpider ever change?  Or is
 it static?
 I believe that benchmarks need to be able to move with the times.  As JS
 engines change and improve, and as new areas need to be benchmarked, we
 need to be able to roll the version, fix bugs, and benchmark new features.
 The SunSpider version has not changed for ~2yrs.  How can we change this
 situation?  Are there plans for a new version already underway?


 I've been thinking about updating SunSpider for some time. There are two
 categories of changes I've thought about:

 1) Quality-of-implementation changes to the harness. Among these might be
 ability to use the harness with multiple test sets. That would be 1.0.


Cool



 2) An updated set of tests - the current tests are too short, and don't
 adequately cover some areas of the language. I'd like to make the tests take
 at least 100ms each on modern browsers on recent hardware. I'd also be
 interested in incorporating some of the tests from the v8 benchmark suite,
 if the v8 developers were ok with this. That would be SunSpider 2.0.


Cool.  Use of v8 tests is just fine; they're all open source.



 The reason I've been hesitant to make any changes is that the press and
 independent analysts latched on to SunSpider as a way of comparing
 JavaScript implementations. Originally, it was primarily intended to be a
 tool for the WebKit team to help us make our JavaScript faster. However, now
 that third parties are relying on it, there are two things I want to be really
 careful about:

 a) I don't want to invalidate people's published data, so significant
 changes to the test content would need to be published as a clearly separate
 version.


Of course.  Small UI nit - the current SunSpider benchmark doesn't make the
version very prominent at all.  It would be nice to make it more salient.



 b) I want to avoid accidentally or intentionally making changes that are
 biased in favor of Safari or WebKit-based browsers in general, or that even
 give that impression. That would hurt the test's credibility. When we first
 made SunSpider, Safari actually didn't do that great on it, which I think
 helped people believe that the test wasn't designed to make us look good, it
 was designed to be a relatively unbiased comparison.


Of course.



 Thus, any change to the content would need to be scrutinized in some way.
 I'm not sure what it would take to get widespread agreement that a 2.0
 content set is fair, but I agree it's time to make one soonish (before the
 end of the year probably). Thoughts on this are welcome.


 #2: Use of summing as a scoring mechanism is problematic
 Unfortunately, the sum-based scoring techniques do not withstand the test
 of time as browsers improve.  When the benchmark was first introduced, each
 test was equally weighted and reasonably large.  Over time, however, the
 test becomes dominated by the slowest tests - basically the weighting of the
 individual tests is variable based on the performance of the JS engine under
 test.  Today's engines spend ~50% of their time on just string and date
 tests.  The other tests are largely irrelevant at this point, and becoming
 less relevant every day.  Eventually many of the tests will take near-zero
 time, and the benchmark will have to be scrapped unless we figure out a
 better way to score it.  Benchmarking research which long pre-dates
 SunSpider confirms that geometric means provide a better basis for
 comparison:  http://portal.acm.org/citation.cfm?id=5673 Can future
 versions of the SunSpider driver be made so that they won't become
 irrelevant over time?


 Use of summation instead of geometric mean was a considered choice. The
 intent is that engines should focus on whatever is slowest. A simplified
 example: let's say it's estimated that the likely workload in the field will
 consist of 50% Operation A and 50% Operation B, and I can benchmark them
 in isolation. Now let's say that in implementation Foo these operations are
 equally fast, while in implementation Bar, Operation A is 4x as fast as in
 Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric
 means would imply that Foo and Bar are equally good, but Bar would actually
 be twice as slow on the intended workload.


I could almost buy this if:
   a)  we had a really really representative workload of what web pages do,
broken down into the exactly correct proportions.
   b)  the representative workload remains representative over time.

I'll argue that we'll never be very good at (a), and that (b) is impossible.

So, what 

Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread Joe Mason

Maciej Stachowiak wrote:
I think the pauses were large in an attempt to get stable, repeatable 
results, but are probably longer than necessary to achieve this. I agree 
with you that the artifacts in balanced power mode are a problem. Do 
you know what timer thresholds avoid the effect? I think this would be a 
reasonable 1.0 kind of change.


Just a gut feeling, but I suspect the exact throttling algorithm would
vary too much from machine to machine and OS version to OS version to
ever find a good threshold to avoid it.  The best thing to do would be
to have the harness turn off CPU throttling when it starts.  (This is
possible from the command line under Linux, and I assume on Mac, but
Windows might be a problem.)


Joe


Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread George Staikos


On 4-Jul-09, at 2:47 PM, Mike Belshe wrote:


#2: Use of summing as a scoring mechanism is problematic
Unfortunately, the sum-based scoring techniques do not withstand  
the test of time as browsers improve.  When the benchmark was first  
introduced, each test was equally weighted and reasonably large.   
Over time, however, the test becomes dominated by the slowest tests  
- basically the weighting of the individual tests is variable based  
on the performance of the JS engine under test.  Today's engines  
spend ~50% of their time on just string and date tests.  The other  
tests are largely irrelevant at this point, and becoming less  
relevant every day.  Eventually many of the tests will take near-zero
time, and the benchmark will have to be scrapped unless we
figure out a better way to score it.  Benchmarking research which  
long pre-dates SunSpider confirms that geometric means provide a  
better basis for comparison:  http://portal.acm.org/citation.cfm?id=5673
Can future versions of the SunSpider driver be made so that
they won't become irrelevant over time?


   Actually this doesn't happen on all CPUs.  For example, CPUs
without an FPU have very different results.  Memory performance is also
a big factor.


#3: The SunSpider harness has a variance problem due to CPU power  
savings modes.
Because the test runs a tiny amount of JavaScript (often under
10ms) followed by a 500ms sleep, CPUs will go into power savings
modes between test runs.  This radically changes the performance
measurements and makes it so that comparison between two runs is
dependent on the user's power savings mode.  To demonstrate this,
run SunSpider on two machines - one with the Windows
"balanced" (default) setting for power, and then again with "high
performance".  It's easy to see skews of 30% between these two
modes.  I think we should change the test harness to avoid such  
accidental effects.


   I've noticed this issue too.

--
George Staikos
Torch Mobile Inc.
http://www.torchmobile.com/



Re: [webkit-dev] Iterating SunSpider

2009-07-04 Thread Peter Kasting
On Sat, Jul 4, 2009 at 11:47 AM, Mike Belshe m...@belshe.com wrote:

 #3: The SunSpider harness has a variance problem due to CPU power savings
 modes.


This one worries me because it decreases the consistency/reproducibility of
test scores and makes it harder to compare engines or to track one engine's
scores over time.  For example, doing a bunch of CPU work just before
running the benchmark can affect whether and when the CPU throttles down
during the benchmark run.
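
One conceivable harness-side mitigation (a sketch only, not the actual
SunSpider driver): keep the CPU busy between test runs instead of
sleeping, so power management never sees an idle period to clock down in:

    // Spin briefly between runs rather than idling for 500ms.
    function busyWait(ms) {
      var end = Date.now() + ms;
      while (Date.now() < end) { /* keep the core busy */ }
    }

    function runSuite(tests) {
      var times = {}, name, start;
      for (name in tests) {
        busyWait(50);                 // warm gap instead of an idle sleep
        start = Date.now();
        tests[name]();
        times[name] = Date.now() - start;
      }
      return times;
    }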

Possible solution:
 The Dromaeo test suite already incorporates the SunSpider individual tests
 under a new benchmark harness which fixes all 3 of the above issues.   Thus,
 one approach would be to retire SunSpider 0.9 in favor of Dromaeo.
 http://dromaeo.com/?sunspider  Dromaeo has also done a lot of good work to
 ensure statistical significance of the results.  Once we have a better
 benchmarking framework, it would be great to build a new microbenchmark mix
 which more realistically exercises today's JavaScript.


One complaint I have heard about the Dromaeo tests (not the harness) is that
the actual JS that gets run differs from browser to browser (e.g. because it
is a direct copy of a source library that does UA sniffing).  If this is
true it means that this suite as-is isn't useful to compare engines to each
other.

However, the Dromaeo _harness_ is probably a win as-is.

Of course, changing anything about SunSpider raises the question of
tracking historical performance.  Perhaps the harness could support
versioning, or perhaps people are simply willing to say "SunSpider
1.0 scores cannot be compared to SunSpider 0.9 scores."  I believe this is
the approach the V8 benchmark takes.

PK


Re: [webkit-dev] Iterating SunSpider

2009-07-04 Thread Maciej Stachowiak


On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:

I'd like to understand what's going to happen with SunSpider in the  
future.  Here is a set of questions and criticisms.  I'm interested  
in how these can be addressed.


There are 3 areas I'd like to see improved in SunSpider, some of  
which we've discussed before:


#1: SunSpider is currently version 0.9.  Will SunSpider ever  
change?  Or is it static?
I believe that benchmarks need to be able to move with the times.   
As JS engines change and improve, and as new areas need to be
benchmarked, we need to be able to roll the version, fix bugs, and
benchmark new features.  The SunSpider version has not changed for  
~2yrs.  How can we change this situation?  Are there plans for a new  
version already underway?


I've been thinking about updating SunSpider for some time. There are  
two categories of changes I've thought about:


1) Quality-of-implementation changes to the harness. Among these might  
be ability to use the harness with multiple test sets. That would be  
1.0.


2) An updated set of tests - the current tests are too short, and  
don't adequately cover some areas of the language. I'd like to make  
the tests take at least 100ms each on modern browsers on recent  
hardware. I'd also be interested in incorporating some of the tests  
from the v8 benchmark suite, if the v8 developers were ok with this.  
That would be SunSpider 2.0.


The reason I've been hesitant to make any changes is that the press  
and independent analysts latched on to SunSpider as a way of comparing  
JavaScript implementations. Originally, it was primarily intended to  
be a tool for the WebKit team to help us make our JavaScript faster.  
However, now that third parties are relying on it, there are two things I
want to be really careful about:


a) I don't want to invalidate people's published data, so significant  
changes to the test content would need to be published as a clearly  
separate version.


b) I want to avoid accidentally or intentionally making changes that  
are biased in favor of Safari or WebKit-based browsers in general, or  
that even give that impression. That would hurt the test's  
credibility. When we first made SunSpider, Safari actually didn't do  
that great on it, which I think helped people believe that the test  
wasn't designed to make us look good, it was designed to be a  
relatively unbiased comparison.


Thus, any change to the content would need to be scrutinized in some  
way. I'm not sure what it would take to get widespread agreement that  
a 2.0 content set is fair, but I agree it's time to make one soonish  
(before the end of the year probably). Thoughts on this are welcome.




#2: Use of summing as a scoring mechanism is problematic
Unfortunately, the sum-based scoring techniques do not withstand the  
test of time as browsers improve.  When the benchmark was first  
introduced, each test was equally weighted and reasonably large.   
Over time, however, the test becomes dominated by the slowest tests  
- basically the weighting of the individual tests is variable based  
on the performance of the JS engine under test.  Today's engines  
spend ~50% of their time on just string and date tests.  The other  
tests are largely irrelevant at this point, and becoming less  
relevant every day.  Eventually many of the tests will take near- 
zero time, and the benchmark will have to be scrapped unless we  
figure out a better way to score it.  Benchmarking research which  
long pre-dates SunSpider confirms that geometric means provide a  
better basis for comparison:  http://portal.acm.org/citation.cfm?id=5673 
 Can future versions of the SunSpider driver be made so that they  
won't become irrelevant over time?


Use of summation instead of geometric mean was a considered choice.  
The intent is that engines should focus on whatever is slowest. A  
simplified example: let's say it's estimated that the likely workload in
the field will consist of 50% Operation A and 50% Operation B, and
I can benchmark them in isolation. Now let's say that in implementation
Foo these operations are equally fast, while in implementation Bar,
Operation A is 4x as fast as in Foo, while Operation B is 4x as slow  
as in Foo. A comparison by geometric means would imply that Foo and  
Bar are equally good, but Bar would actually be twice as slow on the  
intended workload.


Of course, doing this requires a judgment call on reasonable balance  
of different kinds of code, and that balance needs to be re-evaluated  
periodically. But tests based on geometric means also make an implied  
judgment call. The operations comprising each individual test are  
added linearly. The test then judges that these particular  
combinations are each equally important.





#3: The SunSpider harness has a variance problem due to CPU power  
savings modes.
Because the test runs a tiny amount of JavaScript (often under 10ms)
followed by a 500ms sleep, CPUs 

Re: [webkit-dev] Iterating SunSpider

2009-07-04 Thread Maciej Stachowiak


On Jul 4, 2009, at 1:06 PM, Peter Kasting wrote:


On Sat, Jul 4, 2009 at 11:47 AM, Mike Belshe m...@belshe.com wrote:
#3: The SunSpider harness has a variance problem due to CPU power  
savings modes.


This one worries me because it decreases the consistency/ 
reproducibility of test scores and makes it harder to compare  
engines or to track one engine's scores over time.  For example,  
doing a bunch of CPU work just before running the benchmark can  
affect whether and when the CPU throttles down during the benchmark  
run.


Possible solution:
The Dromaeo test suite already incorporates the SunSpider individual
tests under a new benchmark harness which fixes all 3 of the above  
issues.   Thus, one approach would be to retire SunSpider 0.9 in  
favor of Dromaeo.   http://dromaeo.com/?sunspider  Dromaeo has also  
done a lot of good work to ensure statistical significance of the  
results.  Once we have a better benchmarking framework, it would be  
great to build a new microbenchmark mix which more realistically  
exercises today's JavaScript.


One complaint I have heard about the Dromaeo tests (not the harness)  
is that the actual JS that gets run differs from browser to browser  
(e.g. because it is a direct copy of a source library that does UA  
sniffing).  If this is true it means that this suite as-is isn't  
useful to compare engines to each other.


However, the Dromaeo _harness_ is probably a win as-is.

Of course, changing anything about Sunspider raises the question of  
tracking historical performance.  Perhaps the harness could support  
versioning, or perhaps people are simply willing to say Sunspider  
1.0 scores cannot be compared to Sunspider 0.9 scores.  I believe  
this is the approach the V8 benchmark takes.


I think versioning the test content is right, and I think we should do  
that over time. I think a harness change to avoid triggering  
powersaving mode on Windows would be a reasonable thing to do to the  
harness without a version change. I don't think Dromaeo is a good  
choice of harness - I don't think their results are stable enough and  
I am not confident in the statistical soundness of their methodology.


Regards,
Maciej
