A few comments:
1) it seems like stddev is really a function of the testing methodology, not the actual measurements, and totally inappropriate for this chart. How about we just assume that tests over a certain stddev are simply bad data? We shouldn't all have to learn which stddev values are good and which are bad to interpret the chart. Just have a threshold, and if a test is over the threshold, blank out that test result because the actual measurements are totally bogus.
2) instead of "0.6" vs. "time", how about "target" vs. "actual" or "target" vs. "current"? Jeffrey just said "wait, is 0.6 the target? or is it a typo and it should be 0.5?"
3) if we're putting "s" in the seconds column, why can't we put "%" in the percent column? Again, I look and see "13" and don't know what that means.
4) Why can't we put "+" in front of positive delta values? delta by definition is +/-, so I find leaving out the "+" to be very confusing.
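To make point 1 concrete, here's roughly what I have in mind (a hypothetical sketch; the 10% threshold and the function name are made up, not anything in the tbox code):

```python
# Sketch: treat any test whose runs are too noisy as bad data and blank
# the cell, instead of asking readers to interpret stddev themselves.
from statistics import mean, stdev

MAX_REL_STDDEV = 0.10  # assumed cutoff: stddev above 10% of the mean = bogus

def report_cell(samples):
    """Return a formatted time, or a blank cell if the data is too noisy."""
    m = mean(samples)
    if stdev(samples) > MAX_REL_STDDEV * m:
        return "--"          # bad data: don't report a number at all
    return "%.2fs" % m

print(report_cell([9.93, 9.95, 9.94]))   # stable run: prints 9.94s
print(report_cell([5.0, 15.0, 10.0]))    # noisy run: prints --
```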

Alec

Heikki Toivonen wrote:
Philippe Bossut wrote:
I think the point made by Bryan and Katie (and I have to agree with
them) is that there's no way to see at a glance whether we reached
acceptable performance or not. The color code currently used is
misleading (e.g. why is the Linux perf of importing a 3000-event
calendar green? It's far from acceptable. If it's just to show that it
is better than 0.5, that's easy to see at a glance anyway...)

Ok, here's my latest proposal:

    |0.6   |Windows (r 7503)       |OSX (r 7500)          | Linux col
Test|Target|                 |std  |                |std  | here
    |      |time  |d %|d t   |dev  |time |d %|d t   |dev  |
===========================================================
#1  | 10 s |9.94s |-2%|-0.02s|0.01s|18.2s|-1%|-0.02s|0.04s|
#2  | 1 s  |1.14s |0% |0s    |0.00s|2.24s|0% |0s    |0.01s|
...
   [Previous results][Help]

* The first column is the test's short description. Where should the
link lead?

* The second column is the 0.6 target time.

* Columns 3-6 are the Windows results, the next 4 are for Mac, and the
last 4 for Linux (omitted here).

** The top row gives the platform and the revision these numbers are from.

** Rows 2-3 are the actual column headers: time, delta %, delta time,
standard deviation. The deltas are compared to the last measurement on
that platform. The std deviation is there to let you know how likely it
is that the change was just random noise: if the change is less than
the std dev, it almost certainly was noise (but if it stays
consistently at the new value, then it was a real change).

I think it is crucial to report the difference to the previous
measurements, because whenever you check in, you should check if your
checkin made a difference compared to the previous results.

Likewise, since in my opinion the most critical piece of information
here is the trend, the most noticeable coloring should happen based on
the deltas to the previous run. If you made it slower (a change above
the std dev limit), it should show up as an orange or red cell
background, depending on how bad it was. If it got noticeably better,
it should show a green background. If the change was within the std
dev, don't color it, because we don't know whether it is a real change
or just noise.
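In code, the background rule would be roughly this (a sketch with assumed names; the 10% red cutoff matches the number I float further down, and all thresholds are up for debate):

```python
# Sketch: pick a cell background from the delta vs. the previous run.
# delta is the relative change (positive = slower); changes within the
# std dev are treated as noise and left uncolored.
def cell_background(delta, std_dev, bad_threshold=0.10):
    if abs(delta) <= std_dev:
        return None            # within noise: no color
    if delta > bad_threshold:
        return "red"           # much slower: tree on fire
    if delta > 0:
        return "orange"        # noticeably slower
    return "green"             # noticeably faster

print(cell_background(0.0, 0.01))    # None (noise)
print(cell_background(0.05, 0.01))   # orange
print(cell_background(0.50, 0.01))   # red
print(cell_background(-0.02, 0.01))  # green
```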

Now I could see maybe drawing colored text that would indicate how the
actual measured time compares to the target. If the measured time is
below target, green; slightly above, orange; and if way above, red.

The color thresholds would be up for debate. Std dev plays a role, but
currently the std dev is very small, so it shouldn't matter much. I
think a cell should turn orange as soon as it is noticeably (more than
std dev) on the worse side. Red... hmm, I'd like to put that threshold
pretty low for deltas at least: say, a 10% change for the worse sets
the tree on fire. Btw, we would also need to decide what to do if that
happens, and in what situations it is acceptable to make perf worse.
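To make the target-based text coloring concrete, here's a rough sketch (the "slightly above" cutoff of 10% over target is my assumption, since the thresholds are explicitly up for debate):

```python
# Sketch: pick a text color from the measured time vs. the target time.
# A time within one std dev of the target counts as meeting it.
def target_color(time, target, std_dev, way_above=1.10):
    if time <= target + std_dev:
        return "green"                 # at or below target (within noise)
    if time <= target * way_above:
        return "orange"                # slightly above target
    return "red"                       # way above target

print(target_color(9.94, 10.0, 0.01))   # green
print(target_color(10.5, 10.0, 0.01))   # orange
print(target_color(18.2, 10.0, 0.04))   # red
```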

The previous results link would open a new page which would have the
latest + n number of previous tables stacked as a history on the page.
At a later date it could contain the perf trend graph as well. Help link
would open some docs on how to read the results, how tests are run etc.

See attachment for html mockup.




tbox sample:

                   |0.6   |Windows (r 7503)        |OSX (r 7502)            |Linux (r 7503)
Test               |Target|time  |Δ % |Δ time|std  |time  |Δ % |Δ time|std  |time  |Δ % |Δ time|std
                   |      |      |    |      |dev  |      |    |      |dev  |      |    |      |dev
====================================================================================================
#1 startup         | 10 s |9.94s |-2% |-0.02s|0.01s|18.2s |-1% |-0.02s|0.06s|7.86s |0%  |0s    |0.00s
#2 new event (menu)| 1 s  |1.14s |0%  |0s    |0.00s|2.22s |+50%|+1.24s|0.02s|0.986s|+5% |+0.06s|0.01s

_______________________________________________
Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev
