Hi all!

I've been seeing a lot of people start using raptor to test the performance
of their patches/code, especially in the context of 2.2 -> 2.5 regressions.

That's awesome!

Now, on top of that, :stas has developed a neat app that helps you get *more* 
out of those tests. In particular, it helps you learn if the difference you see 
is statistically significant[0].

That's important. Not perfect yet, but super important. What it means is that 
it answers the question of whether the change you see can be explained by 
fluctuations in results within your test.
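
If you want a feel for what that kind of check involves, here's a rough
Python sketch. This is NOT raptor-compare's actual code - I'm assuming a
plain two-sample (Welch's) t-test here, and the timings are made up:

from scipy.stats import ttest_ind

# Made-up visuallyLoaded timings (ms) from eight runs each:
before = [738, 742, 731, 745, 736, 740, 733, 739]  # without the patch
after  = [721, 725, 718, 729, 720, 723, 716, 722]  # with the patch

# Welch's t-test: could run-to-run noise alone explain the gap?
stat, p_value = ttest_ind(before, after, equal_var=False)
print("p-value:", round(p_value, 3))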

So instead of trying to guess whether the 100ms visuallyLoaded difference you 
see between two test results is real, install raptor-compare and follow the 
steps below:

1) Remove "metrics.ldjson" from the directory you are in
2) Run your raptor test with as many runs as you can
3) Apply your change
4) Run your raptor test with the same number of runs
5) raptor-compare ./metrics.ldjson

zbraniecki@rivia:~$ raptor-compare ./metrics.ldjson
fm.gaiamobile.org      base: mean  1: mean  1: delta  1: p-value
---------------------  ----------  -------  --------  ----------
navigationLoaded              528      524        -4        0.72
navigationInteractive         738      721       -17        0.77
visuallyLoaded                738      721       -17        0.77
contentInteractive            738      722       -17        0.76
fullyLoaded                   923      903       -19        0.59
rss                        29.595   29.412    -0.183      * 0.02
uss                        11.098   11.001    -0.098      * 0.04
pss                        15.050   14.970    -0.080      * 0.03
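
(If you're curious, metrics.ldjson is just line-delimited JSON, one entry 
per line, so you can poke at it yourself with something like the sketch 
below. Note that the "name"/"value" field names are my assumption for 
illustration, not raptor's documented schema:)

import json
from collections import defaultdict
from statistics import mean

# Group every recorded value by metric name. Field names here are
# assumed; check an actual metrics.ldjson line first.
values = defaultdict(list)
with open("metrics.ldjson") as f:
    for line in f:
        entry = json.loads(line)
        values[entry["name"]].append(entry["value"])

for metric, runs in sorted(values.items()):
    print(metric, round(mean(runs), 3))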

Reading the results - the most important thing is the little asterisk next to 
the p-value[1]. If the p-value is below 5%, it suggests that the observed data 
is not consistent with the assumption that there is no difference between the 
two groups.

In this example, it means there's only about a 4% chance of seeing a USS 
difference this large (almost 100 KB) if the patch made no real difference.
At the same time, the 20ms difference in fullyLoaded can be totally random.

If you are getting a p-value above 5%, you should reduce your trust in your 
results and consider rerunning your tests with more runs.
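
To see why more runs help, here's a quick simulation (made-up numbers: a 
real 20ms improvement buried in 50ms of run-to-run noise):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
for runs in (5, 30, 200):
    # Hypothetical fullyLoaded times: the patch really is 20ms faster,
    # but each run varies by ~50ms either way.
    base    = rng.normal(920, 50, runs)
    patched = rng.normal(900, 50, runs)
    _, p = ttest_ind(base, patched, equal_var=False)
    print(runs, "runs per side -> p-value", round(p, 3))

With only a handful of runs, the very same improvement will routinely fail 
to reach significance; with a couple hundred, it almost always will.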

Hope that helps!
zb.



[0] https://en.wikipedia.org/wiki/Statistical_significance
[1] https://en.wikipedia.org/wiki/P-value