Thanks for sharing the article, SJ, and the additional details, Peter! Just wanted to mention, tangentially related, that there is one place in Wikimedia where anomaly detection is used for monitoring "performance": detecting instances of Wikipedia outages (often censorship). More details in this blog post: https://techblog.wikimedia.org/2021/01/15/censorship-outages-and-internet-shutdowns-monitoring-wikipedias-accessibility-around-the-world/
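To give a flavor of what that kind of detection can look like (purely illustrative; this is not the actual pipeline described in the post, and the function and parameter names are made up), a minimal Python sketch that flags hours where a country's request counts drop far below the recent median:

    import statistics

    def detect_drops(counts, window=24, k=5.0):
        # counts: hourly request counts for one country (hypothetical input).
        # Flag hours whose count falls far below the median of the
        # preceding window, using median absolute deviation (MAD) as
        # the measure of typical variation.
        anomalies = []
        for i in range(window, len(counts)):
            recent = counts[i - window:i]
            med = statistics.median(recent)
            mad = statistics.median([abs(c - med) for c in recent]) or 1.0
            if (med - counts[i]) / mad > k:  # large drop vs. typical variation
                anomalies.append(i)
        return anomalies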
Best,
Isaac

On Mon, Jan 31, 2022 at 8:30 AM <pe...@wikimedia.org> wrote:
> Hi Samuel,
> my name is Peter and I work in the performance team. I also read the post
> and found it interesting. Our performance metrics are viewable in Grafana;
> a good starting point is the performance summary dashboard:
> https://grafana.wikimedia.org/d/cZgMg49Wz/performance-summary. We have
> many dashboards but we lack some documentation, so please ask and I can
> guide you.
>
> We collect and keep track of performance metrics directly from our users,
> we run synthetic browser tests every X hours where we record a video of
> the browser screen and collect visual metrics, and we also run some tests
> on commits.
>
> The largest research we've done in this area is the study Gilles did on
> the correlation between what users perceive and browser metrics:
> https://techblog.wikimedia.org/2019/06/17/performance-perception-correlation-to-rum-metrics/
> and the paper https://nonsns.github.io/paper/rossi19www.pdf.
>
> For regressions, I've gone down the same path as the people at Netflix,
> trying different numbers of runs, taking the median/fastest/slowest run,
> etc., to find more "stable" metrics. We don't proxy performance by memory
> usage; we focus more on visual metrics for the users, and for us we need
> to do more than three runs. We do 5-11 runs depending on what we test. I
> haven't blogged about that work but it should be in some Phabricator
> tasks; I can look it up if you are interested. What is also interesting
> is what kind of practical regression you could find. In our most trimmed
> systems I think we can find performance regressions that are slightly
> over 2%, but there are parts where the regression needs to be 10-20% for
> us to get an alert.
>
> I wrote a blog post a couple of years ago about one regression:
> https://techblog.wikimedia.org/2018/10/03/best-friends-forever/
>
> I like the use of anomaly detection; we discussed it in the teams some
> time ago but we haven't tried it out. Today we mostly use static
> thresholds in some form. I think a tool for anomaly detection would be
> something many teams could use.
>
> I really like that they have statistics about false alerts etc. We don't
> have that today, but we should. I started to keep track of them manually,
> but hmm, I failed :)
>
> Best
> Peter Hedenskog

--
Isaac Johnson (he/him/his)
Research Scientist
Wikimedia Foundation
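P.S. For anyone curious, a minimal sketch of the multi-run median comparison Peter describes (hypothetical numbers and names, not the performance team's actual tooling):

    import statistics

    def is_regression(baseline_runs, new_runs, threshold=0.02):
        # Runs could be e.g. SpeedIndex values in ms from repeated tests.
        # Use the median of each set of runs to damp run-to-run noise,
        # and flag a regression if the new median is more than
        # `threshold` (here 2%) slower than the baseline median.
        baseline = statistics.median(baseline_runs)
        current = statistics.median(new_runs)
        return (current - baseline) / baseline > threshold

    # With 5 runs per build (Peter mentions 5-11):
    # is_regression([1200, 1180, 1210, 1195, 1205],
    #               [1260, 1240, 1255, 1248, 1262])
    # -> True (median 1255 vs. 1200, ~4.6% slower)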