> > 1:1000 sampling is fine for frwiki thumbnail clicks, but not for
> > cawiki fullscreen button presses

Since the issue is the global load, I think it'd be resolved by changing
the sampling rate for the large wikis only. The small ones going back to
1:1 would be fine, as they contribute little to the global load. Is there
a way to set different PHP settings for small wikipedias than for large
ones, though?

> - whenever we display total counts, we use sum(sampling_rate) instead
> of count(*)

The query for actions is a bit more complex:
https://git.wikimedia.org/blob/analytics%2Fmultimedia.git/1fa576fabbf6598f064e4d05a59171a92bdd2033/actions%2Ftemplate.sql
but "THEN sampling_rate ELSE 0" should work, afaik.
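For illustration, here is a minimal sketch of what such a count query
could look like once the field exists. The table and event_* column
names are hypothetical stand-ins for the real schema revision tables:

    -- Estimated daily thumbnail clicks, reconstructed from sampled
    -- events. Each logged row stands in for event_samplingRate real
    -- events, so we sum the rates instead of counting rows.
    SELECT
        DATE(timestamp) AS day,
        SUM(CASE WHEN event_action = 'thumbnail'
                 THEN event_samplingRate
                 ELSE 0 END) AS estimated_thumbnail_clicks
    FROM MediaViewer
    GROUP BY day;

With a uniform 1:1000 rate this reduces to count(*) * 1000; the CASE
form just keeps working when different actions or wikis are sampled at
different rates.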
> - whenever we display geometric means, we weight by sampling rate
> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> exp(avg(ln(value))))

I don't follow the logic here. Like percentiles, averages should be
unaffected by sampling, geometric or not.

I'll go ahead and write changesets to add sampling_rate to the schemas
and to Media Viewer's code; we're going to need that anyway.
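For reference, the proposed weighting would look something like this in
SQL (column names again hypothetical; the WHERE clause is needed
because ln() is undefined for non-positive values):

    -- Geometric mean of load times, weighted by sampling rate so that
    -- a row logged at 1:1000 counts for ten times as much as a row
    -- logged at 1:100.
    SELECT
        EXP(SUM(event_samplingRate * LN(event_duration))
            / SUM(event_samplingRate)) AS weighted_geometric_mean
    FROM MultimediaViewerDuration
    WHERE event_duration > 0;

With a single uniform sampling rate the weights cancel out and this is
identical to exp(avg(ln(event_duration))), which is consistent with
averages being unaffected by sampling in that case; the weighting only
changes the result when rows with different rates are mixed.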
On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza <gti...@wikimedia.org> wrote:

> On Fri, May 16, 2014 at 9:34 AM, Ori Livneh <o...@wikimedia.org> wrote:
>
>> On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo)
>> <nemow...@gmail.com> wrote:
>>
>>> * From 40 to 260 events logged per second in a month: what's going on?
>>
>> Eep, thanks for raising the alarm. MediaViewer is 170 events / sec,
>> MultimediaViewerDuration is 38 / sec.
>>
>> +CC Multimedia.
>
> After an IRC discussion we added 1:1000 sampling to both of those
> schemas. I'll need a little help fixing things on the data processing
> side; I'll give a short description of how we use the data first.
>
> A MediaViewer event represents a user action (e.g. clicking on a
> thumbnail, or using the back button in the browser while the lightbox
> is open). The most used actions are (were, before the sampling) logged
> a few million times a day; the least used ones, less than a thousand
> times. We use the data to display graphs like this:
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab
> There are also per-wiki graphs; there are about three orders of
> magnitude of difference between the largest and the smallest wikis
> (it will be more once we roll out on English).
>
> A MultimediaViewerDuration event contains data about how long the user
> had to wait (such as the milliseconds between clicking the thumbnail
> and displaying the image). This is fairly new and we don't have graphs
> yet, but they will look something like these (which show the latency
> of our network requests):
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#overall_network_performance-graphs-tab
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab
> That is, they are used to calculate a geometric mean and various
> percentiles, with per-wiki and per-country breakdowns.
>
> What I would like to understand is: 1) how we need to modify these
> charts to account for the sampling, and 2) how we can make sure the
> sampling does not result in loss of low-volume data (e.g. from wikis
> which have less traffic).
>
> == How to take the sampling into account ==
>
> For the activity charts which show total event counts, this is easy:
> we just need to multiply the count by the sampling ratio.
>
> For percentile charts, my understanding is (thanks for the IRC advice,
> Nuria and Leila!) that they remain accurate as long as the sample is
> large enough; the best practice is to sample at least 1000 events per
> bucket (so 10,000 altogether if we are looking for the 90th
> percentile, 100,000 if we are looking for the 99th percentile, etc.).
>
> I'm still looking for an answer on what effect sampling has on
> geometric means.
>
> == How to handle data sources with very different volumes ==
>
> As I said above, there are about three orders of magnitude of
> difference between the data volumes of frequent and rare user actions,
> and also between large and small wikis (probably even more between
> countries - if you look at the map linked above, you can see that some
> African countries are missing: we use 1:1000 sampling and haven't
> collected a single data point there yet).
>
> So to get a proper amount of data, we would probably need to vary the
> sampling per wiki or country, and also per action: 1:1000 sampling is
> fine for frwiki thumbnail clicks, but not for cawiki fullscreen button
> presses. The question is, how do we mix different data sources? For
> example, we might decide to sample thumbnail clicks 1:1000 on enwiki
> but only 1:100 on dewiki, and then we want to show a graph of global
> clicks which includes both enwiki and dewiki counts.
>
> Here is what I came up with:
> - we add a "sampling rate" field to all our schemas
> - the rule to determine the sampling rate of a given event (i.e. the
> reciprocal of the probability of the event getting logged) can be as
> complex as we like, as long as the logging code saves that number as
> well
> - whenever we display total counts, we use sum(sampling_rate) instead
> of count(*)
> - whenever we display percentiles, we ignore sampling rates; they
> should not influence the result even if we consider data from multiple
> sources with mixed sampling rates (I'm not quite sure about this one)
> - whenever we display geometric means, we weight by sampling rate
> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> exp(avg(ln(value))))
>
> Do you think that would yield correct results?
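To make the enwiki/dewiki example above concrete, here is a sketch
(same hypothetical column names as before) of a global count that
mixes 1:1000 enwiki rows with 1:100 dewiki rows:

    -- Because every row carries its own rate, a single SUM combines
    -- sources with different sampling rates correctly.
    SELECT
        DATE(timestamp) AS day,
        SUM(event_samplingRate) AS estimated_global_clicks
    FROM MediaViewer
    WHERE event_action = 'thumbnail'
      AND wiki IN ('enwiki', 'dewiki')
    GROUP BY day;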