> Actually, no; the UDF is a replica of the Hive implementation of your
> definition, which Christian wrote.

Interesting. BTW I never made any page view definition, just spent time over the years to understand the real legacy definition and its deficiencies.

-----Original Message-----
From: analytics-boun...@lists.wikimedia.org [mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of Oliver Keyes
Sent: Friday, March 13, 2015 0:44
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Technical] final pageviews QA

On 12 March 2015 at 19:41, Erik Zachte <ezac...@wikimedia.org> wrote:
>>>> Well, again; the wikistats data that Erik refers to doesn't have
>>>> any granularity within the period this dataset covers.
>
> So I just uploaded
> https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png
> which shows daily page views as collected by webstatscollector since
> 2008 and published in hourly projectcounts files in
> https://dumps.wikimedia.org/other/pagecounts-raw/
> and aggregated by Wikistats per project (by week, month, day of week) and
> published in e.g.
> http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm
> (Wikipedia only, but webstatscollector doesn't report on any huge
> PV increase for other projects)
>
> My initial comment in this thread (again) is that you defined a 'legacy'
> definition yourself, and built a script to implement your legacy definition.

Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.

> Which is fine with me, the more data points the better, but should not be
> confused with vetting new vs old stats.
> The old stats we published for many years, using which I will dub from now on
> the 'real legacy definition'.
> That real legacy definition, with all of its known deficiencies, is what will
> matter for our veteran users and any discrepancy from there needs explaining.
>
> Since it's all in your head now, and you spent a long time to get it there,
> I'd still recommend you finish this off and explain what has changed rather
> than looking to a new person to do this.

Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions. Again, if Toby wishes to ask Erik if he can borrow me, that's fine too.

>
> -----Original Message-----
> From: analytics-boun...@lists.wikimedia.org
> [mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of Oliver
> Keyes
> Sent: Friday, March 13, 2015 0:00
> To: A mailing list for the Analytics Team at WMF and everybody who has an
> interest in Wikipedia and analytics.
> Subject: Re: [Analytics] [Technical] final pageviews QA
>
> Hmn. And now the UDF, the hive query, and the monthly aggregate of the hive
> query, all disagree with
> http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm .
> All of the aforementioned sources come up with 24bn, not 20.38. Erik, how is
> your data constructed from the pagecounts files, exactly? It's not made clear.
>
> I'd find it easier to believe it was an implementation problem if the UDF and
> hive query didn't agree. Could it be some distinction in how the subsidiary
> hive table is turned into stats.wikimedia.org numbers, from the "raw" count
> of pageviews?
>
> In any case, this is now going somewhat beyond "Oliver, please run a quick
> final check on the final definition"; that check has been run and shows a
> pretty stable definition, without any odd day-to-day yo-yoing and a clear
> week/weekend pattern, which is what we expect.
> For additional analysis, I'd suggest either assigning someone to this task
> (presumably whoever is maintaining the definition now) or, of course, asking
> Erik if you could borrow me. I'm always happy to help out when I have the
> time :).
>
> On 12 March 2015 at 18:43, Oliver Keyes <oke...@wikimedia.org> wrote:
>> Certainly; running now.
>>
>> On 12 March 2015 at 18:33, Toby Negrin <tneg...@wikimedia.org> wrote:
>>> Can we compare the monthly totals?
>>>
>>> On Thu, Mar 12, 2015 at 3:29 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>>
>>>> Well, again; the wikistats data that Erik refers to doesn't have
>>>> any granularity within the period this dataset covers. Monthly data
>>>> misses sub-monthly noise - like a massive transition that only
>>>> kicks in on the day-by-day.
>>>>
>>>> On 12 March 2015 at 18:21, Toby Negrin <tneg...@wikimedia.org> wrote:
>>>> > I'm also confused. As I understand it, stats.wikimedia.org is
>>>> > consuming the data that is represented by the green line in your
>>>> > graph. Therefore we would see this drop in the wikistats data
>>>> > that Erik referred to, but we don't.
>>>> > I think we need to understand why this is so.
>>>> >
>>>> > -Toby
>>>> >
>>>> > On Thu, Mar 12, 2015 at 3:10 PM, Oliver Keyes
>>>> > <oke...@wikimedia.org> wrote:
>>>> >>
>>>> >> Well, I'm no longer our resident anything expert, merely /a/
>>>> >> anything expert :).
>>>> >>
>>>> >> The "concoction", as you put it, comes from the
>>>> >> webrequest_all_sites data that is consumed by
>>>> >> stats.wikimedia.org's primary report - I can't speak for how the
>>>> >> dashboard you're linking to is constructed.
>>>> >> Perhaps you could? I doubt this is a "concoction" problem given
>>>> >> that, as you will note if you've studied the visualisations,
>>>> >> both the UDF and the hive query implementation (which were
>>>> >> written by two different people, and code reviewed by two /more/
>>>> >> people) agree that this dramatic, unexplained and untracked drop
>>>> >> happened. And, since we've been using the hive query
>>>> >> implementation for all our high-level numbers for about six
>>>> >> months, a bug of this magnitude in the /implementation/ of the
>>>> >> definition would be....worrying.
>>>> >>
>>>> >> Indeed, your report says 20B per month (again, is it drawing
>>>> >> from the same data source as the aggregate, high-level number?)
>>>> >> - I never claimed 1.1B a day, you did. Instead, it started off
>>>> >> as approximately 1.1-1.2Bn, before dropping down to between 600m
>>>> >> and 700m, where it has resided ever since. That sounds,
>>>> >> averaged, like approximately 0.75B, no? The disadvantage of
>>>> >> comparing a single monthly number against a more granular dataset.
>>>> >>
>>>> >> On 12 March 2015 at 17:55, Erik Zachte <ezac...@wikimedia.org> wrote:
>>>> >> > I'd rather see you explain this, Oliver, as our incumbent page
>>>> >> > views expert.
>>>> >> > Your concoction of legacy PV seems to suggest 'Old definition, UDF'
>>>> >> > was about 1.1B per day.
>>>> >> >
>>>> >> > Yet
>>>> >> > http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm
>>>> >> > shows 20B per month, 0.75B per day
>>>> >> >
>>>> >> > Erik
>>>> >> >
>>>> >> > -----Original Message-----
>>>> >> > From: analytics-boun...@lists.wikimedia.org
>>>> >> > [mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of
>>>> >> > Oliver Keyes
>>>> >> > Sent: Thursday, March 12, 2015 19:38
>>>> >> > To: A mailing list for the Analytics Team at WMF and everybody
>>>> >> > who has an interest in Wikipedia and analytics.
>>>> >> > Subject: [Analytics] [Technical] final pageviews QA
>>>> >> >
>>>> >> > Hey all,
>>>> >> >
>>>> >> > After the patches to the definition following the previous
>>>> >> > hand-coding run (see older threads) I've run a second set of
>>>> >> > tests. These can be seen at
>>>> >> > https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and
>>>> >> > https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2.png
>>>> >> >
>>>> >> > There's nothing particularly shocking in the new definition;
>>>> >> > it follows the seasonal pattern that we're used to.
>>>> >> > I think we can call the new definition done, with these tweaks! It's also
>>>> >> > not as unstable as the legacy definition (good luck to whoever
>>>> >> > now has the responsibility of explaining why pageviews
>>>> >> > abruptly halved in the middle of February).
>>>> >> >
>>>> >> >
>>>> >> > Have fun,
>>>> >> > --
>>>> >> > Oliver Keyes
>>>> >> > Research Analyst
>>>> >> > Wikimedia Foundation
>>>> >> >
>>>> >> > _______________________________________________
>>>> >> > Analytics mailing list
>>>> >> > Analytics@lists.wikimedia.org
>>>> >> > https://lists.wikimedia.org/mailman/listinfo/analytics
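
The disagreement in the thread above turns on simple arithmetic: a single monthly total, or the per-day average derived from it, cannot show a step change that happens partway through the month. A minimal Python sketch of that point follows. The daily figures are hypothetical, chosen only to sit inside the ranges quoted in the thread (roughly 1.1-1.2 billion views per day before the drop, 600-700 million after); they are not the real webstatscollector or Hive numbers, and the function names are made up for illustration.

```python
def monthly_summary(daily_views):
    """Return (monthly total, mean views per day) for a list of daily counts."""
    total = sum(daily_views)
    return total, total / len(daily_views)


def largest_day_to_day_drop(daily_views):
    """Return (index, relative drop) for the biggest fall between consecutive days.

    The index is the 0-based position of the day on which the lower value
    is first seen; the drop is expressed as a fraction of the previous day.
    Returns (None, 0.0) if the series never falls.
    """
    best_index, best_drop = None, 0.0
    for i in range(1, len(daily_views)):
        previous, current = daily_views[i - 1], daily_views[i]
        if previous <= 0:
            continue  # avoid dividing by zero on empty days
        drop = (previous - current) / previous
        if drop > best_drop:
            best_index, best_drop = i, drop
    return best_index, best_drop


# Hypothetical 28-day month, in billions of page views per day:
# two weeks on a high plateau, then two weeks on a low one.
HIGH, LOW = 1.15, 0.65  # illustrative values, not measured data
daily = [HIGH] * 14 + [LOW] * 14

total, per_day = monthly_summary(daily)
day, drop = largest_day_to_day_drop(daily)

print(f"monthly total: {total:.1f}B, mean per day: {per_day:.2f}B")
print(f"largest drop: {drop:.0%} on day index {day}")
```

With these made-up inputs the monthly mean lands between the two plateaus, so the month as a whole looks unremarkable, while the daily series shows the step directly. Whether this explains the gap between the 24bn and 20.38bn figures discussed in the thread is a separate question that the sketch does not address.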