> Actually, no; the UDF is a replica of the Hive implementation of your 
> definition, which Christian wrote.

Interesting. BTW I never made any page view definition; I just spent time over 
the years understanding the real legacy definition and its deficiencies.

-----Original Message-----
From: analytics-boun...@lists.wikimedia.org 
[mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of Oliver Keyes
Sent: Friday, March 13, 2015 0:44
To: A mailing list for the Analytics Team at WMF and everybody who has an 
interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Technical] final pageviews QA

On 12 March 2015 at 19:41, Erik Zachte <ezac...@wikimedia.org> wrote:
>>>> Well, again; the wikistats data that Erik refers to doesn't have 
>>>> any granularity within the period this dataset covers.
>
> So I just uploaded 
> https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png
> which shows daily page views as collected by webstatscollector since 
> 2008 and published in hourly projectcounts files in 
> https://dumps.wikimedia.org/other/pagecounts-raw/
> and aggregated by Wikistats per project (by week, month, day of week) and 
> published in e.g.
> http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm
> (Wikipedia only, but webstatscollector doesn't report on any huge 
> PV increase for other projects)
>
> My initial comment in this thread (again) is that you defined a 'legacy' 
> definition yourself, and built a script to implement your legacy definition.

Actually, no; the UDF is a replica of the Hive implementation of your 
definition, which Christian wrote.

> Which is fine with me, the more data points the better, but should not be 
> confused with vetting new vs old stats.
> The old stats we published for many years used what I will dub from now on 
> the 'real legacy definition'.
> That real legacy definition, with all of its known deficiencies, is what will 
> matter for our veteran users, and any discrepancy from it needs explaining.
>
> Since it's all in your head now, and you spent a long time to get it there, 
> I'd still recommend you finish this off and explain what has changed rather 
> than looking to a new person to do this.

Unfortunately I've been moved from R&D, and don't have the time to answer 
endless "just one more thing..." questions. Again, if Toby wishes to ask Erik 
if he can borrow me, that's fine too.

>
> -----Original Message-----
> From: analytics-boun...@lists.wikimedia.org 
> [mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of Oliver 
> Keyes
> Sent: Friday, March 13, 2015 0:00
> To: A mailing list for the Analytics Team at WMF and everybody who has an 
> interest in Wikipedia and analytics.
> Subject: Re: [Analytics] [Technical] final pageviews QA
>
> Hmn. And now the UDF, the Hive query, and the monthly aggregate of the Hive 
> query all disagree with 
> http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm .
> All of the aforementioned sources come up with 24bn, not 20.38bn. Erik, how is 
> your data constructed from the pagecounts files, exactly? It's not made clear.
>
> I'd find it easier to believe it was an implementation problem if the UDF and 
> hive query didn't agree. Could it be some distinction in how the subsidiary 
> hive table is turned into stats.wikimedia.org numbers, from the "raw" count 
> of pageviews?
>
> In any case, this is now going somewhat beyond "Oliver, please run a quick 
> final check on the final definition"; that check has been run and shows a 
> pretty stable definition, without any odd day-to-day yo-yoing and a clear 
> week/weekend pattern, which is what we expect.
> For additional analysis, I'd suggest either assigning someone to this task 
> (presumably whoever is maintaining the definition now) or, of course, asking 
> Erik if you could borrow me. I'm always happy to help out when I have the 
> time :).
>
> On 12 March 2015 at 18:43, Oliver Keyes <oke...@wikimedia.org> wrote:
>> Certainly; running now.
>>
>> On 12 March 2015 at 18:33, Toby Negrin <tneg...@wikimedia.org> wrote:
>>> Can we compare the monthly totals?
>>>
>>> On Thu, Mar 12, 2015 at 3:29 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>>
>>>> Well, again; the wikistats data that Erik refers to doesn't have 
>>>> any granularity within the period this dataset covers. Monthly data 
>>>> misses sub-monthly noise - like a massive transition that only 
>>>> kicks in on the day-by-day.
>>>>
>>>> On 12 March 2015 at 18:21, Toby Negrin <tneg...@wikimedia.org> wrote:
>>>> > I'm also confused. As I understand it, stats.wikimedia.org is 
>>>> > consuming the data that is represented by the green line in your 
>>>> > graph. Therefore we would see this drop in the wikistats data 
>>>> > that Erik referred to, but we don't.
>>>> > I think we need to understand why this is so.
>>>> >
>>>> > -Toby
>>>> >
>>>> > On Thu, Mar 12, 2015 at 3:10 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>> >>
>>>> >> Well, I'm no longer our resident anything expert, merely /a/ 
>>>> >> anything expert :).
>>>> >>
>>>> >> The "concoction", as you put it, comes from the 
>>>> >> webrequest_all_sites data that is consumed by 
>>>> >> stats.wikimedia.org's primary report - I can't speak for how the 
>>>> >> dashboard you're linking to is constructed.
>>>> >> Perhaps you could? I doubt this is a "concoction" problem given 
>>>> >> that, as you will note if you've studied the visualisations, 
>>>> >> both the UDF and the hive query implementation (which were 
>>>> >> written by two different people, and code reviewed by two /more/ 
>>>> >> people) agree that this dramatic, unexplained and untracked drop 
>>>> >> happened. And, since we've been using the hive query 
>>>> >> implementation for all our high-level numbers for about six 
>>>> >> months, a bug of this magnitude in the /implementation/ of the 
>>>> >> definition would be....worrying.
>>>> >>
>>>> >> Indeed, your report says 20B per month (again, is it drawing 
>>>> >> from the same data source as the aggregate, high-level number?) 
>>>> >> - I never claimed 1.1B a day, you did. Instead, it started off 
>>>> >> as approximately 1.1-1.2Bn, before dropping down to between 600m 
>>>> >> and 700m, where it has resided ever since. That sounds, 
>>>> >> averaged, like approximately 0.75B, no? The disadvantage of 
>>>> >> comparing a single monthly number against a more granular dataset.
>>>> >>
>>>> >> On 12 March 2015 at 17:55, Erik Zachte <ezac...@wikimedia.org> wrote:
>>>> >> > I'd rather see you explain this, Oliver, as our incumbent page 
>>>> >> > views expert.
>>>> >> > Your concoction of legacy PV seems to suggest 'Old definition, UDF'
>>>> >> > was about 1.1B per day.
>>>> >> >
>>>> >> > Yet
>>>> >> > http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm
>>>> >> > shows 20B per month, 0.75B per day
>>>> >> >
>>>> >> > Erik
>>>> >> >
>>>> >> > -----Original Message-----
>>>> >> > From: analytics-boun...@lists.wikimedia.org
>>>> >> > [mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of 
>>>> >> > Oliver Keyes
>>>> >> > Sent: Thursday, March 12, 2015 19:38
>>>> >> > To: A mailing list for the Analytics Team at WMF and everybody 
>>>> >> > who has an interest in Wikipedia and analytics.
>>>> >> > Subject: [Analytics] [Technical] final pageviews QA
>>>> >> >
>>>> >> > Hey all,
>>>> >> >
>>>> >> > After the patches to the definition following the previous 
>>>> >> > hand-coding run (see older threads) I've run a second set of 
>>>> >> > tests. These can be seen at 
>>>> >> > https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and
>>>> >> > https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2.png
>>>> >> >
>>>> >> > There's nothing particularly shocking in the new definition; 
>>>> >> > it follows the seasonal pattern that we're used to. I think we 
>>>> >> > can call the new definition done, with these tweaks! It's also 
>>>> >> > not as unstable as the legacy definition (good luck to whoever 
>>>> >> > now has the responsibility of explaining why pageviews 
>>>> >> > abruptly halved in the middle of February).
>>>> >> >
>>>> >> >
>>>> >> > Have fun,
>>>> >> > --
>>>> >> > Oliver Keyes
>>>> >> > Research Analyst
>>>> >> > Wikimedia Foundation
>>>> >> >
>>>> >> > _______________________________________________
>>>> >> > Analytics mailing list
>>>> >> > Analytics@lists.wikimedia.org
>>>> >> > https://lists.wikimedia.org/mailman/listinfo/analytics
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Oliver Keyes
>>>> >> Research Analyst
>>>> >> Wikimedia Foundation
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
>
>



--
Oliver Keyes
Research Analyst
Wikimedia Foundation



