BTW, Christian foresaw this issue and wrote this:
https://github.com/wikimedia/analytics-refinery-source/tree/master/guard

It should be useable for pageviews too, I think.  For this issue, a guard that 
made sure that outreach.wikimedia.org never appeared would have raised an error.
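
To make the idea concrete, here is a rough Python sketch of that kind of check. It is not the code in the guard directory linked above; the row layout (`uri_host`, `is_pageview`), the function name, and the exception name are all made up for illustration.

```python
# Hypothetical sketch only: this is NOT the refinery guard code.
# Assumption: each row is a dict with 'uri_host' and 'is_pageview' keys.

# Hosts the pageview definition is meant to exclude (per this thread).
EXCLUDED_HOSTS = frozenset({"outreach.wikimedia.org"})


class GuardFailure(Exception):
    """Raised when a host that should be excluded is counted as a pageview."""


def check_excluded_hosts(rows, excluded=EXCLUDED_HOSTS):
    """Return True if no excluded host is counted as a pageview.

    Raises GuardFailure listing the offending hosts otherwise.
    """
    offenders = sorted({
        row["uri_host"].lower()
        for row in rows
        if row["is_pageview"] and row["uri_host"].lower() in excluded
    })
    if offenders:
        raise GuardFailure(
            "excluded hosts counted as pageviews: " + ", ".join(offenders)
        )
    return True
```

Run against a sample of each hour's output, a check like this would presumably have failed as soon as the outreach patch went in, rather than waiting for somebody to notice.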





> On Aug 17, 2015, at 14:45, Oliver Keyes <oke...@wikimedia.org> wrote:
> 
> On 17 August 2015 at 13:48, Joseph Allemandou <jalleman...@wikimedia.org> 
> wrote:
>> Hey Oliver,
>> 
>> The analytics team is responsible for the pageview definition.
>> When finding issues, sending an email to the analytics mailing list is the
>> right thing to do :)
>> 
> 
> Indeed; my point is not about issues reported upstream. My point is
> that there appears to currently be absolutely no work done to take
> this (org-level, highest possible priority) KPI and evaluate it every
> month or every N days to make sure that, even with the gradual
> accretion of changes to the input data, it is still extracting what we
> want. It is down to user-reported issues. The problem with this
> approach is that after 90 days it is impossible to rerun the data; if
> there is a bug breaking the logs, and it takes more than 90 days to
> discover it, those logs are simply broken.
> 
> In addition, discovering these issues requires a very granular
> understanding of what the pageviews logs are meant to be capturing
> that most customers simply will not have. It worked in this case
> primarily because the customer actually /wrote/ the definition ;p.
> 
> For public transparency: Joseph and I talked on IRC and will be
> working on ways to validate data and detect these kinds of regressions
> in advance.
> 
>> On our end, we could surely do a better job to communicate changes in the
>> pageview definition code for anybody interested to review/comment/ask for
>> documentation.
>> Emails have been sent regularly about updates on the analytics list, except
>> in the past few months.
>> We shall get back to that good habit and send notifications with
>> explanations of the changes.
>> 
>> Joseph
>> 
>> 
>> 
>> 
>> On Mon, Aug 17, 2015 at 5:15 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>> 
>>> You should also note that donate-wiki pageviews are making it into the
>>> counts (again, the definition was designed to exclude these).
>>> 
>>> Whose job is it to review pageviews and update the definition when
>>> issues are found?
>>> 
>>> On 17 August 2015 at 10:32, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>> Just to clarify; there is no need to ask me before making changes
>>>> (obviously I find my approval for pageviews changes being sought
>>>> incredibly flattering, but I am not the only person involved in this
>>>> project ;p). What I'm more driving towards is directly informing
>>>> customers when the definition is adapted.
>>>> 
>>>> On 17 August 2015 at 10:31, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>>> Excellent; thank you.
>>>>> 
>>>>> On 17 August 2015 at 04:42, Joseph Allemandou
>>>>> <jalleman...@wikimedia.org> wrote:
>>>>>> Oliver,
>>>>>> 
>>>>>> It was a mistake on my part to add the 'outreach' subdomain without
>>>>>> asking you.
>>>>>> 
>>>>>> From a documentation perspective, the analytics team uses
>>>>>> https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest to
>>>>>> document changes, and I didn't know about the up-to-date
>>>>>> documentation you sent.
>>>>>> 
>>>>>> Tickets have been created to both correct the bug and update the
>>>>>> documentation pages.
>>>>>> 
>>>>>> Joseph
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Aug 16, 2015 at 8:47 PM, Oliver Keyes <oke...@wikimedia.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Ah, I see the problem; someone patched it and never documented it.
>>>>>>> 
>>>>>>> We have documentation at
>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>> of the generalised filters. There is also a log, on
>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view, of changes to the
>>>>>>> pageview definition.
>>>>>>> 
>>>>>>> The intent behind both the transparent definition and the log is to
>>>>>>> ensure that we know what is going /in/ the definition.
>>>>>>> 
>>>>>>> In this case, somebody has patched the definition
>>>>>>> (https://github.com/wikimedia/analytics-refinery-source/commit/cc0b6ed7e4f403eaa82235ec6a0f27152b0c2710)
>>>>>>> to include traffic from outreach.wikimedia.org - a site that was very
>>>>>>> deliberately and very explicitly excluded from the definition as it
>>>>>>> was written.
>>>>>>> 
>>>>>>> There is no explanation of why this change was made, there is no
>>>>>>> documentation of this change even existing outside the actual Java...
>>>>>>> Can someone please explain what this is for, and update all the
>>>>>>> documentation to reflect that? And then could people be very, very
>>>>>>> clear in future that it is expected there be a log of alterations you
>>>>>>> make to high-level KPIs beyond the, you know, commit logs.
>>>>>>> 
>>>>>>> On 16 August 2015 at 14:32, Madhumitha Viswanathan
>>>>>>> <mviswanat...@wikimedia.org> wrote:
>>>>>>>> The new one.
>>>>>>>> 
>>>>>>>> The code that generates it:
>>>>>>>> 
>>>>>>>> - https://github.com/wikimedia/analytics-refinery/blob/master/hive/pageview/hourly/create_pageview_hourly_table.hql
>>>>>>>> - https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview/hourly
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Aug 16, 2015 at 11:01 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>>>>>>> 
>>>>>>>>> Is the pageviews_hourly table meant to contain pageviews according to
>>>>>>>>> the new or old definition? If old, where can I find aggregates for the
>>>>>>>>> new one?
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Oliver Keyes
>>>>>>>>> Count Logula
>>>>>>>>> Wikimedia Foundation
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> Analytics@lists.wikimedia.org
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> --Madhu :)
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Joseph Allemandou
>>>>>> Data Engineer @ Wikimedia Foundation
>>>>>> IRC: joal
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
