BTW, Christian foresaw this issue and wrote this: https://github.com/wikimedia/analytics-refinery-source/tree/master/guard
It should be usable for pageviews too, I think. For this issue, a guard that made sure that outreach.wikimedia.org never appeared would have raised an error.

> On Aug 17, 2015, at 14:45, Oliver Keyes <oke...@wikimedia.org> wrote:
>
> On 17 August 2015 at 13:48, Joseph Allemandou <jalleman...@wikimedia.org>
> wrote:
>> Hey Oliver,
>>
>> The analytics team is responsible for the pageview definition.
>> When finding issues, sending an email to the analytics mailing list is the
>> right thing to do :)
>>
>
> Indeed; my point is not about issues reported upstream. My point is
> that there appears to currently be absolutely no work done to take
> this (org-level, highest possible priority) KPI and evaluate it every
> month or every N days to make sure that, even with the gradual
> accretion of changes to the input data, it is still extracting what we
> want. It is down to user-reported issues. The problem with this
> approach is that after 90 days it is impossible to rerun the data; if
> there is a bug breaking the logs, and it takes more than 90 days to
> discover it, those logs are simply broken.
>
> In addition, discovering these issues requires a very granular
> understanding of what the pageviews logs are meant to be capturing
> that most customers simply will not have. It worked in this case
> primarily because the customer actually /wrote/ the definition ;p.
>
> For public transparency: Joseph and I talked on IRC and will be
> working on ways to validate data and detect these kinds of regressions
> in advance.
>
>> On our end, we could surely do a better job to communicate changes in the
>> pageview definition code for anybody interested to review/comment/ask for
>> documentation.
>> Emails have been sent regularly about updates on the analytics list, except
>> in the past few months.
>> We shall get back to that good habit and send notifications with
>> explanations of the changes.
>>
>> Joseph
>>
>> On Mon, Aug 17, 2015 at 5:15 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>
>>> You should also note that donate-wiki pageviews are making it into the
>>> counts (again, the definition was designed to exclude these).
>>>
>>> Whose job is it to review pageviews and update the definition when
>>> issues are found?
>>>
>>> On 17 August 2015 at 10:32, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>> Just to clarify; there is no need to ask me before making changes
>>>> (obviously I find my approval for pageviews changes being sought
>>>> incredibly flattering, but I am not the only person involved in this
>>>> project ;p). What I'm more driving towards is directly informing
>>>> customers when the definition is adapted.
>>>>
>>>> On 17 August 2015 at 10:31, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>>> Excellent; thank you.
>>>>>
>>>>> On 17 August 2015 at 04:42, Joseph Allemandou
>>>>> <jalleman...@wikimedia.org> wrote:
>>>>>> Oliver,
>>>>>>
>>>>>> It was a mistake from me to add the 'outreach' subdomain without
>>>>>> asking you.
>>>>>>
>>>>>> From a documentation perspective, the analytics team uses that place to
>>>>>> document changes:
>>>>>> https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest and I
>>>>>> didn't know about up-to-date documentation you sent.
>>>>>>
>>>>>> Tickets have been created to both correct the bug and update the
>>>>>> documentation pages.
>>>>>>
>>>>>> Joseph
>>>>>>
>>>>>> On Sun, Aug 16, 2015 at 8:47 PM, Oliver Keyes <oke...@wikimedia.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> Ah, I see the problem; someone patched it and never documented it.
>>>>>>>
>>>>>>> We have documentation at
>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>> of the generalised filters. There is also a log, on
>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view, of changes to the
>>>>>>> pageview definition.
>>>>>>>
>>>>>>> The intent behind both the transparent definition and the log is to
>>>>>>> ensure that we know what is going /in/ the definition.
>>>>>>>
>>>>>>> In this case, somebody has patched the definition
>>>>>>> (https://github.com/wikimedia/analytics-refinery-source/commit/cc0b6ed7e4f403eaa82235ec6a0f27152b0c2710)
>>>>>>> to include traffic from outreach.wikimedia.org - a site that was very
>>>>>>> deliberately and very explicitly excluded from the definition as it
>>>>>>> was written.
>>>>>>>
>>>>>>> There is no explanation of why this change was made, there is no
>>>>>>> documentation of this change even existing outside the actual Java....
>>>>>>> can someone please explain what this is for, and update all the
>>>>>>> documentation to reflect that? And then could people be very, very
>>>>>>> clear in future that it is expected there be a log of alterations you
>>>>>>> make to high-level KPIs beyond the, you know, commit logs.
>>>>>>>
>>>>>>> On 16 August 2015 at 14:32, Madhumitha Viswanathan
>>>>>>> <mviswanat...@wikimedia.org> wrote:
>>>>>>>> The new one.
>>>>>>>>
>>>>>>>> The code that generates it -
>>>>>>>>
>>>>>>>> - https://github.com/wikimedia/analytics-refinery/blob/master/hive/pageview/hourly/create_pageview_hourly_table.hql
>>>>>>>> - https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview/hourly
>>>>>>>>
>>>>>>>> On Sun, Aug 16, 2015 at 11:01 AM, Oliver Keyes <oke...@wikimedia.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Is the pageviews_hourly table meant to contain pageviews according to
>>>>>>>>> the new or old definition? If old, where can I find aggregates for the
>>>>>>>>> new one?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Oliver Keyes
>>>>>>>>> Count Logula
>>>>>>>>> Wikimedia Foundation
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> Analytics@lists.wikimedia.org
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>>
>>>>>>>> --
>>>>>>>> --Madhu :)
>>>>>>
>>>>>> --
>>>>>> Joseph Allemandou
>>>>>> Data Engineer @ Wikimedia Foundation
>>>>>> IRC: joal
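To make the guard idea from the top of this message concrete: a check of this kind would scan aggregated pageview rows and fail loudly if a host that the definition excludes shows up. The following is only a minimal sketch under stated assumptions and is not the code in the refinery `guard` directory. The function names, the `(host, view_count)` row layout, and the contents of `FORBIDDEN_HOSTS` are all hypothetical, chosen for illustration.

```python
from collections import Counter

# Hypothetical list of hosts the pageview definition is meant to exclude.
# outreach.wikimedia.org is the host discussed in this thread.
FORBIDDEN_HOSTS = frozenset({"outreach.wikimedia.org"})


def find_forbidden_hosts(rows, forbidden=FORBIDDEN_HOSTS):
    """Sum view counts per forbidden host found in rows.

    rows is an iterable of (host, view_count) pairs; this layout is an
    assumption, not the actual schema of the pageview tables.
    """
    hits = Counter()
    for host, views in rows:
        normalized = host.lower()  # host names are case-insensitive
        if normalized in forbidden:
            hits[normalized] += views
    return hits


def guard(rows, forbidden=FORBIDDEN_HOSTS):
    """Raise ValueError if any forbidden host appears in rows."""
    hits = find_forbidden_hosts(rows, forbidden)
    if hits:
        raise ValueError(f"forbidden hosts counted as pageviews: {dict(hits)}")
```

Run on each hourly aggregate, a check like this would report the regression as soon as it is introduced, well inside the 90-day window in which the data can still be rerun.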