Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with 
the implementation.

> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
> 
> Totally!
> 
> On 15 December 2014 at 14:22, Andrew Otto <ao...@wikimedia.org 
> <mailto:ao...@wikimedia.org>> wrote:
> Ah cool, didn’t realize there was a neutral definition.  We should call that 
> the ‘formal specification’ then.
> 
>> ...of course, now that I've said that, cosmic irony demands we end up 
>> implementing in C, or something.
> Hm, a UDF that does this rather than a Hive query would probably be better.  
> E.g.
> 
>   SELECT
>     request_qualifier(uri_host),
>     count(*)
>   FROM
>     wmf_raw.webrequest
>   WHERE
>     is_pageview(uri_host, uri_path, http_status, content_type)
>   GROUP BY
>     request_qualifier(uri_host)
>   ;
> 
> 
> Or something like that.
> 
> -Ao
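
For anyone who wants to try the shape of that query outside Hive, here is a minimal Python sketch of the same filter-then-group-by-count logic. It is only an illustration: `count_pageviews`, `toy_is_pageview` and `toy_qualifier` are hypothetical stand-ins, not the real UDFs, and the rules inside them are assumptions rather than the definition on meta.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping

Request = Mapping[str, str]


def count_pageviews(
    requests: Iterable[Request],
    is_pageview: Callable[[Request], bool],
    request_qualifier: Callable[[str], str],
) -> Counter:
    """Count requests that pass is_pageview, grouped by request_qualifier(uri_host)."""
    counts: Counter = Counter()
    for req in requests:
        if is_pageview(req):  # plays the role of the WHERE clause
            counts[request_qualifier(req["uri_host"])] += 1  # GROUP BY + count(*)
    return counts


def toy_is_pageview(req: Request) -> bool:
    # Assumed rules for illustration only; the real filter is defined on meta.
    return (
        req["http_status"] == "200"
        and req["content_type"].startswith("text/html")
        and req["uri_path"].startswith("/wiki/")
    )


def toy_qualifier(uri_host: str) -> str:
    # Assumed grouping: split mobile hosts from everything else.
    return "mobile" if ".m." in uri_host else "desktop"


sample = [
    {"uri_host": "en.wikipedia.org", "uri_path": "/wiki/Cat",
     "http_status": "200", "content_type": "text/html; charset=UTF-8"},
    {"uri_host": "en.m.wikipedia.org", "uri_path": "/wiki/Dog",
     "http_status": "200", "content_type": "text/html; charset=UTF-8"},
    {"uri_host": "en.wikipedia.org", "uri_path": "/w/load.php",
     "http_status": "200", "content_type": "text/css"},
]
result = count_pageviews(sample, toy_is_pageview, toy_qualifier)
```

Passing the predicate and the qualifier in as functions mirrors the UDF idea: the aggregation stays the same while the definition behind `is_pageview` can change.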
> 
>> On Dec 15, 2014, at 14:07, Oliver Keyes <oke...@wikimedia.org 
>> <mailto:oke...@wikimedia.org>> wrote:
>> 
>> It's totally tech-agnostic; the neutral definition is on meta. The hive 
>> query is just because, since we suspect that's how we'll be generating the 
>> data, it makes sense to turn the draft def into HQL for exploratory queries 
>> and testing.
>> 
>> ...of course, now that I've said that, cosmic irony demands we end up 
>> implementing in C, or something.
>> 
>> On 15 December 2014 at 13:46, Toby Negrin <tneg...@wikimedia.org 
>> <mailto:tneg...@wikimedia.org>> wrote:
>> I think the hive code is "representative" in that it's an implementation. 
>> It's certainly not the only permitted one. 
>> 
>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <ao...@wikimedia.org 
>> <mailto:ao...@wikimedia.org>> wrote:
>> 
>>>>  We're moving forward to generate Hive queries that will represent the 
>>>> formal specification.
>>> Should a specific implementation (e.g. Hive) represent the formal 
>>> specification?  I tend to think it should be tech-agnostic, no?
>>> 
>>> 
>>> 
>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfa...@wikimedia.org 
>>>> <mailto:ahalfa...@wikimedia.org>> wrote:
>>>> 
>>>> Toby, that's right.  We're moving forward to generate Hive queries that 
>>>> will represent the formal specification.  
>>>> 
>>>> -Aaron
>>>> 
>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org 
>>>> <mailto:oke...@wikimedia.org>> wrote:
>>>> We've written the draft Hive queries and I'm reviewing them with Otto now. 
>>>> Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it 
>>>> through :).
>>>> 
>>>> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org 
>>>> <mailto:tneg...@wikimedia.org>> wrote:
>>>> Hi Aaron, all --
>>>> 
>>>> I haven't seen any discussion on this, which is a sign that we can move 
>>>> forward with turning over the draft. Thoughts?
>>>> 
>>>> thanks,
>>>> 
>>>> -Toby
>>>> 
>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfa...@wikimedia.org 
>>>> <mailto:ahalfa...@wikimedia.org>> wrote:
>>>> Hey folks,
>>>> 
>>>> As discussions on the new page view definition have been calming down, 
>>>> we're preparing to deliver a draft version to the Devs.  I want to make 
>>>> sure that we all know the status and that any substantial concerns are 
>>>> raised before we hand things off on Friday, Dec 12th.
>>>> 
>>>> For this phase, we are delivering the general filter[1].  This is the 
>>>> highest level filter, and exists primarily to distinguish requests worthy 
>>>> of further evaluation. Our plan is to take the definition as it exists on 
>>>> the 12th, and begin generating high-level aggregate numbers based on it. 
>>>> In future iterations, we will be digging into different breakdowns of this 
>>>> metric, and iterating on it to handle any inconsistencies or unexpected 
>>>> results.  There are a few differences from Web Stat Collector's (WSC) 
>>>> version of the general filter that we want to call to your attention:
>>>> * We include searches -- WSC explicitly excludes them.
>>>> * We include Apps traffic -- WSC does not detect Apps traffic.
>>>> * We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC 
>>>>   hardcodes "/wiki/".
>>>> * We don't include Banner impressions -- WSC includes them.
>>>> There are also some known issues with the new definition that are worth 
>>>> your notice:
>>>> 
>>>> * Internal traffic is counted
>>>>   Note that WSC filters some internal traffic by hardcoding a set of IPs 
>>>>   in the definition.  We are working on parsing puppet templates in order 
>>>>   to automatically detect which IPs represent internal traffic.  This will 
>>>>   be a /better/ solution, but it's not quite ready yet because parsing 
>>>>   puppet is hard.
>>>> * Spider traffic is counted
>>>>   We will be using the User-agent field to detect and flag spider-based 
>>>>   traffic.  This "tag definition" will be delivered in a subsequent 
>>>>   iteration.  This actually matches WSC, which does not filter spiders 
>>>>   for the high-level metrics.
>>>> 
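
As a rough illustration of the two known issues above, here is a Python sketch that tags a request as internal or spider instead of dropping it. Everything specific in it is an assumption: `INTERNAL_NETWORKS` is a placeholder range (the real IPs would come from the puppet templates), the user-agent pattern is a guess, and `tag_request` is a hypothetical name.

```python
import ipaddress
import re

# Placeholder only; the real internal ranges are not listed in this thread.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

# Assumed pattern; the actual spider "tag definition" is still to be delivered.
SPIDER_UA = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def tag_request(ip: str, user_agent: str) -> set:
    """Return the set of tags that apply to a request; nothing is filtered out."""
    tags = set()
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in INTERNAL_NETWORKS):
        tags.add("internal")
    if SPIDER_UA.search(user_agent):
        tags.add("spider")
    return tags


internal_tags = tag_request("10.1.2.3", "Mozilla/5.0")
spider_tags = tag_request("93.184.216.34", "Googlebot/2.1")
plain_tags = tag_request("93.184.216.34", "Mozilla/5.0")
```

Tagging rather than filtering keeps the high-level count unchanged, which matches the note that these requests are still counted for now.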
>>>> These are problems we're aware of, and will be factoring in as we go 
>>>> forward with our next task: refining the definition using real, 
>>>> hourly-level traffic data. Thanks to everyone who has given feedback and 
>>>> participated in the process thus far, particularly Nemo, Erik, and 
>>>> Christian.
>>>> 
>>>> 1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters 
>>>> <https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters>
>>>> 
>>>> -Aaron & Oliver
>>>> 
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org>
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>> 
>>>> -- 
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation