Toby, that's right.  We're moving forward to generate Hive queries that
will represent the formal specification.

-Aaron

On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org> wrote:

> We've written the draft Hive queries and I'm reviewing them with Otto now.
> Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it
> through :).
>
> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org> wrote:
>>
>> Hi Aaron, all --
>>
>> I haven't seen any discussion on this, which is a sign that we can move
>> forward with turning over the draft. Thoughts?
>>
>> thanks,
>>
>> -Toby
>>
>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfa...@wikimedia.org>
>> wrote:
>>
>>> Hey folks,
>>>
>>> As discussions on the new page view definition have been calming down,
>>> we're preparing to deliver a draft version to the Devs.  I want to make
>>> sure that we all know the status and that any substantial concerns are
>>> raised before we hand things off on *Friday, Dec 12th.*
>>>
>>> For this phase, we are delivering the general filter[1].  This is the
>>> highest level filter, and exists primarily to distinguish requests worthy
>>> of further evaluation. Our plan is to take the definition as it exists on
>>> the 12th, and begin generating high-level aggregate numbers based on it. In
>>> future iterations, we will be digging into different breakdowns of this
>>> metric, and iterating on it to handle any inconsistencies or unexpected
>>> results.  There are a few differences from Web Stat Collector's (WSC) version
>>> of the general filter that we want to call to your attention.
>>>
>>>    - We include searches -- WSC explicitly excludes them.
>>>    - We include Apps traffic -- WSC does not detect Apps traffic.
>>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) --
>>>    WSC hardcodes "/wiki/"
>>>    - We don't include Banner impressions -- WSC includes them.
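[Illustration, not part of the original message] The path rule in the list above (accept variants of /wiki/ rather than hardcoding "/wiki/") could be sketched roughly as follows. This is a minimal sketch under assumptions: the function name and the prefix set are hypothetical, the set contains only the three variants named in the email, and the check covers only the path prefix, not searches, Apps traffic, or banner exclusion. The actual rules are in the definition linked as [1].

```python
from urllib.parse import urlsplit

# Hypothetical prefix set: "wiki" plus the three variants named in the email.
# The real definition may list more variants or use a different mechanism.
ARTICLE_PREFIXES = {"wiki", "zh-tw", "zh-cn", "sr-ec"}


def has_article_prefix(url: str) -> bool:
    """Return True if the URL path looks like /<prefix>/<title>."""
    path = urlsplit(url).path
    # "/zh-tw/Foo" splits into ["", "zh-tw", "Foo"].
    parts = path.split("/")
    return (
        len(parts) >= 3
        and parts[0] == ""
        and parts[1] in ARTICLE_PREFIXES
        and parts[2] != ""  # require a non-empty title after the prefix
    )
```

A WSC-style check would correspond to a prefix set containing only "wiki", which is why variant paths such as /zh-tw/ are missed there.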
>>>
>>> There are also some known issues with the new definition that are worth
>>> your notice:
>>>
>>>
>>>    1. *Internal traffic is counted*
>>>
>>>    - Note that WSC filters some internal traffic by hardcoding a set of
>>>    IPs in the definition.  We are working on parsing puppet templates in
>>>    order to automatically detect which IPs represent internal traffic.
>>>    This will be a /better/ solution, but it's not quite ready yet because
>>>    parsing puppet is hard.
>>>
>>>    2. *Spider traffic is counted*
>>>
>>>    - We will be using the User-agent field to detect and flag
>>>    spider-based traffic.  This "tag definition" will be delivered in a
>>>    subsequent definition.  This actually matches WSC, which does not filter
>>>    spiders for the high-level metrics.
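[Illustration, not part of the original message] The two known issues above both come down to tagging a request rather than dropping it: once by source IP, once by User-agent. A minimal sketch of that idea, assuming made-up inputs: the network ranges below are reserved documentation ranges standing in for whatever the puppet templates would yield, the spider pattern is a placeholder, and `tag_request` is a hypothetical name, not anything from the definition.

```python
import ipaddress
import re

# Placeholder ranges (reserved documentation networks), standing in for the
# internal IPs that would be extracted from puppet templates.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Placeholder pattern; the real "tag definition" was still to be delivered.
SPIDER_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def tag_request(ip: str, user_agent: str) -> set:
    """Return the set of tags that apply to one request."""
    tags = set()
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in INTERNAL_NETWORKS):
        tags.add("internal")
    if SPIDER_PATTERN.search(user_agent or ""):
        tags.add("spider")
    return tags
```

Returning tags instead of a yes/no answer keeps the high-level count unchanged while still allowing later breakdowns with or without the tagged traffic.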
>>>
>>> These are problems we're aware of, and will be factoring in as we go
>>> forward with our next task: refining the definition using real,
>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>> participated in the process thus far, particularly Nemo, Erik, and
>>> Christian.
>>>
>>> 1.
>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>
>>> -Aaron & Oliver
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>