Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll
want the output fired over to limn.

On 15 December 2014 at 19:01, Andrew Otto <ao...@wikimedia.org> wrote:
>
> This needs more testing!  Validation!  Etc.  But woo!
> https://gerrit.wikimedia.org/r/#/c/180023
>
> This let’s you do:
>
>
>
> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
>
> CREATE TEMPORARY FUNCTION is_pageview as
> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
>
>   SELECT
>     LOWER(uri_host) as uri_host,
>     count(*) as pageview_count
>   FROM
>     wmf_raw.webrequest
>   WHERE
>    (webrequest_source = 'text' or webrequest_source = 'mobile')
>     AND year=2014
>     AND month=12
>     AND day=7
>     AND hour=12
>     AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
>   GROUP BY
>     LOWER(uri_host)
>   ORDER BY pageview_count desc
>   LIMIT 10
> ;
>
> …
>
> uri_host         pageview_count
>
> en.wikipedia.org 6613046
> en.m.wikipedia.org 3223273
> ru.wikipedia.org 2119850
> ja.m.wikipedia.org 1501954
> ja.wikipedia.org 1411533
> de.wikipedia.org 1330252
> zh.wikipedia.org 949228
> fr.wikipedia.org 939602
> commons.wikimedia.org 912965
> de.m.wikipedia.org 664661
>
> Time taken: 94.295 seconds, Fetched: 10 row(s)
>
>
>
> On Dec 15, 2014, at 16:02, Dario Taraborelli <dtarabore...@wikimedia.org>
> wrote:
>
> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on
> with the implementation.
>
> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>
> Totally!
>
> On 15 December 2014 at 14:22, Andrew Otto <ao...@wikimedia.org> wrote:
>>
>> Ah cool, didn’t realize there was a neutral definition.  We should call
>> that the ‘formal specification’ then.
>>
>> ...of course, now that I've said that, cosmic irony demands we end up
>> implementing in C, or something.
>>
>> Hm, a UDF that does this rather than a Hive query would probably be
>> better.  E.g.
>>
>>   SELECT
>>     request_qualifier(uri_host),
>>     count(*)
>>   FROM
>>     wmf_raw.webrequest
>>   WHERE
>>     is_pageview(uri_host, uri_path, http_status, content_type)
>>   GROUP BY
>>     request_qualifier(uri_host)
>>   ;
>>
>>
>> Or something like that.
>>
>> -Ao
>>
>>
>>
>>
>>
>>
>> On Dec 15, 2014, at 14:07, Oliver Keyes <oke...@wikimedia.org> wrote:
>>
>> It's totally tech-agnostic; the neutral definition is on meta. The hive
>> query is just because, since we suspect that's how we'll be generating the
>> data, it makes sense to turn the draft def into HQL for exploratory queries
>> and testing.
>>
>> ...of course, now that I've said that, cosmic irony demands we end up
>> implementing in C, or something.
>>
>> On 15 December 2014 at 13:46, Toby Negrin <tneg...@wikimedia.org> wrote:
>>>
>>> I think the hive code is "representative" in that it's an
>>> implementation. It's certainly not the only permitted one.
>>>
>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>>>
>>>  We're moving forward to generate Hive queries that will represent the
>>> formal specification.
>>>
>>> Should a specific implementation (e.g. Hive) represent the formal
>>> specification?  I tend to think it should be tech-agnostic, no?
>>>
>>>
>>>
>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfa...@wikimedia.org>
>>> wrote:
>>>
>>> Toby, that's right.  We're moving forward to generate Hive queries that
>>> will represent the formal specification.
>>>
>>> -Aaron
>>>
>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org>
>>> wrote:
>>>
>>>> We've written the draft Hive queries and I'm reviewing them with Otto
>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work
>>>> it through :).
>>>>
>>>> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org>
>>>> wrote:
>>>>>
>>>>> Hi Aaron, all --
>>>>>
>>>>> I haven't seen any discussion on this which is a sign that we can
>>>>> forward with turning over the draft. Thoughts?
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Toby
>>>>>
>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <
>>>>> ahalfa...@wikimedia.org> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> As discussions on the new page view definition have been calming
>>>>>> down, we're preparing to deliver a draft version to the Devs.  I want to
>>>>>> make sure that we all know the status and that any substantial concerns 
>>>>>> are
>>>>>> raised before we hand things off on *Friday, Dec 12th.*
>>>>>>
>>>>>> For this phase, we are delivering the general filter[1].  This is the
>>>>>> highest level filter, and exists primarily to distinguish requests worthy
>>>>>> of further evaluation. Our plan is to take the definition as it exists on
>>>>>> the 12th, and begin generating high-level aggregate numbers based on it. 
>>>>>> In
>>>>>> future iterations, we will be digging into different breakdowns of this
>>>>>> metric, and iterating on it to handle any inconsistencies or unexpected
>>>>>> results.  There's a few differences from Web Stat Collector's (WSC) 
>>>>>> version
>>>>>> of the general filter that we want to call to your attention to.
>>>>>>
>>>>>>    - We include searches -- WSC explicitly excludes them.
>>>>>>    - We include Apps traffic -- WSC does not detect Apps traffic
>>>>>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/)
>>>>>>    -- WSC hardcodes "/wiki/"
>>>>>>    - We don't include Banner impressions -- WSC includes them.
>>>>>>
>>>>>> There are also some known issues with the new definition that are
>>>>>> worth your notice:
>>>>>>
>>>>>>
>>>>>>    1. *Internal traffic is counted*
>>>>>>
>>>>>>
>>>>>>    - Note that WSC filters some internal traffic by hardcoding a set
>>>>>>    of IPs in the definition.  We are working on parsing puppet templates 
>>>>>> in
>>>>>>    order to automatically detect which IPs represent internal traffic.  
>>>>>> This
>>>>>>    will be a /better/ solution, but it's not quite ready yet because 
>>>>>> parsing
>>>>>>    puppet is hard.
>>>>>>
>>>>>>
>>>>>>    1. *Spider traffic is counted*
>>>>>>
>>>>>>
>>>>>>    - We will be using the User-agent field to detect and flag
>>>>>>    spider-based traffic.  This "tag definition" will be delivered in a
>>>>>>    subsequent definition.  This actually matches WSC, which does not 
>>>>>> filter
>>>>>>    spider for the high-level metrics.
>>>>>>
>>>>>> These are problems we're aware of, and will be factoring in as we go
>>>>>> forward with our next task: refining the definition using real,
>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>>> Christian.
>>>>>>
>>>>>> 1.
>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>
>>>>>> -Aaron & Oliver
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> Analytics@lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> Analytics@lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>  _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>  _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to