That might be an email copy/paste problem; I see the non-symmetrical quotes in
my email.
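
For anyone pasting the statement out of a mail client, here is a minimal sketch of stripping the typographic quotes before the text reaches Hive. It is plain Python, not part of refinery, and the helper name `straighten_quotes` is made up for illustration.

```python
# Hypothetical helper (not part of refinery): map typographic quotes,
# as inserted by some mail clients, back to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(statement: str) -> str:
    """Return the statement with curly quotes replaced by straight ones."""
    return statement.translate(QUOTE_MAP)


# The statement as it arrived by mail: straight opening quote, curly closing quote.
pasted = (
    "CREATE TEMPORARY FUNCTION is_pageview as "
    "'org.wikimedia.analytics.refinery.hive.IsPageviewUDF\u2019;"
)
fixed = straighten_quotes(pasted)
print(fixed)
```

Note that this replaces every curly quote, including any that legitimately sit inside a string literal, so it is only meant for cleaning up short pasted statements like the one above.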


> On Dec 16, 2014, at 09:38, Toby Negrin <tneg...@wikimedia.org> wrote:
> 
> Note that in Oliver's example, the quotes are double quotes, not single 
> quotes. I didn't see the difference immediately.
> 
> -Toby
> 
> On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
> Note that Andrew's example code doesn't run (at least, for me) because it 
> needs to be:
> 
> CREATE TEMPORARY FUNCTION is_pageview as 
> "org.wikimedia.analytics.refinery.hive.IsPageviewUDF";
> 
> Hive gets stupider every time I try to use it ;p
> 
> On 15 December 2014 at 20:47, Oliver Keyes <oke...@wikimedia.org> wrote:
> Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll want 
> the output fired over to limn.
> 
> On 15 December 2014 at 19:01, Andrew Otto <ao...@wikimedia.org> wrote:
> This needs more testing!  Validation!  Etc.  But woo!
> https://gerrit.wikimedia.org/r/#/c/180023
> 
> This lets you do:
> 
> 
> 
> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
> 
> CREATE TEMPORARY FUNCTION is_pageview as 
> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
> 
>   SELECT
>     LOWER(uri_host) AS uri_host,
>     COUNT(*) AS pageview_count
>   FROM
>     wmf_raw.webrequest
>   WHERE
>     (webrequest_source = 'text' OR webrequest_source = 'mobile')
>     AND year = 2014
>     AND month = 12
>     AND day = 7
>     AND hour = 12
>     AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
>   GROUP BY
>     LOWER(uri_host)
>   ORDER BY pageview_count DESC
>   LIMIT 10
> ;
> 
> …
> 
> uri_host                pageview_count
> en.wikipedia.org        6613046
> en.m.wikipedia.org      3223273
> ru.wikipedia.org        2119850
> ja.m.wikipedia.org      1501954
> ja.wikipedia.org        1411533
> de.wikipedia.org        1330252
> zh.wikipedia.org        949228
> fr.wikipedia.org        939602
> commons.wikimedia.org   912965
> de.m.wikipedia.org      664661
> 
> Time taken: 94.295 seconds, Fetched: 10 row(s)
> 
> 
> 
>> On Dec 15, 2014, at 16:02, Dario Taraborelli <dtarabore...@wikimedia.org> wrote:
>> 
>> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on 
>> with the implementation.
>> 
>>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>> 
>>> Totally!
>>> 
>>> On 15 December 2014 at 14:22, Andrew Otto <ao...@wikimedia.org> wrote:
>>> Ah cool, didn’t realize there was a neutral definition.  We should call 
>>> that the ‘formal specification’ then.
>>> 
>>>> ...of course, now that I've said that, cosmic irony demands we end up 
>>>> implementing in C, or something.
>>> Hm, a UDF that does this rather than a Hive query would probably be better. 
>>>  E.g.
>>> 
>>>   SELECT
>>>     request_qualifier(uri_host),
>>>     count(*)
>>>   FROM
>>>     wmf_raw.webrequest
>>>   WHERE
>>>     is_pageview(uri_host, uri_path, http_status, content_type)
>>>   GROUP BY
>>>     request_qualifier(uri_host)
>>>   ;
>>> 
>>> 
>>> Or something like that.
>>> 
>>> -Ao
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Dec 15, 2014, at 14:07, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>> 
>>>> It's totally tech-agnostic; the neutral definition is on meta. The Hive 
>>>> query exists only because we suspect that's how we'll be generating the 
>>>> data, so it makes sense to turn the draft def into HQL for exploratory 
>>>> queries and testing.
>>>> 
>>>> ...of course, now that I've said that, cosmic irony demands we end up 
>>>> implementing in C, or something.
>>>> 
>>>> On 15 December 2014 at 13:46, Toby Negrin <tneg...@wikimedia.org> wrote:
>>>> I think the Hive code is "representative" in that it's an implementation. 
>>>> It's certainly not the only permitted one. 
>>>> 
>>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>>>> 
>>>>>>  We're moving forward to generate Hive queries that will represent the 
>>>>>> formal specification.
>>>>> Should a specific implementation (e.g. Hive) represent the formal 
>>>>> specification?  I tend to think it should be tech-agnostic, no?
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
>>>>>> 
>>>>>> Toby, that's right.  We're moving forward to generate Hive queries that 
>>>>>> will represent the formal specification.  
>>>>>> 
>>>>>> -Aaron
>>>>>> 
>>>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>>>> We've written the draft Hive queries and I'm reviewing them with Otto 
>>>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll 
>>>>>> work it through :).
>>>>>> 
>>>>>> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org> wrote:
>>>>>> Hi Aaron, all --
>>>>>> 
>>>>>> I haven't seen any discussion on this, which is a sign that we can 
>>>>>> move forward with turning over the draft. Thoughts?
>>>>>> 
>>>>>> thanks,
>>>>>> 
>>>>>> -Toby
>>>>>> 
>>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
>>>>>> Hey folks,
>>>>>> 
>>>>>> As discussions on the new page view definition have been calming down, 
>>>>>> we're preparing to deliver a draft version to the Devs.  I want to make 
>>>>>> sure that we all know the status and that any substantial concerns are 
>>>>>> raised before we hand things off on Friday, Dec 12th.
>>>>>> 
>>>>>> For this phase, we are delivering the general filter[1].  This is the 
>>>>>> highest level filter, and exists primarily to distinguish requests 
>>>>>> worthy of further evaluation. Our plan is to take the definition as it 
>>>>>> exists on the 12th, and begin generating high-level aggregate numbers 
>>>>>> based on it. In future iterations, we will be digging into different 
>>>>>> breakdowns of this metric, and iterating on it to handle any 
>>>>>> inconsistencies or unexpected results.  There are a few differences from 
>>>>>> Web Stat Collector's (WSC) version of the general filter that we want to 
>>>>>> call to your attention:
>>>>>> 
>>>>>> * We include searches -- WSC explicitly excludes them.
>>>>>> * We include Apps traffic -- WSC does not detect Apps traffic.
>>>>>> * We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC 
>>>>>>   hardcodes "/wiki/".
>>>>>> * We don't include Banner impressions -- WSC includes them.
>>>>>> 
>>>>>> There are also some known issues with the new definition that are worth 
>>>>>> your notice:
>>>>>> 
>>>>>> * Internal traffic is counted. Note that WSC filters some internal 
>>>>>>   traffic by hardcoding a set of IPs in the definition.  We are working 
>>>>>>   on parsing puppet templates in order to automatically detect which IPs 
>>>>>>   represent internal traffic.  This will be a /better/ solution, but it's 
>>>>>>   not quite ready yet because parsing puppet is hard.
>>>>>> * Spider traffic is counted. We will be using the User-agent field to 
>>>>>>   detect and flag spider-based traffic.  This "tag definition" will be 
>>>>>>   delivered in a subsequent definition.  This actually matches WSC, 
>>>>>>   which does not filter spiders for the high-level metrics.
>>>>>> 
>>>>>> These are problems we're aware of, and will be factoring in as we go 
>>>>>> forward with our next task: refining the definition using real, 
>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and 
>>>>>> participated in the process thus far, particularly Nemo, Erik, and 
>>>>>> Christian.
>>>>>> 
>>>>>> 1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>> 
>>>>>> -Aaron & Oliver
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> Analytics@lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>> 
>>>>>> -- 
>>>>>> Oliver Keyes
>>>>>> Research Analyst
>>>>>> Wikimedia Foundation