That might be a email copy/paste problem; I see the non-symetrical quotes in my email.
> On Dec 16, 2014, at 09:38, Toby Negrin <tneg...@wikimedia.org> wrote: > > Note that in Oliver's example, the quotes are double quotes, not single > quotes. I didn't see the difference immediately. > > -Toby > > On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes <oke...@wikimedia.org > <mailto:oke...@wikimedia.org>> wrote: > Note that Andrew's example code doesn't run (at least, for me) because it > needs to be: > > CREATE TEMPORARY FUNCTION is_pageview as > "org.wikimedia.analytics.refinery.hive.IsPageviewUDF"; > > Hive gets stupider every time I try to use it ;p > > On 15 December 2014 at 20:47, Oliver Keyes <oke...@wikimedia.org > <mailto:oke...@wikimedia.org>> wrote: > Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll want > the output fired over to limn. > > On 15 December 2014 at 19:01, Andrew Otto <ao...@wikimedia.org > <mailto:ao...@wikimedia.org>> wrote: > This needs more testing! Validation! Etc. But woo! > https://gerrit.wikimedia.org/r/#/c/180023 > <https://gerrit.wikimedia.org/r/#/c/180023> > > This let’s you do: > > > > ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar; > > CREATE TEMPORARY FUNCTION is_pageview as > 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’; > > SELECT > LOWER(uri_host) as uri_host, > count(*) as pageview_count > FROM > wmf_raw.webrequest > WHERE > (webrequest_source = 'text' or webrequest_source = 'mobile') > AND year=2014 > AND month=12 > AND day=7 > AND hour=12 > AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type) > GROUP BY > LOWER(uri_host) > ORDER BY pageview_count desc > LIMIT 10 > ; > > … > > uri_host pageview_count > > en.wikipedia.org <http://en.wikipedia.org/> 6613046 > en.m.wikipedia.org <http://en.m.wikipedia.org/> 3223273 > ru.wikipedia.org <http://ru.wikipedia.org/> 2119850 > ja.m.wikipedia.org <http://ja.m.wikipedia.org/> 1501954 > ja.wikipedia.org <http://ja.wikipedia.org/> 1411533 > de.wikipedia.org <http://de.wikipedia.org/> 1330252 > zh.wikipedia.org <http://zh.wikipedia.org/> 949228 > fr.wikipedia.org <http://fr.wikipedia.org/> 939602 > commons.wikimedia.org <http://commons.wikimedia.org/> 912965 > de.m.wikipedia.org <http://de.m.wikipedia.org/> 664661 > > Time taken: 94.295 seconds, Fetched: 10 row(s) > > > >> On Dec 15, 2014, at 16:02, Dario Taraborelli <dtarabore...@wikimedia.org >> <mailto:dtarabore...@wikimedia.org>> wrote: >> >> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on >> with the implementation. >> >>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <oke...@wikimedia.org >>> <mailto:oke...@wikimedia.org>> wrote: >>> >>> Totally! >>> >>> On 15 December 2014 at 14:22, Andrew Otto <ao...@wikimedia.org >>> <mailto:ao...@wikimedia.org>> wrote: >>> Ah cool, didn’t realize there was a neutral definition. We should call >>> that the ‘formal specification’ then. >>> >>>> ...of course, now that I've said that, cosmic irony demands we end up >>>> implementing in C, or something. >>> Hm, a UDF that does this rather than a Hive query would probably be better. >>> E.g. >>> >>> SELECT >>> request_qualifier(uri_host), >>> count(*) >>> FROM >>> wmf_raw.webrequest >>> WHERE >>> is_pageview(uri_host, uri_path, http_status, content_type) >>> GROUP BY >>> request_qualifier(uri_host) >>> ; >>> >>> >>> Or something like that. >>> >>> -Ao >>> >>> >>> >>> >>> >>> >>>> On Dec 15, 2014, at 14:07, Oliver Keyes <oke...@wikimedia.org >>>> <mailto:oke...@wikimedia.org>> wrote: >>>> >>>> It's totally tech-agnostic; the neutral definition is on meta. The hive >>>> query is just because, since we suspect that's how we'll be generating the >>>> data, it makes sense to turn the draft def into HQL for exploratory >>>> queries and testing. >>>> >>>> ...of course, now that I've said that, cosmic irony demands we end up >>>> implementing in C, or something. >>>> >>>> On 15 December 2014 at 13:46, Toby Negrin <tneg...@wikimedia.org >>>> <mailto:tneg...@wikimedia.org>> wrote: >>>> I think the hive code is "representative" in that it's an implementation. >>>> It's certainly not the only permitted one. >>>> >>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <ao...@wikimedia.org >>>> <mailto:ao...@wikimedia.org>> wrote: >>>> >>>>>> We're moving forward to generate Hive queries that will represent the >>>>>> formal specification. >>>>> Should a specific implementation (e.g. Hive) represent the formal >>>>> specification? I tend to think it should be tech-agnostic, no? >>>>> >>>>> >>>>> >>>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfa...@wikimedia.org >>>>>> <mailto:ahalfa...@wikimedia.org>> wrote: >>>>>> >>>>>> Toby, that's right. We're moving forward to generate Hive queries that >>>>>> will represent the formal specification. >>>>>> >>>>>> -Aaron >>>>>> >>>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org >>>>>> <mailto:oke...@wikimedia.org>> wrote: >>>>>> We've written the draft Hive queries and I'm reviewing them with Otto >>>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll >>>>>> work it through :). >>>>>> >>>>>> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org >>>>>> <mailto:tneg...@wikimedia.org>> wrote: >>>>>> Hi Aaron, all -- >>>>>> >>>>>> I haven't seen any discussion on this which is a sign that we can >>>>>> forward with turning over the draft. Thoughts? >>>>>> >>>>>> thanks, >>>>>> >>>>>> -Toby >>>>>> >>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfa...@wikimedia.org >>>>>> <mailto:ahalfa...@wikimedia.org>> wrote: >>>>>> Hey folks, >>>>>> >>>>>> As discussions on the new page view definition have been calming down, >>>>>> we're preparing to deliver a draft version to the Devs. I want to make >>>>>> sure that we all know the status and that any substantial concerns are >>>>>> raised before we hand things off on Friday, Dec 12th. >>>>>> >>>>>> For this phase, we are delivering the general filter[1]. This is the >>>>>> highest level filter, and exists primarily to distinguish requests >>>>>> worthy of further evaluation. Our plan is to take the definition as it >>>>>> exists on the 12th, and begin generating high-level aggregate numbers >>>>>> based on it. In future iterations, we will be digging into different >>>>>> breakdowns of this metric, and iterating on it to handle any >>>>>> inconsistencies or unexpected results. There's a few differences from >>>>>> Web Stat Collector's (WSC) version of the general filter that we want to >>>>>> call to your attention to. >>>>>> We include searches -- WSC explicitly excludes them. >>>>>> We include Apps traffic -- WSC does not detect Apps traffic >>>>>> We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC >>>>>> hardcodes "/wiki/" >>>>>> We don't include Banner impressions -- WSC includes them. >>>>>> There are also some known issues with the new definition that are worth >>>>>> your notice: >>>>>> >>>>>> Internal traffic is counted >>>>>> Note that WSC filters some internal traffic by hardcoding a set of IPs >>>>>> in the definition. We are working on parsing puppet templates in order >>>>>> to automatically detect which IPs represent internal traffic. This will >>>>>> be a /better/ solution, but it's not quite ready yet because parsing >>>>>> puppet is hard. >>>>>> Spider traffic is counted >>>>>> We will be using the User-agent field to detect and flag spider-based >>>>>> traffic. This "tag definition" will be delivered in a subsequent >>>>>> definition. This actually matches WSC, which does not filter spider for >>>>>> the high-level metrics. >>>>>> These are problems we're aware of, and will be factoring in as we go >>>>>> forward with our next task: refining the definition using real, >>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and >>>>>> participated in the process thus far, particularly Nemo, Erik, and >>>>>> Christian. >>>>>> >>>>>> 1. >>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters >>>>>> <https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters> >>>>>> >>>>>> -Aaron & Oliver >>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Oliver Keyes >>>>>> Research Analyst >>>>>> Wikimedia Foundation >>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>>> >>>> _______________________________________________ >>>> Analytics mailing list >>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>>> >>>> >>>> >>>> -- >>>> Oliver Keyes >>>> Research Analyst >>>> Wikimedia Foundation >>>> _______________________________________________ >>>> Analytics mailing list >>>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>> >>> >>> _______________________________________________ >>> Analytics mailing list >>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >>> >>> >>> >>> -- >>> Oliver Keyes >>> Research Analyst >>> Wikimedia Foundation >>> _______________________________________________ >>> Analytics mailing list >>> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> <https://lists.wikimedia.org/mailman/listinfo/analytics> >> >> _______________________________________________ >> Analytics mailing list >> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> >> https://lists.wikimedia.org/mailman/listinfo/analytics >> <https://lists.wikimedia.org/mailman/listinfo/analytics> > > > _______________________________________________ > Analytics mailing list > Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/analytics > <https://lists.wikimedia.org/mailman/listinfo/analytics> > > > > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation > > > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation > > _______________________________________________ > Analytics mailing list > Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/analytics > <https://lists.wikimedia.org/mailman/listinfo/analytics> > > _______________________________________________ > Analytics mailing list > Analytics@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics