Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with the implementation.
> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>
> Totally!
>
> On 15 December 2014 at 14:22, Andrew Otto <ao...@wikimedia.org> wrote:
> Ah cool, didn’t realize there was a neutral definition. We should call that
> the ‘formal specification’ then.
>
>> ...of course, now that I've said that, cosmic irony demands we end up
>> implementing in C, or something.
> Hm, a UDF that does this rather than a Hive query would probably be better.
> E.g.
>
> SELECT
>   request_qualifier(uri_host),
>   count(*)
> FROM
>   wmf_raw.webrequest
> WHERE
>   is_pageview(uri_host, uri_path, http_status, content_type)
> GROUP BY
>   request_qualifier(uri_host)
> ;
>
> Or something like that.
>
> -Ao
>
>> On Dec 15, 2014, at 14:07, Oliver Keyes <oke...@wikimedia.org> wrote:
>>
>> It's totally tech-agnostic; the neutral definition is on meta. The hive
>> query is just because, since we suspect that's how we'll be generating the
>> data, it makes sense to turn the draft def into HQL for exploratory queries
>> and testing.
>>
>> ...of course, now that I've said that, cosmic irony demands we end up
>> implementing in C, or something.
>>
>> On 15 December 2014 at 13:46, Toby Negrin <tneg...@wikimedia.org> wrote:
>> I think the hive code is "representative" in that it's an implementation.
>> It's certainly not the only permitted one.
>>
>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>>
>>>> We're moving forward to generate Hive queries that will represent the
>>>> formal specification.
>>> Should a specific implementation (e.g. Hive) represent the formal
>>> specification? I tend to think it should be tech-agnostic, no?
>>>
>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
>>>>
>>>> Toby, that's right. We're moving forward to generate Hive queries that
>>>> will represent the formal specification.
>>>>
>>>> -Aaron
>>>>
>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>>>> We've written the draft Hive queries and I'm reviewing them with Otto now.
>>>> Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it
>>>> through :).
>>>>
>>>> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org> wrote:
>>>> Hi Aaron, all --
>>>>
>>>> I haven't seen any discussion on this which is a sign that we can forward
>>>> with turning over the draft. Thoughts?
>>>>
>>>> thanks,
>>>>
>>>> -Toby
>>>>
>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
>>>> Hey folks,
>>>>
>>>> As discussions on the new page view definition have been calming down,
>>>> we're preparing to deliver a draft version to the Devs. I want to make
>>>> sure that we all know the status and that any substantial concerns are
>>>> raised before we hand things off on Friday, Dec 12th.
>>>>
>>>> For this phase, we are delivering the general filter[1]. This is the
>>>> highest level filter, and exists primarily to distinguish requests worthy
>>>> of further evaluation. Our plan is to take the definition as it exists on
>>>> the 12th, and begin generating high-level aggregate numbers based on it.
>>>> In future iterations, we will be digging into different breakdowns of this
>>>> metric, and iterating on it to handle any inconsistencies or unexpected
>>>> results. There's a few differences from Web Stat Collector's (WSC)
>>>> version of the general filter that we want to call to your attention to.
>>>>
>>>>   We include searches -- WSC explicitly excludes them.
>>>>   We include Apps traffic -- WSC does not detect Apps traffic
>>>>   We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC
>>>>   hardcodes "/wiki/"
>>>>   We don't include Banner impressions -- WSC includes them.
>>>>
>>>> There are also some known issues with the new definition that are worth
>>>> your notice:
>>>>
>>>>   Internal traffic is counted
>>>>     Note that WSC filters some internal traffic by hardcoding a set of IPs
>>>>     in the definition. We are working on parsing puppet templates in order
>>>>     to automatically detect which IPs represent internal traffic. This will
>>>>     be a /better/ solution, but it's not quite ready yet because parsing
>>>>     puppet is hard.
>>>>
>>>>   Spider traffic is counted
>>>>     We will be using the User-agent field to detect and flag spider-based
>>>>     traffic. This "tag definition" will be delivered in a subsequent
>>>>     definition. This actually matches WSC, which does not filter spider
>>>>     for the high-level metrics.
>>>>
>>>> These are problems we're aware of, and will be factoring in as we go
>>>> forward with our next task: refining the definition using real,
>>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>> Christian.
>>>>
>>>> 1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>
>>>> -Aaron & Oliver
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
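
P.S. To make the UDF idea in the quoted thread a bit more concrete, here is a rough Python sketch of the kind of predicate that Andrew's is_pageview(uri_host, uri_path, http_status, content_type) example implies. Only the function name and its four arguments come from the thread. The host pattern, path pattern, status codes and content types below are placeholder assumptions of mine, not the definition on meta, and it does not model searches, Apps traffic or banner handling. The request_qualifier() grouping is also left out; the helper simply groups on uri_host.

```python
import re
from collections import Counter

# Every constant here is an illustrative assumption, NOT the definition on meta.
ASSUMED_HOST = re.compile(r"\.(wikipedia|wiktionary|wikimedia)\.org$")
# "/wiki/" plus language-variant prefixes such as /zh-tw/, /zh-cn/, /sr-ec/
ASSUMED_PATH = re.compile(r"^/(wiki|[a-z]{2,3}-[a-z]{2,4})/")
ASSUMED_STATUSES = {"200", "304"}
ASSUMED_CONTENT_TYPES = ("text/html",)


def is_pageview(uri_host, uri_path, http_status, content_type):
    """Return True if a request passes this toy version of a general filter."""
    return (
        ASSUMED_HOST.search(uri_host) is not None
        and ASSUMED_PATH.match(uri_path) is not None
        and http_status in ASSUMED_STATUSES
        and content_type.startswith(ASSUMED_CONTENT_TYPES)
    )


def count_pageviews(rows):
    """Count passing requests per host.

    rows is an iterable of (uri_host, uri_path, http_status, content_type)
    tuples. This mirrors the SELECT ... GROUP BY in the thread, except that
    it groups on uri_host directly instead of request_qualifier(uri_host).
    """
    return Counter(row[0] for row in rows if is_pageview(*row))


if __name__ == "__main__":
    sample = [
        ("en.wikipedia.org", "/wiki/Main_Page", "200", "text/html; charset=UTF-8"),
        ("zh.wikipedia.org", "/zh-tw/Foo", "200", "text/html"),
        ("en.wikipedia.org", "/static/logo.png", "200", "image/png"),
    ]
    for host, n in count_pageviews(sample).items():
        print(host, n)
```

The point is only that keeping the predicate in one function lets the aggregation stay the same whichever technology ends up hosting it.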
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics