Excellent! I'm assuming the spec is considered "final", pending any clarifying comparison with the WSC data?

thanks,

-Toby

On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
>
> We've written the draft Hive queries and I'm reviewing them with Otto now.
> Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it
> through :).
>
> On 15 December 2014 at 12:10, Toby Negrin <tneg...@wikimedia.org> wrote:
>>
>> Hi Aaron, all --
>>
>> I haven't seen any discussion on this, which is a sign that we can move
>> forward with turning over the draft. Thoughts?
>>
>> thanks,
>>
>> -Toby
>>
>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfa...@wikimedia.org>
>> wrote:
>>
>>> Hey folks,
>>>
>>> As discussions on the new page view definition have been calming down,
>>> we're preparing to deliver a draft version to the Devs. I want to make
>>> sure that we all know the status and that any substantial concerns are
>>> raised before we hand things off on *Friday, Dec 12th.*
>>>
>>> For this phase, we are delivering the general filter[1]. This is the
>>> highest level filter, and exists primarily to distinguish requests worthy
>>> of further evaluation. Our plan is to take the definition as it exists on
>>> the 12th, and begin generating high-level aggregate numbers based on it.
>>> In future iterations, we will be digging into different breakdowns of
>>> this metric, and iterating on it to handle any inconsistencies or
>>> unexpected results. There are a few differences from Web Stat
>>> Collector's (WSC) version of the general filter that we want to call to
>>> your attention.
>>>
>>> - We include searches -- WSC explicitly excludes them.
>>> - We include Apps traffic -- WSC does not detect Apps traffic.
>>> - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) --
>>>   WSC hardcodes "/wiki/".
>>> - We don't include Banner impressions -- WSC includes them.
>>>
>>> There are also some known issues with the new definition that are worth
>>> your notice:
>>>
>>> 1. *Internal traffic is counted*
>>>
>>>    - Note that WSC filters some internal traffic by hardcoding a set of
>>>      IPs in the definition. We are working on parsing puppet templates
>>>      in order to automatically detect which IPs represent internal
>>>      traffic. This will be a /better/ solution, but it's not quite ready
>>>      yet because parsing puppet is hard.
>>>
>>> 2. *Spider traffic is counted*
>>>
>>>    - We will be using the User-agent field to detect and flag
>>>      spider-based traffic. This "tag definition" will be delivered in a
>>>      subsequent definition. This actually matches WSC, which does not
>>>      filter spiders for the high-level metrics.
>>>
>>> These are problems we're aware of, and will be factoring in as we go
>>> forward with our next task: refining the definition using real,
>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>> participated in the process thus far, particularly Nemo, Erik, and
>>> Christian.
>>>
>>> [1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>
>>> -Aaron & Oliver
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>