Gonna paste your reply on the ticket
<https://phabricator.wikimedia.org/T184793> and respond there.



On Wed, Feb 7, 2018 at 1:29 PM, Tilman Bayer <tba...@wikimedia.org> wrote:

> On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >> It will create significant discrepancies with the existing geolocation
> >> data we record for pageviews
> > If you only need country (or whatever is in the cookie), then likely
> > whatever the output dataset is would only include country when selecting
> > from pageviews.  If you need more than country (it sounded like you
> > didn’t), then we can get into doing the IP geocoding in EventLogging, but
> > there are a few technical complications here, and we’d prefer not to have
> > to do this if we don’t have to.
>
> As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email),
> the goal is to record metadata consistent with our existing
> content consumption measurement, concretely: the fields available in
> the pageview_hourly table. See
> https://phabricator.wikimedia.org/T186728 for details (also regarding
> other fields that are not in EL by default but are likewise generated
> in a standard fashion for webrequest/pageview data).
>
> I appreciate it will need a bit of engineering work to implement your
> proposal of reusing the existing UDF that underlies the pageview data
> for the new preview data. But it will serve to avoid a lot of data
> limitations and headaches for years to come. To highlight just one
> aspect: If we relied on the cookie, the data would be inconsistent
> from the start because not all clients accept cookies. When we want to
> know (say) the ratio of previews to pageviews in a particular country,
> we don't want to have to embark on a research project estimating the
> number of cookie-less pageviews in that country. And so on.
>
>
> >
> > On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer <tba...@wikimedia.org> wrote:
> >>
> >> Thanks everyone! Separate from Sam's mapping out the frontend
> >> instrumentation work at https://phabricator.wikimedia.org/T184793 , I
> >> have created a task for the backend work at
> >> https://phabricator.wikimedia.org/T186728 based on this thread.
> >>
> >> Regarding the last few posts about the geolocation information, from the
> >> data analysis perspective, there is indeed another, more serious concern
> >> about using the GeoIP cookie: It will create significant discrepancies
> >> with the existing geolocation data we record for pageviews, where we have
> >> chosen to derive this information from the IP instead. (Remember the
> >> overarching goal here of measuring page previews the same way we measure
> >> page views currently; the basic principle is that if a reader visits a
> >> page and then uses the page preview feature on that page to read preview
> >> cards, all the metadata that is recorded should have identical values for
> >> the preview and the pageview.) Therefore, we should go with the kind of
> >> solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
> >>
> >> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >>>
> >>> Wow Sam, yeah, if this cookie works for you, it will make many things
> >>> much easier for us.  Check it out and let us know.  If it doesn’t work
> >>> for some reason, we can figure out the backend geocoding part.
> >>>
> >>>
> >>>
> >>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith <samsm...@wikimedia.org> wrote:
> >>>>
> >>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >>>>>
> >>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
> >>>>> > varnishkafka instance [0]
> >>>>>
> >>>>> I’m not familiar with this cookie, but, if we used it, I thought it
> >>>>> would be sent back by the client in the event. E.g.
> >>>>> event.country = response.headers.country; EventLogging.emit(event);
> >>>>>
> >>>>> That way, there’s no additional special logic needed on the server
> >>>>> side to geocode or populate the country in the event.
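
For illustration only, here is a rough Python sketch of the client-side idea described above: read the country out of the GeoIP cookie and attach it to the event before it is emitted. The cookie value layout (colon-separated, country code first) is an assumption, and the names country_from_geoip_cookie and build_event are hypothetical; none of this is the actual EventLogging client code.

```python
from typing import Optional
from urllib.parse import unquote


def country_from_geoip_cookie(cookie_header: str) -> Optional[str]:
    """Return the country code from a GeoIP cookie, or None if it is absent.

    Assumption: the cookie value is colon-separated with the country code
    first, e.g. "US:CA:San_Francisco:37.77:-122.42:v4".
    """
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name == "GeoIP":
            country = unquote(value).split(":", 1)[0]
            return country or None
    return None


def build_event(cookie_header: str, event: dict) -> dict:
    """Copy the event and add a 'country' field when the cookie provides one."""
    enriched = dict(event)
    country = country_from_geoip_cookie(cookie_header)
    if country is not None:
        enriched["country"] = country
    return enriched
```

Note that a client without the cookie produces an event with no country field at all, so any such scheme has to decide how to treat those events downstream.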
> >>>>
> >>>>
> >>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
> >>>> you say, the implementation is quite easy.
> >>>>
> >>>> My only concern with this approach is the duplication of the value
> >>>> between the cookie, which is sent in every HTTP request to the
> >>>> /beacon/event endpoint, and the event itself. This duplication seems
> >>>> reasonable when balanced against capturing either: the client IP and
> >>>> then doing similar geocoding further along in the pipeline; or the
> >>>> cookie for all requests to that endpoint and then discarding them
> >>>> further along in the pipeline. It also reflects a seemingly core
> >>>> principle of the EventLogging system: that it doesn't capture
> >>>> potential PII by default.
> >>>>
> >>>> -Sam
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Analytics mailing list
> >>>> Analytics@lists.wikimedia.org
> >>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Tilman Bayer
> >> Senior Analyst
> >> Wikimedia Foundation
> >> IRC (Freenode): HaeB
> >>
> >>
> >
> >
> >
>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
>
