Gonna paste your reply on the ticket <https://phabricator.wikimedia.org/T184793> and respond there.
On Wed, Feb 7, 2018 at 1:29 PM, Tilman Bayer <tba...@wikimedia.org> wrote:

> On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >> It will create significant discrepancies with the existing geolocation
> >> data we record for pageviews
> >
> > If you only need country (or whatever is in the cookie), then likely
> > whatever the output dataset is would only include country when selecting
> > from pageviews. If you need more than country (it sounded like you
> > didn’t), then we can get into doing the IP Geocoding in EventLogging,
> > but there are a few technical complications here, and we’d prefer not
> > to have to do this if we don’t have to.
>
> As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email),
> the goal is to record metadata consistent with our existing
> content consumption measurement, concretely: the fields available in
> the pageview_hourly table. See
> https://phabricator.wikimedia.org/T186728 for details (also regarding
> other fields that are not in EL by default but are likewise generated
> in a standard fashion for webrequest/pageview data).
>
> I appreciate it will need a bit of engineering work to implement your
> proposal of reusing the existing UDF that underlies the pageview data
> for the new preview data. But it will serve to avoid a lot of data
> limitations and headaches for years to come. To highlight just one
> aspect: If we relied on the cookie, the data would be inconsistent
> from the start because not all clients accept cookies. When we want to
> know (say) the ratio of previews to pageviews in a particular country,
> we don't want to have to embark on a research project estimating the
> number of cookie-less pageviews in that country. And so on.
>
> > On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer <tba...@wikimedia.org>
> > wrote:
> >>
> >> Thanks everyone! Separate from Sam's mapping out the frontend
> >> instrumentation work at https://phabricator.wikimedia.org/T184793 ,
> >> I have created a task for the backend work at
> >> https://phabricator.wikimedia.org/T186728 based on this thread.
> >>
> >> Regarding the last few posts about the geolocation information, from
> >> the data analysis perspective, there is indeed another, more serious
> >> concern about using the GeoIP cookie: It will create significant
> >> discrepancies with the existing geolocation data we record for
> >> pageviews, where we have chosen to derive this information from the
> >> IP instead. (Remember the overarching goal here of measuring page
> >> previews the same way we measure page views currently; the basic
> >> principle is that if a reader visits a page and then uses the page
> >> preview feature on that page to read preview cards, all the metadata
> >> that is recorded for both should have identical values for both the
> >> preview and the pageview.) Therefore, we should go with the kind of
> >> solution Andrew outlined above (adapting/reusing GetGeoDataUDF or
> >> such).
> >>
> >> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >>>
> >>> Wow Sam, yeah, if this cookie works for you, it will make many things
> >>> much easier for us. Check it out and let us know. If it doesn’t work
> >>> for some reason, we can figure out the backend geocoding part.
> >>>
> >>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith <samsm...@wikimedia.org>
> >>> wrote:
> >>>>
> >>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto <o...@wikimedia.org>
> >>>> wrote:
> >>>>>
> >>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
> >>>>> > varnishkafka instance [0]
> >>>>>
> >>>>> I’m not familiar with this cookie, but, if we used it, I thought it
> >>>>> would be sent back by the client in the event. E.g.
> >>>>> event.country = response.headers.country; EventLogging.emit(event);
> >>>>>
> >>>>> That way, there’s no additional special logic needed on the server
> >>>>> side to geocode or populate the country in the event.
> >>>>
> >>>> Hah! I didn't think about accessing the GeoIP cookie on the client.
> >>>> As you say, the implementation is quite easy.
> >>>>
> >>>> My only concern with this approach is the duplication of the value
> >>>> between the cookie, which is sent in every HTTP request to the
> >>>> /beacon/event endpoint, and the event itself. This duplication seems
> >>>> reasonable when balanced against capturing either: the client IP and
> >>>> then doing similar geocoding further along in the pipeline; or the
> >>>> cookie for all requests to that endpoint and then discarding them
> >>>> further along in the pipeline. It also reflects a seemingly core
> >>>> principle of the EventLogging system: that it doesn't capture
> >>>> potentially PII by default.
> >>>>
> >>>> -Sam
> >>>>
> >>>> _______________________________________________
> >>>> Analytics mailing list
> >>>> Analytics@lists.wikimedia.org
> >>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >> --
> >> Tilman Bayer
> >> Senior Analyst
> >> Wikimedia Foundation
> >> IRC (Freenode): HaeB
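
For readers of the ticket, here is a rough sketch of the client-side idea Andrew and Sam discuss in the quoted thread: read the country from the GeoIP cookie and copy it into the event before it is sent. Everything in it is an assumption made for illustration: the cookie name "GeoIP", the colon-separated value with the country code first, and the PreviewEvent shape and function names, which are hypothetical and not part of EventLogging. The actual send call is left out on purpose.

```typescript
// Sketch only. Assumed, not confirmed by the thread: the cookie is named
// "GeoIP" and its value is colon-separated with the country code first.
// PreviewEvent and both functions are hypothetical names.
interface PreviewEvent {
  pageTitle: string;
  country?: string;
}

// Return the two-letter country code found in a raw cookie string, or
// undefined when the cookie is missing or does not look as assumed.
function countryFromCookieHeader(cookieHeader: string): string | undefined {
  for (const part of cookieHeader.split(";")) {
    const trimmed = part.trim();
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue;
    if (trimmed.slice(0, eq) !== "GeoIP") continue;
    const country = trimmed.slice(eq + 1).split(":")[0] ?? "";
    return /^[A-Z]{2}$/.test(country) ? country : undefined;
  }
  return undefined;
}

// Copy the event and attach the country when one is available. Clients
// that do not accept cookies produce events without a country field.
function withCountry(event: PreviewEvent, cookieHeader: string): PreviewEvent {
  const country = countryFromCookieHeader(cookieHeader);
  return country === undefined ? { ...event } : { ...event, country };
}
```

In a browser the cookie string would come from document.cookie. The undefined branch is the case Tilman raises above: events from cookie-less clients carry no country, so they cannot be compared directly with IP-derived pageview data.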