Re: [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

Priedhorsky, Reid Wed, 13 May 2015 12:56:31 -0700

Hi folks,

Reviving an old thread (my apologies for the delay). I’ve looked over this 
thread, the talk page linked below, and a few other places that seemed like 
they might have feedback for us.


It seemed to me that the key feedback, in addition to some technical 
suggestions, was:


  *   The ratio of logged-in to logged-out readers can be inferred.
  *   Think more carefully about whether reading patterns can be inferred for 
      anonymous editors.
  *   How to interpret the Do-Not-Track (DNT) header is controversial.

As for DNT, my main concern from the research perspective is whether 
interpreting DNT as exclusion from geo-aggregation would reduce the sample 
size excessively. Luis Villa’s link for Firefox numbers shows, for the desktop 
version, a peak of 11% in March 2013, declining to 8% at the end of the data 
in September 2014, and, for mobile users, a 17% peak in July 2012 with a 
similar decline to 5% in September 2014. With numbers like these, I believe 
the larger sample (i.e., DNT hits included in geo-aggregation) would indeed 
support somewhat more robust results, but the smaller sample (DNT hits 
excluded) is fine. I worry somewhat about growth in DNT use, but as long as 
DNT is not enabled by default, that’s probably not a major concern.
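
To put rough numbers on the sample-size point, here is a minimal sketch. It 
rests on assumptions that are not in the message above: that the Firefox DNT 
rates carry over to Wikipedia readers in general, that DNT users read the same 
things as everyone else (so excluding them shrinks the sample without biasing 
it), and that the standard error of an estimate scales as 1/sqrt(n), as it 
does for simple random sampling. The function names are hypothetical.

```python
import math

# DNT adoption rates quoted in the message above (Firefox only).
FIREFOX_DNT_RATES = {
    "desktop, peak (March 2013)": 0.11,
    "desktop, September 2014": 0.08,
    "mobile, peak (July 2012)": 0.17,
    "mobile, September 2014": 0.05,
}


def retained_fraction(dnt_rate):
    """Fraction of hits left if every DNT hit is excluded."""
    return 1.0 - dnt_rate


def standard_error_inflation(dnt_rate):
    """Factor by which a standard error grows when the sample shrinks.

    Assumes standard error is proportional to 1/sqrt(n) and that excluded
    hits are otherwise similar to the retained ones.
    """
    return 1.0 / math.sqrt(retained_fraction(dnt_rate))


for label, rate in FIREFOX_DNT_RATES.items():
    kept = retained_fraction(rate)
    inflation = standard_error_inflation(rate)
    print(f"{label}: keep {kept:.0%} of hits, standard error x{inflation:.3f}")
```

Under these assumptions, even the highest quoted rate widens the error bars 
by less than a tenth, which is consistent with the view that the smaller 
sample is fine. If DNT users differ systematically from other readers, the 
effect is a bias rather than just extra noise, and this arithmetic does not 
capture it.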

One thing that I would really like feedback on is: what is an acceptable k, 
i.e., how large must the set of users be from whom a specific user is 
indistinguishable? I believe this will have a significantly greater impact on 
the quality of our results than DNT.
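
To make the question about k concrete, here is a minimal sketch of 
threshold-based suppression: a (region, page) cell is published only if at 
least k distinct users contributed to it. This is an illustration, not the 
proposal's actual method. The input format, the function name, and the region 
and user labels are all hypothetical, and a real pipeline might roll a small 
cell up to a coarser region instead of dropping it.

```python
from collections import defaultdict


def aggregate_with_threshold(hits, k):
    """Count views per (region, page), keeping cells with >= k distinct users.

    hits: iterable of (user_id, region, page) tuples (hypothetical format).
    k:    minimum number of distinct users required to publish a cell.
    """
    views = defaultdict(int)   # total page views per cell
    users = defaultdict(set)   # distinct users seen per cell
    for user_id, region, page in hits:
        cell = (region, page)
        views[cell] += 1
        users[cell].add(user_id)
    # Suppress every cell backed by fewer than k distinct users.
    return {cell: n for cell, n in views.items() if len(users[cell]) >= k}


# Made-up example data: four views from three users in one region,
# one view from a single user in another.
example_hits = [
    ("u1", "NM", "Influenza"),
    ("u2", "NM", "Influenza"),
    ("u3", "NM", "Influenza"),
    ("u1", "NM", "Influenza"),
    ("u4", "CO", "Influenza"),
]

print(aggregate_with_threshold(example_hits, k=3))
# {('NM', 'Influenza'): 4}
```

The trade-off is visible even in this toy case: with k=1 both cells are 
published, while k=3 drops the single-user cell entirely. A larger k protects 
more but removes exactly the sparse, fine-grained cells.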

Please let me know if I’ve missed anything. I’d like to rev the proposal soon, 
and I’d like to make it responsive to what the community thinks.

Thanks,
Reid

[Just to be absolutely clear, I’m speaking for myself, not my employer.]

On 13 January 2015 at 07:26, Dario Taraborelli 
<dtarabore...@wikimedia.org> wrote:
>>
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los 
Alamos National Laboratory recently submitted to the Wikimedia Analytics Team 
aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data 
dumps and making them available to the public and the research community. [1]

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to 
monitor and forecast the spread of influenza and other diseases, using language 
as a proxy for location [2]. This proposal describes an aggregation strategy 
that adds a geographical dimension to the existing dumps.

Feedback on the proposal is welcome on the lists or on the project talk page 
on Meta [3].

Dario

[1] 
https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] 
https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
