Re: [Analytics] [Wiki-research-l] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

John Mark Vandenberg Wed, 14 Jan 2015 00:39:24 -0800

On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes <ironho...@gmail.com> wrote:
> I'm confused; john, could you point to the element of the collected data
> that isn't collected already by default in any Nginx or Apache setup? I
> agree that there might be a lack of user expectation, but 'silently
> capturing behavioral data' seems somewhat hyperbolic to describe what's
> actually going on.

The proposed element to be added is geolocation below country level.
Default Nginx and Apache log formats do not include geolocation.
Which is why this research proposal exists and is being discussed, and
rightly so.

fwiw, the Nginx geoip module is not even included, by default, when
compiling the source code.

As the paper explicitly describes, and as is a common theme in research
proposals, Wikimedia access log information is captured user reading
behaviour.

The old privacy and data retention policies gave users the expectation
that access log data was destroyed after a set period, assumed to be
only three months as that was the limit of Checkuser visibility. The
current policies are more like "yes we collect a lot of data about
users, using tracking technology, and please trust us." And "sorry we
don't honour 'Don't track us', as we presumed that you trust us and the
researchers that we allow to access our analytics."

We should be planning for what will be the effect when the WMF servers
are hacked and _all_ of the analytics data is now in the hands of a
repressive government or similar. Or, imagine the WMF sends the
analytics data across an insecure link which is tapped and the data
reconstructed, either due to not using secure links at all, or an
accidental routing problem.
https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html

If/When that day comes, hopefully they don't have much data to make
inferences from, and what data they obtain can be well justified.

Having a quick peek, I thought it was odd that browsing Wikimedia sites
now causes impressions to be sent back to the WMF servers with the
country of the user included. "This is a workaround to simplify
analytics."
https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee8775a5df9f68857a337efadbb2b5d36811f1a/special%2FSpecialRecordImpression.php

The more you collect, especially using multiple systems to collect
similar data, the more likely that if subpoenaed, WMF's various
datasets could be used to infer a pretty reliable answer to "which
days in 2013 was John Vandenberg in Indonesia?", or "when did John
Vandenberg first read the Wikipedia article about <bomb making
ingredient>?" The more you publish, even aggregated, the more likely
these types of questions can be inferred without a subpoena, at least
for users with large enough lists of public contributions, by
scientists like yourself with lots of computation power and plenty of
time on their hands rifling through the data to *infer* the identity
of editors, and if it is a government body they also have lots of
other datasets which can be used to assist in the task.

Adding fine-grained geolocation information to published page views is
an example of the latter and the paper wisely suggests not including
logged in users as a possible solution to some of the privacy issues.

There is also the problem that many IPs can be easily inferred to be a
single cohort of people in some situations. e.g. in regions where the
only large collection of computers is a single facility, e.g. a
school. In a repressive regime especially, that could lead to
official questions being asked like: why were so many students at this
school reading about <blah> on <date>. And teachers being identified
as responsible, etc.

The paper treats IP users and logged-in users as a binary set.
However, tools already exist which exploit the fact that logged-in
users sometimes make a logged-out edit which identifies their IP. Add
geolocation of pageviews and we can infer the probability that edits
from other IPs in their smallest geolocation block are by the same
person, as the algorithm in the paper leaks 'number of active editors
in each region each day'.

The purpose of this proposed change in analytics is summarised in the paper:

'In short, the current global aggregation of Wikipedia page view is
unsuitable for an operational disease monitoring system. There will be
no “Wikipedia Flu Trends” unless page view data are aggregated at a
finer geographic scale.'

If "Wikipedia Flu Trends" is the justification, we had better be
certain that detecting Flu Trends using Wikipedia is going to be the
most effective method, and isn't just an academically interesting
exercise. A limited trial to determine utility would be helpful to
establish if "Wikipedia Flu Trends" is a viable world health solution
worthy of justifying additional data retention and publishing of
aggregates.

Is there a minimum threshold at which views of a page mean it becomes
'interesting' to analyse using finer grained geographic data? I
suspect that pages with only hundreds of page views per day are not
particularly useful for "Wikipedia Flu Trends".
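
To make the threshold idea concrete, here is a rough sketch in Python.
Everything in it is my own assumption rather than anything from the
paper: the function name, the (page, region) record layout, and the
cut-off of 1000 views per day are all hypothetical. It only emits
per-region counts for pages whose total daily views reach the cut-off.

```python
from collections import Counter


def regional_counts(views, min_daily_views=1000):
    """Aggregate one day of page views by (page, region).

    views: iterable of (page, region) tuples (hypothetical layout).
    Pages whose total views for the day fall below min_daily_views
    are dropped entirely, so no regional breakdown is produced for them.
    """
    views = list(views)
    per_page = Counter(page for page, _region in views)  # total views per page
    per_cell = Counter(views)                            # views per (page, region)
    return {
        cell: count
        for cell, count in per_cell.items()
        if per_page[cell[0]] >= min_daily_views
    }
```

Note that this filters on the page total only; a popular page could
still yield a region with a very small count, which would need its own
treatment.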

Also, does "Wikipedia Flu Trends" need to have access to geographically
tagged page view data of, say, me reading
http://en.wiktionary.org/wiki/bota today? Is there a way to restrict
which types of pages are tracked at finer geographic granularity
without adversely affecting the "Wikipedia Flu Trends" graph?
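
One way such a restriction might look, as a sketch only: keep an
explicit list of pages that are relevant to the monitoring task, and
record anything else at country level. The function name, the
parameters and the idea of a hand-maintained page list are hypothetical
and are not proposed in the paper.

```python
def location_label(page, country, region, fine_grained_pages):
    """Return the location to record for a single page view.

    fine_grained_pages: set of page titles approved for sub-country
    aggregation (hypothetical allowlist). Every other page is recorded
    at country level only.
    """
    if page in fine_grained_pages:
        return f"{country}-{region}"
    return country
```

With an allowlist of {"Influenza"}, a view of "Influenza" from region
JK in country ID would be recorded as "ID-JK", while a view of "bota"
from the same place would be recorded as just "ID".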

--
John Vandenberg

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
