Re: [Wikitech-l] Unbreaking statistics

2009-06-07 Thread John at Darkstar
Some articles are referenced so seldom that a request for one can be used
to uniquely identify a machine. Then there are all the users who do
something that ends up in public logs. The latter are very difficult to
obfuscate, but the former can be solved by setting a time window long
enough that sufficient other traffic falls within the same window.
Unfortunately this window is pretty long for some articles; from some
tests it seems to be weeks on the Norwegian (Bokmål) Wikipedia.
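
A minimal sketch of that windowing rule, in Python (the input list, the
k threshold and the minimum window length are illustrative assumptions,
not anything that actually runs on the servers):

  # Only release a per-article view count once the reporting window
  # contains at least k views, so a single reader of an obscure article
  # is never the only data point in a published bucket.
  # `view_times` is a sorted list of Unix timestamps for one article,
  # taken from an already IP-stripped log; k is an arbitrary choice.
  def windowed_counts(view_times, k=10, min_window=3600):
      buckets = []
      start, count = None, 0
      for t in view_times:
          if start is None:
              start = t
          count += 1
          if count >= k and (t - start) >= min_window:
              buckets.append((start, t, count))   # safe to publish
              start, count = None, 0
      # anything left over is withheld until the next run reaches k views
      return buckets
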
John

Robert Rohde wrote:
 On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxw...@gmail.com> wrote:
 On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <raro...@gmail.com> wrote:
 There is a lot of private data in user agents (MSIE 4.123; WINNT 4.0;
 bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34 may be
 uniquely identifying). There is even private data in titles if you don't
 sanitize carefully
 (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
  There is private data in referrers
 (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).

 Things which individually do not appear to disclose anything private
 can disclose private things (look at the people uniquely identified by
 AOL's 'anonymized' search data).

 On the flip side, aggregation can take private things (e.g.
 user agents, IP info, referrers) and convert them to non-private data:
 top user agents, top referrers, highest-traffic ASNs... but it becomes
 potentially revealing if not done carefully: the 'top' network and
 user agent info for a single obscure article in a short time window
 may be information from only one or two users, not really an
 aggregation.

 Things like common paths through the site should be safe so long as
 they are not provided with too much temporal resolution, are limited
 to existing articles, and are limited to either really common paths or
 to paths broken into two- or three-node chains, with the least common
 of those withheld.

 Generally when dealing with private data you must approach it with the
 same attitude that a C coder must take to avoid buffer overflows.
 Treat all data as hostile, assume all actions are potentially
 dangerous. Try to figure out how to break it, and think deviously.
 
 On reflection I agree with you, though I think the biggest problem
 would actually be a case you didn't mention.  If one provided timing
 and page view information, then one could almost certainly single out
 individual users by correlating the view timing with edit histories.
 
 Okay, so no stripped logs.  The next question becomes what is the
 right way to aggregate.  We can A) reinvent the wheel, or B) adapt a
 pre-existing log analyzer configured to produce clean aggregate data.
 While I respect the work of Zachte and others, this might be a case
 where B is a better near-term solution.
 
 Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that
 started this mess), his AWStats config already suppresses IP info and
 aggregates everything into groups from which it is very hard to
 identify anything personal.  (There is still a small risk in allowing
 users to drill down to pages / requests that are almost never made,
 but perhaps that could be turned off.)  AWStats has native support for
 Squid logs and is open source.
 
 This is not necessarily the only option, but I suspect that if we gave
 it some thought it would be possible to find an off-the-shelf tool
 that would be good enough to support many wikis and configurable
 enough to satisfy even the GMaxwells of the world ;-).  huwiki is
 actually the 20th largest wiki (by number of edits), so if it worked
 for them, then a tool like AWStats can probably work for most of the
 projects (i.e. those other than EN).
 
 -Robert Rohde
 
 



Re: [Wikitech-l] Unbreaking statistics

2009-06-07 Thread John at Darkstar
Had to run and missed a couple of important items. One is that you can
calculate the likelihood that a link is missing (it's similar to Google's
PageRank). If the likelihood turns out to be too small you simply don't
report anything. You can also skip reporting if there is no intervening
search or search result page. You can also analyze the link structure at
the log server and skip logging of uninteresting items, but even more
interesting, this can be done client-side if sufficient information is
embedded in the pages. For this, note that a high likelihood of a missing
link implies few existing inbound links, so those can simply be embedded
in the page itself.
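
A rough sketch of that filtering rule (Python, as it might run on the
log server; the inbound-link count, the search flag and the threshold
are all assumed inputs):

  # Report a trail only when a missing link is plausible: the reader
  # went through a search, and the page finally found has few inbound
  # links.  Names and the threshold are illustrative only.
  def worth_reporting(trail, inbound_links, saw_search, max_inbound=5):
      """trail: list of page titles visited, ending at the page found."""
      if not saw_search:
          return False        # no intervening search: skip reporting
      if inbound_links > max_inbound:
          return False        # well-linked target: missing link unlikely
      return len(trail) >= 2  # need at least one candidate source page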

Analyzing a dumb log would be very costly indeed.

John

John at Darkstar wrote:
 I tried to convince myself to stay out of this thread, but this was
 somewhat interesting. ;)
 
 I'm not quite sure this will work out in every case, but my rough idea
 is like this:
 
 Imagine a user trying to get an answer to some kind of problem. He
 searches with Google and lands on the most obvious article about (for
 example) breweries, even if he really wants to know something about a
 specific beer (Grolsch or Mac or whatever). He can't find anything about
 it, so he makes an additional search (on the page, hopefully), gets a
 result list, reads through a lot of articles and then finally finds what
 he is looking for. Then he leaves.
 
 Now, imagine a function that pushes newly visited pages onto a small page
 list and a function popping that list each time the search result page
 is visited. The page list is stored in a cookie. This small page list is
 then reported to a special logging server by an AJAX request. It can't
 just piggyback on another request, as the final page usually will not
 lead to a new request; the user simply leaves.
 
 Later, a lot of such page lists can be analyzed and compared to the known
 link structure. If a pair of pages consistently emerges in the log
 without having a parent-child relation, then you know a link is missing.
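
 A sketch of that comparison (Python; `trails` and `linked` are assumed
 inputs: reported page lists, and a set of linked title pairs taken from
 the pagelinks data):

   # Count page pairs that co-occur in reported trails and flag pairs
   # that are never linked on the wiki.  min_support guards against
   # reporting anything based on only a handful of readers.
   from collections import Counter
   from itertools import combinations

   def missing_link_candidates(trails, linked, min_support=20):
       pair_counts = Counter()
       for trail in trails:
           for a, b in combinations(set(trail), 2):
               pair_counts[tuple(sorted((a, b)))] += 1
       return [(pair, n) for pair, n in pair_counts.items()
               if n >= min_support
               and pair not in linked
               and (pair[1], pair[0]) not in linked]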
 
 Some guesstimates say that you need more than 100 page views before
 something like this can detect obvious missing links. For the Norwegian
 (Bokmål) Wikipedia that is about 2-3 months of statistics for half the
 article base, but note that the accumulated stats could be corrected
 against the redirect and link information in the database, since links
 are very seldom dropped; they are usually only added.
 
 Well, something like that. I was wondering about running a test case, but
 given some previous discussion I concluded that I would get a go-ahead on this.
 
 It is also possible to analyze the article relations where the user goes
 back to Google's result list, but that is somewhat more involved.
 
 John
 
 Platonides wrote:
 John at Darkstar wrote:
 If someone wants to work on this I have some ideas for making something
 useful out of this log, but I'm a bit short on time. Basically there are
 two ideas that are really useful: one is to figure out which articles are
 most interesting to show on a portal, and the other is how to detect
 articles with missing links between them.
 John
 How are you planning to detect articles which 'should have links'
 between them?


 



[Wikitech-l] Unbreaking statistics

2009-06-05 Thread Peter Gervai
Hello,

I see I've created quite a stir, but so far nothing really
useful has popped up. :-(

But I see that one from Neil:
 Yes, modifying the http://stats.grok.se/ systems looks like the way to go.

To me it doesn't really seem to be, since it appears to use an
extremely dumbed-down version of the input, which only contains page views
and [unreliable] byte counters. Most probably it would require large
rewrites, and a magical new data source.

 What do people actually want to see from the traffic data? Do they want
 referrers, anonymized user trails, or what?

Are you old enough to remember stats.wikipedia.org? As far as I
remember it originally ran webalizer, then something else, then
nothing. If you check a webalizer report you'll see what's in it. We are
using (or were using, until our nice fellow editors broke it) awstats,
which basically provides the same with more caching.

The most used and useful stats are page views (daily and hourly breakdowns
are pretty useful too), referrers, visitor domain and provider stats, OS
and browser stats, screen resolution stats, bot activity stats, and
visitor duration and depth, among probably others.

At a brief glance I could replicate the grok.se stats easily, since it
seems to work off http://dammit.lt/wikistats/, but that data is completely
useless for anything beyond page hit counts.
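
(Replicating those per-page counts is roughly the following, assuming the
hourly wikistats files use a one-line-per-page "project title count bytes"
layout; worth checking against an actual file before relying on it.)

  # Sketch: per-page hit counts for one project from an hourly dump file.
  import gzip
  from collections import Counter

  def page_hits(path, project="hu"):
      counts = Counter()
      with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
          for line in f:
              parts = line.split(" ")
              if len(parts) != 4 or parts[0] != project:
                  continue
              title, hits = parts[1], parts[2]
              counts[title] += int(hits)
      return counts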

Is there a possibility to write code which processes the raw squid data?
Who do I have to bribe? :-/

-- 
 byte-byte,
grin



Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Alex
Peter Gervai wrote:
 Hello,
 
 I see I've created quite a stir, but so far nothing really
 useful has popped up. :-(
 
 But I see that one from Neil:
 Yes, modifying the http://stats.grok.se/ systems looks like the way to go.
 
 To me it doesn't really seem to be, since it appears to use an
 extremely dumbed-down version of the input, which only contains page views
 and [unreliable] byte counters. Most probably it would require large
 rewrites, and a magical new data source.
 
 What do people actually want to see from the traffic data? Do they want
 referrers, anonymized user trails, or what?
 
 Are you old enough to remember stats.wikipedia.org? As far as I
 remember it originally ran webalizer, then something else, then
 nothing. If you check a webalizer report you'll see what's in it. We are
 using (or were using, until our nice fellow editors broke it) awstats,
 which basically provides the same with more caching.
 
 The most used and useful stats are page views (daily and hourly breakdowns
 are pretty useful too), referrers, visitor domain and provider stats, OS
 and browser stats, screen resolution stats, bot activity stats, and
 visitor duration and depth, among probably others.
 
 At a brief glance I could replicate the grok.se stats easily, since it
 seems to work off http://dammit.lt/wikistats/, but that data is completely
 useless for anything beyond page hit counts.
 
 Is there a possibility to write code which processes the raw squid data?
 Who do I have to bribe? :-/
 

We do have http://stats.wikimedia.org/ which includes things like
http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm

-- 
Alex (wikipedia:en:User:Mr.Z-man)



Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Robert Rohde
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarl...@wikimedia.org> wrote:
 Peter Gervai wrote:
 Is there a possibility to write code which processes the raw squid data?
 Who do I have to bribe? :-/

 Yes it's possible. You just need to write a script that accepts a log
 stream on stdin and builds the aggregate data from it. If you want
 access to IP addresses, it needs to run on our own servers with only
 anonymised data being passed on to the public.

 http://wikitech.wikimedia.org/view/Squid_logging
 http://wikitech.wikimedia.org/view/Squid_log_format


How much of that is really considered private?  IP addresses
obviously, anything else?

I'm wondering if a cheap and dirty solution (at least for the low
traffic wikis) might be to write a script that simply scrubs the
private information and makes the rest available for whatever
applications people might want.
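
A bare-bones sketch of the kind of stdin script Tim describes, with the
scrubbing reduced to pure aggregation (the field position is a placeholder;
the real column layout is the one documented on the Squid_log_format page
above):

  # Read the squid log stream from stdin, ignore the client IP entirely,
  # and emit only aggregate per-URL hit counts.
  import sys
  from collections import Counter

  URL_FIELD = 8          # assumption: adjust to the actual log format
  counts = Counter()
  for line in sys.stdin:
      fields = line.split()
      if len(fields) > URL_FIELD:
          counts[fields[URL_FIELD]] += 1
  for url, n in counts.most_common():
      print(n, url)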

-Robert Rohde



Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Gregory Maxwell
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <raro...@gmail.com> wrote:
 On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarl...@wikimedia.org> wrote:
 Peter Gervai wrote:
 Is there a possibility to write code which processes the raw squid data?
 Who do I have to bribe? :-/

 Yes it's possible. You just need to write a script that accepts a log
 stream on stdin and builds the aggregate data from it. If you want
 access to IP addresses, it needs to run on our own servers with only
 anonymised data being passed on to the public.

 http://wikitech.wikimedia.org/view/Squid_logging
 http://wikitech.wikimedia.org/view/Squid_log_format


 How much of that is really considered private?  IP addresses
 obviously, anything else?

 I'm wondering if a cheap and dirty solution (at least for the low
 traffic wikis) might be to write a script that simply scrubs the
 private information and makes the rest available for whatever
 applications people might want.

There is a lot of private data in user agents (MSIE 4.123; WINNT 4.0;
bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34 may be
uniquely identifying). There is even private data in titles if you don't
sanitize carefully
(/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
 There is private data in referrers
(http://rarohde.com/url_that_only_rarohde_would_have_comefrom).

Things which individually do not appear to disclose anything private
can disclose private things (look at the people uniquely identified by
AOL's 'anonymized' search data).

On the flip side, aggregation can take private things (e.g.
user agents, IP info, referrers) and convert them to non-private data:
top user agents, top referrers, highest-traffic ASNs... but it becomes
potentially revealing if not done carefully: the 'top' network and
user agent info for a single obscure article in a short time window
may be information from only one or two users, not really an
aggregation.
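
A sketch of that kind of thresholded aggregation (Python; `records`, the
k cut-off and top_n are assumed, illustrative inputs):

  # Publish the top user agents per article only when each released
  # value was seen on at least k requests, and publish nothing at all
  # for articles with too little traffic to aggregate safely.
  from collections import Counter, defaultdict

  def top_user_agents(records, k=25, top_n=10):
      per_article = defaultdict(Counter)
      for article, ua in records:          # records: (article, user_agent)
          per_article[article][ua] += 1
      released = {}
      for article, counter in per_article.items():
          safe = [(ua, n) for ua, n in counter.most_common(top_n) if n >= k]
          if safe:
              released[article] = safe
      return released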

Things like common paths through the site should be safe so long as
they are not provided with too much temporal resolution, are limited
to existing articles, and are limited to either really common paths or
to paths broken into two- or three-node chains, with the least common
of those withheld.
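
And the path-chunking rule might look roughly like this (again a sketch;
`trails` and `existing_titles` are assumed inputs):

  # Break each trail into two- and three-node chains over existing
  # articles and keep only chains seen often enough to be releasable.
  from collections import Counter

  def releasable_chains(trails, existing_titles, min_count=50):
      chains = Counter()
      for trail in trails:
          trail = [t for t in trail if t in existing_titles]
          for size in (2, 3):
              for i in range(len(trail) - size + 1):
                  chains[tuple(trail[i:i + size])] += 1
      return {chain: n for chain, n in chains.items() if n >= min_count}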

Generally when dealing with private data you must approach it with the
same attitude that a C coder must take to avoid buffer overflows.
Treat all data as hostile, assume all actions are potentially
dangerous. Try to figure out how to break it, and think deviously.


Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Brian
Scrubbing log files to make the data private is hard work. You'd be
impressed by what researchers have been able to do: taking purportedly
anonymous data and using it to identify users en masse by correlating it
with publicly available data from other sites such as Amazon, Facebook and
Netflix. Make no mistake: if you don't do it carefully you will become the
target of, in the best of cases, an academic researcher who wants to prove
that you don't understand statistics.

On Fri, Jun 5, 2009 at 8:13 PM, Robert Rohde raro...@gmail.com wrote:

 On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarl...@wikimedia.org>
 wrote:
  Peter Gervai wrote:
  Is there a possibility to write code which processes the raw squid data?
  Who do I have to bribe? :-/
 
  Yes it's possible. You just need to write a script that accepts a log
  stream on stdin and builds the aggregate data from it. If you want
  access to IP addresses, it needs to run on our own servers with only
  anonymised data being passed on to the public.
 
  http://wikitech.wikimedia.org/view/Squid_logging
  http://wikitech.wikimedia.org/view/Squid_log_format
 

 How much of that is really considered private?  IP addresses
 obviously, anything else?

 I'm wondering if a cheap and dirty solution (at least for the low
 traffic wikis) might be to write a script that simply scrubs the
 private information and makes the rest available for whatever
 applications people might want.

 -Robert Rohde




Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Robert Rohde
On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxw...@gmail.com> wrote:
 On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <raro...@gmail.com> wrote:
 There is a lot of private data in user agents (MSIE 4.123; WINNT 4.0;
 bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34 may be
 uniquely identifying). There is even private data in titles if you don't
 sanitize carefully
 (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
  There is private data in referrers
 (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).

 Things which individually do not appear to disclose anything private
 can disclose private things (look at the people uniquely identified by
 AOL's 'anonymized' search data).

 On the flip side, aggregation can take private things (e.g.
 user agents, IP info, referrers) and convert them to non-private data:
 top user agents, top referrers, highest-traffic ASNs... but it becomes
 potentially revealing if not done carefully: the 'top' network and
 user agent info for a single obscure article in a short time window
 may be information from only one or two users, not really an
 aggregation.

 Things like common paths through the site should be safe so long as
 they are not provided with too much temporal resolution, are limited
 to existing articles, and are limited to either really common paths or
 to paths broken into two- or three-node chains, with the least common
 of those withheld.

 Generally when dealing with private data you must approach it with the
 same attitude that a C coder must take to avoid buffer overflows.
 Treat all data as hostile, assume all actions are potentially
 dangerous. Try to figure out how to break it, and think deviously.

On reflection I agree with you, though I think the biggest problem
would actually be a case you didn't mention.  If one provided timing
and page view information, then one could almost certainly single out
individual users by correlating the view timing with edit histories.
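
(To make the concern concrete, here is a sketch of that correlation, with
both inputs assumed: a released (page, timestamp) view log and the public
(user, page, timestamp) edit history.)

  # An edit is almost always preceded by a view of the same page a few
  # seconds or minutes earlier, so matching the two singles out the
  # editor's row in an otherwise "anonymous" view log.
  def correlated_views(view_log, edit_history, window=120):
      hits = []
      for user, page, edit_ts in edit_history:
          for view_page, view_ts in view_log:
              if view_page == page and 0 <= edit_ts - view_ts <= window:
                  hits.append((user, page, view_ts))
      return hits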

Okay, so no stripped logs.  The next question becomes what is the
right way to aggregate.  We can A) reinvent the wheel, or B) adapt a
pre-existing log analyzer configured to produce clean aggregate data.
While I respect the work of Zachte and others, this might be a case
where B is a better near-term solution.

Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that
started this mess), his AWStats config already suppresses IP info and
aggregates everything into groups from which it is very hard to
identify anything personal.  (There is still a small risk in allowing
users to drill down to pages / requests that are almost never made,
but perhaps that could be turned off.)  AWStats has native support for
Squid logs and is open source.

This is not necessarily the only option, but I suspect that if we gave
it some thought it would be possible to find an off-the-shelf tool
that would be good enough to support many wikis and configurable
enough to satisfy even the GMaxwells of the world ;-).  huwiki is
actually the 20th largest wiki (by number of edits), so if it worked
for them, then a tool like AWStats can probably work for most of the
projects (i.e. those other than EN).

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l