Re: [Wikitech-l] Unbreaking statistics
Some articles are always very seldom visited, and those can be used to uniquely identify a machine. Then there are all those who do something that goes into public logs. The latter are very difficult to obfuscate, but the first case can be solved by setting a time frame long enough that sufficient alternate traffic falls within the same window. Unfortunately this time frame is pretty long for some articles; from some tests it seems to be weeks on Norsk (bokmål) Wikipedia.

John

Robert Rohde skrev:
> On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxw...@gmail.com> wrote:
>> On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <raro...@gmail.com> wrote:
>>
>> There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom). Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).
>>
>> On the flip side, aggregation can take private things (i.e. user agents, IP info, referrers) and convert them to non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
>> Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, limit themselves to existing articles, and either limit themselves to really common paths or break paths into two- or three-node chains and skip releasing the least common of those.
>>
>> Generally, when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.
>
> On reflection I agree with you, though I think the biggest problem would actually be a case you didn't mention. If one provided timing and page view information, then one could almost certainly single out individual users by correlating the view timing with edit histories.
>
> Okay, so no stripped logs. The next question becomes what is the right way to aggregate. We can A) reinvent the wheel, or B) adapt a pre-existing log analyzer in a mode that produces clean aggregate data. While I respect the work of Zachte and others, this might be a case where B is a better near-term solution. Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that started this mess), his AWStats config already suppresses IP info and aggregates everything into groups that make it very hard to identify anything personal. (There is still a small risk in allowing users to drill down to pages/requests that are almost never made, but perhaps that could be turned off.) AWStats has native support for Squid logs and is open source. This is not necessarily the only option, but I suspect that if we gave it some thought it would be possible to find an off-the-shelf tool that would be good enough to support many wikis and configurable enough to satisfy even the GMaxwells of the world ;-).
> huwiki is actually the 20th largest wiki (by number of edits), so if it worked for them, then a tool like AWStats can probably work for most of the projects (which are not EN).
>
> -Robert Rohde

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
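[Editorial note: John's suggestion above, widening the reporting window until enough alternate traffic accumulates, amounts to requiring a minimum view count before any per-article figure is released. A minimal sketch of that rule in Python; the function name and the threshold of 10 are illustrative assumptions, not anything proposed in the thread:]

```python
from collections import Counter

def publishable_counts(views, min_views=10):
    """Aggregate view events for one time window and release a count only
    for articles that gathered at least `min_views` hits, so that the lone
    reader of a rarely visited article is never exposed.

    `views` is an iterable of article titles seen during the window.
    If an article stays below the threshold, the fix is to widen the
    window (days -> weeks), exactly as suggested for nowiki above."""
    counts = Counter(views)
    return {title: n for title, n in counts.items() if n >= min_views}

window = ["Oslo", "Oslo", "Obscure_article"] + ["Bergen"] * 12
print(publishable_counts(window, min_views=10))  # only "Bergen" survives
```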
Re: [Wikitech-l] Unbreaking statistics
Had to run and missed a couple of important items. One is that you can calculate the likelihood that a link is missing. (It's similar to Google's PageRank.) If the likelihood turns out to be too small, you simply don't report anything. You can also skip reporting if you don't have any intervening search or search result. You can also analyze the link structure at the log server and skip logging of uninteresting items, but even more interesting, this can be done client side if sufficient information is embedded on the pages. For this, note that a high likelihood of a missing link implies few existing inbound links, and then they can simply be embedded on the page itself. Analyzing a dumb log would be very costly indeed.

John

John at Darkstar skrev:
> I tried to convince myself to stay out of this thread, but this was somewhat interesting. ;) I'm not quite sure this will work out for every case, but my rough idea is like this: Imagine a user trying to get an answer to some kind of problem. He searches with Google and lands on the most obvious article about (for example) breweries, even if he really wants to know something about a specific beer (Groelch or Mac or whatever). He can't find anything about it, so he makes an additional search (at the page, hopefully), gets a result list, reads through a lot of articles, and then finally finds what he is searching for. Then he leaves.
>
> Now, imagine a function that pushes newly visited pages onto a small page list, and a function popping that list each time the search result page is visited. The page list is stored in a cookie. This small page list is then reported to a special logging server by an AJAX request. It can't just piggyback, as the final page usually will not lead to a new request; the user simply leaves. Later, a lot of such page lists can be analyzed and compared to the known link structure. If a pair of pages consistently emerges in the log without having a parent-child relation, then you know a link is missing.
> Some guesstimates say that you need more than 100 page views before something like this can detect obvious missing links. For Norwegian (bokmål) Wikipedia that is about 2-3 months of statistics for half the article base, but note that the accumulated stats would be rectified by the page redirect information from the database, as a link is very seldom dropped; it is usually added. Well, something like that. I was wondering about running a test case, but given some previous discussion I concluded that I would not get a go on this. It is also possible to analyze the article relations where the user goes back to Google's result list, but that is somewhat more involved.
>
> John
>
> Platonides skrev:
>> John at Darkstar wrote:
>>> If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically there are two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal and the other is how to detect articles with missing linking between them.
>>> John
>> How are you planning to detect articles which 'should have links' between them?
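[Editorial note: the server-side half of the analysis John describes, comparing reported page lists against the known link structure, could be sketched as below. The link graph, session lists, and support threshold are made-up illustrations; "Grolsch" stands in for whatever specific beer article the reader was hunting for:]

```python
from collections import Counter
from itertools import combinations

# Known parent-child link structure: page -> set of pages it links to.
# This toy graph and the sessions below are invented for illustration.
links = {"Brewery": {"Beer"}, "Beer": set()}

def missing_link_candidates(sessions, links, min_support=2):
    """Count page pairs that co-occur in reported browsing sessions but
    have no link in either direction; pairs seen at least `min_support`
    times are flagged as probable missing links, so a single stray
    session never produces a report (cf. the likelihood cutoff above)."""
    pairs = Counter()
    for pages in sessions:
        for a, b in combinations(set(pages), 2):
            if b not in links.get(a, ()) and a not in links.get(b, ()):
                pairs[tuple(sorted((a, b)))] += 1
    return [pair for pair, n in pairs.items() if n >= min_support]

sessions = [["Brewery", "Grolsch"], ["Brewery", "Grolsch"], ["Brewery", "Beer"]]
print(missing_link_candidates(sessions, links))  # [('Brewery', 'Grolsch')]
```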
[Wikitech-l] Unbreaking statistics
Hello,

I see I've created quite a stir, but so far nothing really useful has popped up. :-( But I see that one from Neil:

> Yes, modifying the http://stats.grok.se/ systems looks like the way to go.

For me it doesn't really seem to be, since it appears to use an extremely dumbed-down version of the input, which only contains page views and [unreliable] byte counters. Most probably it would require large rewrites, and a magical new data source.

> What do people actually want to see from the traffic data? Do they want referrers, anonymized user trails, or what?

Are you old enough to remember stats.wikipedia.org? As far as I remember it originally ran webalizer, then something else, then nothing. If you check a webalizer stat you'll see what's in it. We are using, or we used until our nice fellow editors broke it, awstats, which basically provides the same with more caching. The most used and useful stats are page views (daily and hourly stats are pretty useful too), referrers, visitor domain and provider stats, OS and browser stats, screen resolution stats, bot activity stats, and visitor duration and depth, among probably others.

At a brief glance I could replicate the grok.se stats easily, since it seems to work from http://dammit.lt/wikistats/, but that is completely useless for anything beyond page hit counts.

Is there a possibility to write code which processes raw squid data? Who do I have to bribe? :-/

--
byte-byte,
grin
Re: [Wikitech-l] Unbreaking statistics
Peter Gervai wrote:
> Hello, I see I've created quite a stir, but so far nothing really useful popped up. :-( But I see that one from Neil:
>
>> Yes, modifying the http://stats.grok.se/ systems looks like the way to go.
>
> For me it doesn't really seem to be, since it seems to use an extremely dumbed-down version of the input, which only contains page views and [unreliable] byte counters. Most probably it would require large rewrites, and a magical new data source.
>
>> What do people actually want to see from the traffic data? Do they want referrers, anonymized user trails, or what?
>
> Are you old enough to remember stats.wikipedia.org? As far as I remember it originally ran webalizer, then something else, then nothing. If you check a webalizer stat you'll see what's in it. We are using, or we used until our nice fellow editors broke it, awstats, which basically provides the same with more caching. The most used and useful stats are page views (daily and hourly stats are pretty useful too), referrers, visitor domain and provider stats, OS and browser stats, screen resolution stats, bot activity stats, and visitor duration and depth, among probably others.
>
> At a brief glance I could replicate the grok.se stats easily, since it seems to work from http://dammit.lt/wikistats/, but that is completely useless for anything beyond page hit counts.
>
> Is there a possibility to write code which processes raw squid data? Who do I have to bribe? :-/

We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm

-- Alex (wikipedia:en:User:Mr.Z-man)
Re: [Wikitech-l] Unbreaking statistics
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarl...@wikimedia.org> wrote:
> Peter Gervai wrote:
>> Is there a possibility to write code which processes raw squid data? Who do I have to bribe? :-/
>
> Yes, it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers, with only anonymised data being passed on to the public.
>
> http://wikitech.wikimedia.org/view/Squid_logging
> http://wikitech.wikimedia.org/view/Squid_log_format

How much of that is really considered private? IP addresses obviously, but anything else? I'm wondering if a cheap and dirty solution (at least for the low-traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.

-Robert Rohde
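[Editorial note: the kind of stdin filter Tim and Robert describe could start out as below. The field positions are pure assumptions for illustration and would have to be checked against http://wikitech.wikimedia.org/view/Squid_log_format before any real use; as the rest of the thread makes clear, simply blanking the IP and user agent is not by itself sufficient anonymisation:]

```python
import sys

# Illustrative scrubber for a whitespace-separated squid log stream.
IP_FIELD = 4        # assumed 0-based position of the client IP
UA_FIELD = -1       # assumed position of the user agent (last field)

def scrub(line):
    """Blank the assumed-private fields of one log line, or return None
    for a malformed line so it is dropped rather than passed through."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) <= IP_FIELD:
        return None
    fields[IP_FIELD] = "-"      # IP addresses are obviously private
    fields[UA_FIELD] = "-"      # user agents can be uniquely identifying
    return " ".join(fields)

if __name__ == "__main__" and not sys.stdin.isatty():
    for raw in sys.stdin:       # accepts the log stream on stdin
        cleaned = scrub(raw)
        if cleaned is not None:
            print(cleaned)
```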
Re: [Wikitech-l] Unbreaking statistics
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <raro...@gmail.com> wrote:
> On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarl...@wikimedia.org> wrote:
>> Peter Gervai wrote:
>>> Is there a possibility to write code which processes raw squid data? Who do I have to bribe? :-/
>>
>> Yes, it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers, with only anonymised data being passed on to the public.
>>
>> http://wikitech.wikimedia.org/view/Squid_logging
>> http://wikitech.wikimedia.org/view/Squid_log_format
>
> How much of that is really considered private? IP addresses obviously, but anything else? I'm wondering if a cheap and dirty solution (at least for the low-traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.

There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom). Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).

On the flip side, aggregation can take private things (i.e. user agents, IP info, referrers) and convert them to non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, limit themselves to existing articles, and either limit themselves to really common paths or break paths into two- or three-node chains and skip releasing the least common of those.

Generally, when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.
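[Editorial note: the chain-splitting idea above can be sketched briefly. Breaking full click paths into two-node chains and suppressing the rare ones keeps common navigation patterns visible without releasing any one user's full trail; the threshold of 5 is an illustrative assumption:]

```python
from collections import Counter

def chain_counts(paths, min_count=5):
    """Break full click paths into two-node chains and release only the
    chains seen at least `min_count` times, discarding the least common
    ones so that a single user's unusual trail is never published."""
    chains = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):  # consecutive page pairs only
            chains[(a, b)] += 1
    return {chain: n for chain, n in chains.items() if n >= min_count}

paths = [["Main_Page", "Physics", "Quantum_mechanics"]] * 6 + \
        [["Main_Page", "Rare_article"]]
print(chain_counts(paths))  # the rare chain is suppressed
```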
Re: [Wikitech-l] Unbreaking statistics
Scrubbing log files to make the data private is hard work. You'd be impressed by what researchers have been able to do: taking purportedly anonymous data and using it to identify users en masse by correlating it with publicly available data from other sites such as Amazon, Facebook and Netflix. Make no mistake: if you don't do it carefully, you will become the target of, in the best of cases, an academic researcher who wants to prove that you don't understand statistics.

On Fri, Jun 5, 2009 at 8:13 PM, Robert Rohde <raro...@gmail.com> wrote:
> On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarl...@wikimedia.org> wrote:
>> Peter Gervai wrote:
>>> Is there a possibility to write code which processes raw squid data? Who do I have to bribe? :-/
>>
>> Yes, it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers, with only anonymised data being passed on to the public.
>>
>> http://wikitech.wikimedia.org/view/Squid_logging
>> http://wikitech.wikimedia.org/view/Squid_log_format
>
> How much of that is really considered private? IP addresses obviously, but anything else? I'm wondering if a cheap and dirty solution (at least for the low-traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.
>
> -Robert Rohde
Re: [Wikitech-l] Unbreaking statistics
On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxw...@gmail.com> wrote:
> On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <raro...@gmail.com> wrote:
>
> There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom). Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).
>
> On the flip side, aggregation can take private things (i.e. user agents, IP info, referrers) and convert them to non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
>
> Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, limit themselves to existing articles, and either limit themselves to really common paths or break paths into two- or three-node chains and skip releasing the least common of those.
>
> Generally, when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.

On reflection I agree with you, though I think the biggest problem would actually be a case you didn't mention.
If one provided timing and page view information, then one could almost certainly single out individual users by correlating the view timing with edit histories.

Okay, so no stripped logs. The next question becomes what is the right way to aggregate. We can A) reinvent the wheel, or B) adapt a pre-existing log analyzer in a mode that produces clean aggregate data. While I respect the work of Zachte and others, this might be a case where B is a better near-term solution. Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that started this mess), his AWStats config already suppresses IP info and aggregates everything into groups that make it very hard to identify anything personal. (There is still a small risk in allowing users to drill down to pages/requests that are almost never made, but perhaps that could be turned off.) AWStats has native support for Squid logs and is open source. This is not necessarily the only option, but I suspect that if we gave it some thought it would be possible to find an off-the-shelf tool that would be good enough to support many wikis and configurable enough to satisfy even the GMaxwells of the world ;-).

huwiki is actually the 20th largest wiki (by number of edits), so if it worked for them, then a tool like AWStats can probably work for most of the projects (which are not EN).

-Robert Rohde
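[Editorial note: turning off the risky drill-down sections Robert mentions is an awstats.conf matter. A minimal sketch, assuming AWStats-6.x-era directive names; verify against the awstats.model.conf shipped with your version before relying on it:]

```
# Suppress the report sections most likely to expose individual readers.
ShowHostsStats=0      # no per-host (IP / reverse-DNS) detail
ShowPagesStats=0      # no per-URL drill-down into rarely requested pages
```

Setting a Show*Stats directive to 0 hides that section entirely; the remaining aggregate reports (daily/hourly views, OS and browser breakdowns, and so on) are unaffected.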