Re: [Wikitech-l] Unbreaking statistics

2009-06-07 Thread John at Darkstar
Had to run and missed a couple of important items. One is that you can calculate the likelihood that a link is missing. (It's similar to Google's PageRank.) If the likelihood turns out to be too small you simply don't report anything. You also can skip reporting if you don't have any intervening searc

Re: [Wikitech-l] Unbreaking statistics

2009-06-07 Thread John at Darkstar
I tried to convince myself to stay out of this thread, but this was somewhat interesting. ;) I'm not quite sure this will work out for every case, but my rough idea is like this: Imagine a user trying to get an answer about some kind of problem. He searches with Google and dumps into the most o

Re: [Wikitech-l] Unbreaking statistics

2009-06-07 Thread Platonides
John at Darkstar wrote: > If someone wants to work on this I have some ideas to make something > useful out of this log, but I'm a bit short on time. Basically it's two > ideas that are really useful; one is to figure out which articles are > most interesting to show in a portal and the other is h

Re: [Wikitech-l] Unbreaking statistics

2009-06-07 Thread John at Darkstar
Some articles are always very rarely referenced, and those can be used to uniquely identify a machine. Then there are all those who do something that goes into public logs. The latter are very difficult to obfuscate, but the former can be solved by setting a time frame long enough that suffic

Re: [Wikitech-l] Unbreaking statistics

2009-06-06 Thread John at Darkstar
If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically it's two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal and the other is how to detect articles with missi

Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Robert Rohde
On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell wrote: > On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde wrote: > There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; > bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be > uniquely identifying). There is even private

Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Brian
Scrubbing log files to make the data private is hard work. You'd be impressed by what researchers have been able to do - taking purportedly anonymous data and using it to identify users en masse by correlating it with publicly available data from other sites such as Amazon, Facebook and Netflix. Ma

Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Gregory Maxwell
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde wrote: > On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling wrote: >> Peter Gervai wrote: >>> Is there a possibility to write code which processes raw squid data? >>> Who do I have to bribe? :-/ >> >> Yes, it's possible. You just need to write a script that ac

Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Robert Rohde
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling wrote: > Peter Gervai wrote: >> Is there a possibility to write code which processes raw squid data? >> Who do I have to bribe? :-/ > > Yes, it's possible. You just need to write a script that accepts a log > stream on stdin and builds the aggregate data

Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Alex
Peter Gervai wrote: > Hello, > > I see I've created quite a stir around, but so far nothing really > useful popped up. :-( > > But I see that one from Neil: >> Yes, modifying the http://stats.grok.se/ systems looks like the way to go. > > For me it doesn't really seem to be, since it seems to be

Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Tim Starling
Peter Gervai wrote: > Is there a possibility to write code which processes raw squid data? > Who do I have to bribe? :-/ Yes, it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run
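
To illustrate the kind of script described in the message above (one that accepts a log stream on stdin and builds aggregate data from it), here is a minimal sketch in Python. It is not the script used on the Wikimedia servers. The log layout is an assumption: lines are taken to be whitespace-separated, and URL_FIELD is a hypothetical position for the requested URL that would need adjusting to the real squid log format. Only per-URL counts are kept; every other field on the line is discarded.

```python
import sys
from collections import Counter

# Assumption: zero-based index of the requested URL in a
# whitespace-separated log line. Hypothetical; adjust to the real format.
URL_FIELD = 6


def aggregate(lines, url_field=URL_FIELD):
    """Count requests per URL, keeping nothing else from each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= url_field:
            continue  # skip short or malformed lines
        counts[fields[url_field]] += 1
    return counts


def main():
    """Read the log stream from stdin and print 'count<TAB>url' lines."""
    for url, hits in aggregate(sys.stdin).most_common():
        print(f"{hits}\t{url}")
```

Calling main() at the end of the file would turn this into a filter that a log stream can be piped into. Because only the counter survives each line, the output contains no IP addresses or user agents, though rarely requested URLs can still be identifying, as discussed elsewhere in the thread.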

[Wikitech-l] Unbreaking statistics

2009-06-05 Thread Peter Gervai
Hello, I see I've created quite a stir around, but so far nothing really useful popped up. :-( But I see that one from Neil: > Yes, modifying the http://stats.grok.se/ systems looks like the way to go. For me it doesn't really seem to be, since it seems to be using an extremely dumbed down versi