Hello everyone,

I've actually been parsing the raw data from 
[http://dammit.lt/wikistats/] into a MySQL database daily for over a 
year now. I also store statistics at hourly granularity, whereas 
[stats.grok.se] appears to store them at daily granularity.

I only do this for en.wiki, and it's certainly not efficient enough to 
open up for public use. However, I'd be willing to chat and share code 
with any interested developer. The strategy and schema are a bit 
awkward, but they work, and on average it takes ~2 hours of processing 
to store 24 hours' worth of statistics.
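
To give a rough idea, here is a minimal sketch of the kind of hourly 
ingest I'm describing. It assumes the standard pagecounts line format 
("project page_title count bytes"), the MySQLdb driver, and a 
hypothetical hourly_views(page_title, view_hour, views) table; it is 
illustrative only, not my actual code or schema:

    # Sketch: load one hourly pagecounts-*.gz file (en.wiki rows only)
    # into a hypothetical per-hour MySQL table.
    import gzip
    import MySQLdb

    def load_hour(path, hour_ts):
        rows = []
        with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
            for line in f:
                # Each line is: project page_title count bytes
                parts = line.split(' ')
                if len(parts) != 4 or parts[0] != 'en':
                    continue  # keep only English Wikipedia entries
                rows.append((parts[1], hour_ts, int(parts[2])))

        conn = MySQLdb.connect(host='localhost', user='stats',
                               passwd='...', db='pageviews', charset='utf8')
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO hourly_views (page_title, view_hour, views) "
            "VALUES (%s, %s, %s)", rows)
        conn.commit()
        conn.close()

A real pipeline also needs to handle things like title normalization 
and bulk loading (e.g. LOAD DATA INFILE), which the sketch glosses over.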

Thanks, -AW


On 08/12/2011 04:49 AM, Domas Mituzas wrote:
> Hi!
>
>> Currently, if you want data on, for example, every article on the English
>> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
>> Henrik's tool. At one per second, you're looking at over a month's worth of
>> continuous fetching. This is obviously not practical.
>
> Or you can download raw data.
>
>> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
>> to fruition, but it seems that has been indefinitely put on hold. (Is that
>> right?)
>
> That project was pulsing with naiveté, if it ever had to be applied to the 
> wide scope of all projects ;-)
>
>> Is it worth a Toolserver user's time to try to create a database of
>> per-project, per-page page view statistics?
>
> Creating such a database is easy; making it efficient is a bit different :-)
>
>> And, of course, it wouldn't be a bad idea if Domas' first-pass 
>> implementation was improved on Wikimedia's side, regardless.
>
> My implementation is for obtaining raw data from our squid tier; what is 
> wrong with it?
> Generally I had ideas of making a query-able data source - it isn't impossible 
> given a decent mix of data structures ;-)
>
>> Thoughts and comments welcome on this. There's a lot of desire to have a
>> usable system.
>
> Sure, it's interesting to hear what people think could be useful with the 
> dataset - we may facilitate it.
>
>> But short of believing that in December 2010 "User Datagram Protocol" was 
>> more interesting to people than Julian Assange, you would need some other 
>> data source to make good statistics.
>
> Yeah, "lies, damned lies, and statistics". We need better statistics (adjusted 
> for Wikipedian geekiness) than the full page-view sample, because you don't 
> believe that general-purpose wiki articles people can use in their work can be 
> more popular than some random guy on the internet and trivia about him.
> Dracula is also more popular than Julian Assange, and so is Jenna Jameson ;-)
>
>> http://stats.grok.se/de/201009/Ngai.cc would be another example.
>
>
> Unfortunately, every time you add the ability to spam something, people will 
> spam. There's also unintentional crap that ends up in HTTP requests because of 
> broken clients. It is easy to filter that out in postprocessing, if you want, 
> by applying an article-exists Bloom filter ;-)
>
>> If the stats.grok.se data actually captures nearly all requests, then I am 
>> not sure you realize how low the figures are.
>
> Low they are; Wikipedia's content is all about a very long tail of data, 
> besides some heavily accessed head. Just graph the top 100 or top 1,000 and 
> you will see the shape of the curve:
> https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWVlQzRXZuU2podzR2YzdCMk04MlE&hl=en_US&gid=1
>
>> As someone with most of the skills and resources (with the exception of 
>> time, possibly) to create a page view stats database, reading something like 
>> this makes me think...
>
> Wow.
>
>> Yes, the data is susceptible to manipulation, both intentional and 
>> unintentional.
>
> I wonder how someone with most of the skills and resources wants to solve this 
> problem (besides the aforementioned article-exists filter, which could reduce 
> the dataset quite a lot ;)
>
>> ... you can begin doing real analysis work. Currently, this really isn't 
>> possible, and that's a Bad Thing.
>
> Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. 
> ;-) Statistics much?
>
>> The main bottleneck has been that, like MZMcBride mentions, an underlying
>> database of page view data is unavailable.
>
> The underlying database is available, just not in an easily queryable format. 
> There's a distinction there, unless you all imagine a database as something 
> you send SQL to and it gives you data back. Sorted files are databases too ;-)
> Anyway, I don't say that the project is impossible or unnecessary, but there 
> are lots of tradeoffs to be made - what kind of real-time querying workloads 
> are to be expected, what kind of pre-filtering people expect, etc.
>
> Of course, we could always use OWA.
>
> Domas

-- 
Andrew G. West, Doctoral Student
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email:   west...@cis.upenn.edu
Website: http://www.cis.upenn.edu/~westand

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
