I got a couple of private replies to this thread, so I figured I would just answer them publicly for the benefit of the list:

(1) Do I only parse/store English Wikipedia?

Yes; for scalability reasons and because English Wikipedia is my research focus. I'd consider opening my database to users with specific academic uses, but it's probably not the most efficient way to do a lot of computations (see below). Plus, I transfer the older tables to offline drives, so I probably only have ~6 months of the most recent data online.


(2) Can you provide some insights into your parsing?

First, I began collecting this data for the purposes of:

http://repository.upenn.edu/cis_papers/470/

Where I knew the revision IDs of damaging revisions and wanted to reason about how many people saw that article/RID in its damaged state. This involved storing data on EVERY article at the finest granularity available (hourly) and then assuming views are distributed uniformly within each hour.
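For anyone curious how that estimate works, here is a minimal Java sketch of the idea (illustrative only; this is not the code in the ZIP below, and the class/method names are made up). Given a page's hourly hit counts and the window during which a damaging revision was live, it pro-rates each hour's hits by the fraction of that hour the damage window overlaps:

import java.util.Map;

public class DamageViewEstimator {

    private static final long HOUR_MS = 3600L * 1000L;

    /**
     * @param hourlyHits  hour-start (epoch ms, truncated to the hour) -> page views in that hour
     * @param damageStart time the damaging revision went live (epoch ms)
     * @param damageEnd   time the damage was reverted (epoch ms)
     * @return estimated number of views of the damaged version
     */
    public static double estimateDamagedViews(Map<Long, Integer> hourlyHits,
                                               long damageStart, long damageEnd) {
        double total = 0.0;
        for (Map.Entry<Long, Integer> e : hourlyHits.entrySet()) {
            long hourStart = e.getKey();
            long hourEnd = hourStart + HOUR_MS;

            // Overlap (in ms) between the damage window and this hour
            long overlap = Math.min(damageEnd, hourEnd) - Math.max(damageStart, hourStart);
            if (overlap <= 0)
                continue;

            // Uniform intra-hour assumption: credit a pro-rated share of the hour's hits
            total += e.getValue() * ((double) overlap / HOUR_MS);
        }
        return total;
    }
}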

See the URL below for my code (with the SQL server credentials blanked out) that does this work. A nightly [cron] task fires the Java code, which downloads an entire day's worth of files (24) and parses them. These files contain data for ALL WMF projects and languages, but I use a simple string match to handle only the "en.wp" lines. Each column in the database represents a single day and contains a binary object wrapping (hour, hits) pairs; each table contains 10 consecutive days of data. Much of this design was chosen to accommodate the extremely long tail and sparseness of the view distribution; filling a DB with billions of NULL values didn't prove very efficient in my first attempts. I think I use ~1TB per year for the English Wikipedia data.
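To make the per-file filtering step concrete, here is a rough Java sketch (again illustrative, not the actual parser in the ZIP; the file path and class names are made up). Each pagecounts-raw line has the form "project page_title view_count bytes_transferred", and the English Wikipedia project code in those files is "en", so a cheap prefix match keeps only those lines:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

public class PagecountsHourParser {

    /** Parse one hourly pagecounts .gz file into (title -> hits), en.wp lines only. */
    public static Map<String, Integer> parseHour(String gzPath) throws IOException {
        Map<String, Integer> hits = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gzPath)), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Simple string match: only English Wikipedia ("en") lines
                if (!line.startsWith("en "))
                    continue;
                String[] cols = line.split(" ");
                if (cols.length < 3)
                    continue;
                try {
                    hits.merge(cols[1], Integer.parseInt(cols[2]), Integer::sum);
                } catch (NumberFormatException ignore) {
                    // skip malformed count fields
                }
            }
        }
        return hits;
    }
}

The (title, hits) pairs from each hourly file are what then get packed into the per-day binary objects described above; storing only the hours in which a page was actually viewed is what keeps the sparse, long-tailed distribution from flooding the DB with NULLs.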

If anyone ends up using this code, I would appreciate a cite/acknowledgement of my original work above. However, I imagine most will want to do a bit more aggregation, and hopefully this can provide a baseline for doing that.

Thanks, -AW


CODE LINK:
http://www.cis.upenn.edu/~westand/docs/wp_stats.zip



On 12/29/2012 11:06 PM, Andrew G. West wrote:
The WMF aggregates them as (page,views) pairs on an hourly basis:

http://dumps.wikimedia.org/other/pagecounts-raw/

I've been parsing these and storing them in a query-able DB format (for
en.wp exclusively; though the files are available for all projects I
think) for about two years. If you want to maintain such a fine
granularity, it can quickly become a terabyte-scale task that eats up a
lot of processing time.

If you're looking for coarser-granularity reports (like top views for
the day, week, or month), a lot of efficient aggregation can be done.

See also: http://en.wikipedia.org/wiki/Wikipedia:5000

Thanks, -AW


On 12/28/2012 07:28 PM, John Vandenberg wrote:
There is a steady stream of blogs and 'news' about these lists

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539


How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?



--
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email:   west...@cis.upenn.edu
Website: http://www.andrew-g-west.com

