I got a couple of private replies to this thread, so I figured I would just answer them publicly for the benefit of the list:

(1) Do I only parse/store English Wikipedia?

Yes; for scalability reasons and because English Wikipedia is my research focus. I'd consider opening my database to users with specific academic uses, but it's probably not the most efficient way to do a lot of computations (see below). Plus, I transfer the older tables to offline drives, so I probably only have ~6 months of the most recent data online.


(2) Can you provide some insights into your parsing?

First, I began collecting this data for the purposes of:

http://repository.upenn.edu/cis_papers/470/

Where I knew the revision IDs of damaging revisions and wanted to reason about how many people saw that article/RID in its damaged state. This involved storing data on EVERY article at the finest granularity available (hourly) and then assuming views are distributed uniformly within each hour.
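For anyone curious how that estimate works, here is a minimal Java sketch of the idea (illustrative only; this is not the code in the ZIP below, and the class/method names are made up). Given a page's hourly hit counts and the window during which a damaging revision was live, it pro-rates each hour's hits by the fraction of that hour the damage window overlaps:

import java.util.Map;

public class DamageViewEstimator {

    private static final long HOUR_MS = 3600L * 1000L;

    /**
     * @param hourlyHits  hour-start (epoch ms, truncated to the hour) -> page views in that hour
     * @param damageStart time the damaging revision went live (epoch ms)
     * @param damageEnd   time the damage was reverted (epoch ms)
     * @return estimated number of views of the damaged version
     */
    public static double estimateDamagedViews(Map<Long, Integer> hourlyHits,
                                               long damageStart, long damageEnd) {
        double total = 0.0;
        for (Map.Entry<Long, Integer> e : hourlyHits.entrySet()) {
            long hourStart = e.getKey();
            long hourEnd = hourStart + HOUR_MS;

            // Overlap (in ms) between the damage window and this hour
            long overlap = Math.min(damageEnd, hourEnd) - Math.max(damageStart, hourStart);
            if (overlap <= 0)
                continue;

            // Uniform intra-hour assumption: credit a pro-rated share of the hour's hits
            total += e.getValue() * ((double) overlap / HOUR_MS);
        }
        return total;
    }
}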

See the URL below for my code (with the SQL server credentials blanked out) that does this work. A nightly [cron] task fires the Java code, which downloads an entire day's worth of files (24) and parses them. These files contain data for ALL WMF projects and languages, but I use a simple string match to handle only the "en.wp" lines. Each column in the database represents a single day and contains a binary object wrapping (hour, hits) pairs; each table contains 10 consecutive days of data. Much of this design was chosen to accommodate the extremely long tail and sparseness of the view distribution; filling a DB with billions of NULL values didn't prove very efficient in my first attempts. I think I use ~1TB per year for the English Wikipedia data.
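To make the per-file filtering step concrete, here is a rough Java sketch (again illustrative, not the actual parser in the ZIP; the file path and class names are made up). Each pagecounts-raw line has the form "project page_title view_count bytes_transferred", and the English Wikipedia project code in those files is "en", so a cheap prefix match keeps only those lines:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

public class PagecountsHourParser {

    /** Parse one hourly pagecounts .gz file into (title -> hits), en.wp lines only. */
    public static Map<String, Integer> parseHour(String gzPath) throws IOException {
        Map<String, Integer> hits = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gzPath)), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Simple string match: only English Wikipedia ("en") lines
                if (!line.startsWith("en "))
                    continue;
                String[] cols = line.split(" ");
                if (cols.length < 3)
                    continue;
                try {
                    hits.merge(cols[1], Integer.parseInt(cols[2]), Integer::sum);
                } catch (NumberFormatException ignore) {
                    // skip malformed count fields
                }
            }
        }
        return hits;
    }
}

The (title, hits) pairs from each hourly file are what then get packed into the per-day binary objects described above; storing only the hours in which a page was actually viewed is what keeps the sparse, long-tailed distribution from flooding the DB with NULLs.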

If anyone ends up using this code, I would appreciate a cite/acknowledgement of my original work above. However, I imagine most will want to do a bit more aggregation, and hopefully this can provide a baseline for doing that.

Thanks, -AW


CODE LINK:
http://www.cis.upenn.edu/~westand/docs/wp_stats.zip



On 12/29/2012 11:06 PM, Andrew G. West wrote:
The WMF aggregates them as (page,views) pairs on an hourly basis:

http://dumps.wikimedia.org/other/pagecounts-raw/

I've been parsing these and storing them in a query-able DB format (for
en.wp exclusively; though the files are available for all projects I
think) for about two years. If you want to maintain such a fine
granularity, it can quickly become a terabyte-scale task that eats up a
lot of processing time.

If you're looking for coarser-granularity reports (like top views for
the day, week, or month), a lot of efficient aggregation can be done.

See also: http://en.wikipedia.org/wiki/Wikipedia:5000

Thanks, -AW


On 12/28/2012 07:28 PM, John Vandenberg wrote:
There is a steady stream of blogs and 'news' about these lists

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539


How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?



--
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email:   west...@cis.upenn.edu
Website: http://www.andrew-g-west.com

