[Wiki-research-l] "Big data" benefits and limitations (relevance: WMF editor engagement, fundraising, and HR practices)

2012-12-29 Thread ENWP Pine



I'm sending this to Wikimedia-l, Wikitech-l, and Research-l in case other 
people in the Wikimedia movement or staff are interested in "big data" as it 
relates to Wikimedia. I hope that those who are interested in discussions about 
WMF editor engagement efforts, WMF fundraising, or WMF HR practices will also 
find that this email interests them. Feel free to skip straight to the links in 
the latter portion of this email if you're already familiar with "big data" and 
its analysis and if you just want to see what other people are writing about 
the subject.

* Introductory comments / my personal opinion

"Big data" refers to large quantities of information that are so large that 
they are difficult to analyze and may not be related internally in an obvious 
way. See https://en.wikipedia.org/wiki/Big_data

I think that most of us would agree that moving much of an organization's 
information into "the Cloud", and/or directing people to analyze massive 
quantities of information, will not automatically result in better, or even 
good, decisions based on that information. Also, I think that most of us would 
agree that bigger and/or more accessible quantities of data do not 
necessarily imply that the data are more accurate or more relevant for a 
particular purpose. Another concern is the possibility of unwelcome intrusions 
into sensitive information, including the possibility of data breaches; imagine 
the possible consequences if a hacker broke into supposedly secure databases 
held by Facebook or the Securities and Exchange Commission.

We have an enormous quantity of data on Wikimedia projects, and many ways that 
we can examine those data. As this Dilbert strip points out, context is 
important, and looking at statistics devoid of their larger contexts can be 
problematic. http://dilbert.com/strips/comic/1993-02-07/

Since data analysis is also something that Wikipedia does in the areas I 
mentioned previously, I'm passing along a few links for those who may be 
interested in the benefits and limitations of big data.

* Links: 

From the Harvard Business Review
http://hbr.org/2012/04/good-data-wont-guarantee-good-decisions/ar/1


From the New York Times
https://www.nytimes.com/2012/12/30/technology/big-data-is-great-but-dont-forget-intuition.html
and
https://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html


From the Wall Street Journal. This may be especially interesting to those 
who are participating in the discussions on Wikimedia-l regarding how 
Wikimedia selects, pays, and manages its staff.
http://online.wsj.com/article/SB1872396390443890304578006252019616768.html


And from English Wikipedia (:
https://en.wikipedia.org/wiki/Big_data
and
https://en.wikipedia.org/wiki/Data_mining
and
https://en.wikipedia.org/wiki/Business_intelligence


Cheers,

Pine

  ___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] 2012 top pageview list

2012-12-29 Thread Andrew G. West

The WMF aggregates them as (page,views) pairs on an hourly basis:

http://dumps.wikimedia.org/other/pagecounts-raw/

I've been parsing these and storing them in a queryable DB format (for 
en.wp exclusively, though the files are available for all projects, I 
think) for about two years. If you want to maintain such a fine 
granularity, it can quickly become a terabyte-scale task that eats up a 
lot of processing time.
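For anyone curious what parsing one of these hourly files involves, here 
is a minimal Python sketch. It assumes the common pagecounts-raw line 
layout of four space-separated fields (project code, percent-encoded page 
title, view count, bytes transferred); the example line and its values 
are illustrative, not taken from a real dump:

```python
from urllib.parse import unquote

def parse_pagecounts_line(line):
    """Parse one line of an hourly pagecounts-raw file.

    Assumed layout: 'project page_title view_count bytes_transferred',
    with the title percent-encoded. Example values below are made up.
    """
    project, title, views, size = line.rstrip("\n").split(" ")
    return project, unquote(title), int(views), int(size)

# Illustrative line, not from a real dump:
rec = parse_pagecounts_line("en Main_Page 242332 4737756101")
```

Loading every hour of every project at this granularity is what makes 
the task terabyte-scale; filtering to one project up front keeps it 
manageable.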


If you're looking for coarser-granularity reports (like top views per 
day, week, or month), a lot of efficient aggregation can be done.
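The roll-up itself is simple: sum the hourly counts per title and keep 
the top N. A toy sketch of that idea in Python (the record shape and 
names here are illustrative, not part of the dump format):

```python
from collections import Counter

def top_pages(hourly_records, n=10, project="en"):
    """Sum hourly (project, title, views) records for one project
    and return the n most-viewed titles. A toy stand-in for the
    day/week/month roll-ups described above."""
    totals = Counter()
    for proj, title, views in hourly_records:
        if proj == project:
            totals[title] += views
    return totals.most_common(n)

# Made-up sample records spanning several hours:
sample = [("en", "A", 5), ("en", "B", 3), ("en", "A", 2), ("de", "A", 9)]
top = top_pages(sample, n=2)
# top == [("A", 7), ("B", 3)]
```

Because only running totals are kept, the coarse report needs far less 
storage than the raw hourly data.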


See also: http://en.wikipedia.org/wiki/Wikipedia:5000

Thanks, -AW


On 12/28/2012 07:28 PM, John Vandenberg wrote:

There is a steady stream of blogs and 'news' about these lists

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539

How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?



--
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Website: http://www.andrew-g-west.com
