I agree that the existing ratings of articles are not very useful. Many
articles are unassessed. Others were assessed within days of the article
being created and assessed as Stub/Start and have not been revisited since
despite considerable further development of the article. Some people work
very hard to get an article to GA (or whatever) and explicitly request
assessment. I would think most "high quality" articles have had people
actively working to achieve a high rating and explicitly requesting
assessment. I don't know how many articles get to high levels of quality
just through the uncoordinated contributions of the crowd, but I bet it's
hardly any. Indeed, I suspect if you train on high quality articles,
you'll learn that having a small number of editors doing a lot of work in
its recent history is the best indicator of quality.


If you are going to train your heuristics, I'd suggest collecting articles
which have had little/no further development since their last rating so that
you know the assessments have some chance of being accurate.
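A rough sketch of that filter, in Python. The Article record, its field names and the two thresholds are all made up for illustration; how you obtain the size and edit counts (dumps, API) is left open.

```python
from dataclasses import dataclass

@dataclass
class Article:
    # Hypothetical record; the field names are assumptions, not a real schema.
    title: str
    rating: str              # e.g. "Stub", "Start", "GA"
    bytes_at_rating: int     # article size when it was last assessed
    bytes_now: int           # current article size
    edits_since_rating: int  # revisions made after the last assessment

def rating_is_fresh(a, max_edits=10, max_growth=0.10):
    """True if the article has changed little since its last assessment.

    The thresholds are arbitrary starting points, not tuned values.
    """
    if a.bytes_at_rating <= 0:
        return False
    growth = abs(a.bytes_now - a.bytes_at_rating) / a.bytes_at_rating
    return a.edits_since_rating <= max_edits and growth <= max_growth

def training_candidates(articles):
    """Keep only articles whose rating probably still describes them."""
    return [a for a in articles if rating_is_fresh(a)]
```

Articles that fail the check are not necessarily badly rated; they are just less safe to treat as ground truth.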


I doubt there is any single metric that is a predictor of quality but I
think citations are probably a good proxy. Of course, there are probably
counter-examples but generally an article with lots of citations suggests a
sincere effort at a better-quality article. Of course if any tool is
deployed to automatically assess article quality, then we can expect people
to "game" it, but at this stage one would assume that people are not
actively gaming the rating system while it has a manual assessment process.
However, people probably are "gaming" NPOV in specific articles by adding
lots of citations that support their views; I doubt any metric will allow
you to easily spot this kind of behaviour without doing some kind of
analysis of the sources and interrelationships between them.


But, as Laura comments, there may be a lot of citations clustered in a small
part of the article, but few elsewhere. Also, the number of sources is
relevant - I can cite the same source 1000 times in one article and that's
probably not quality either. I'd be inclined to reduce the influence of both
multiple citations at the same point of the text (or very close in the text)
as well as repeated citations to the same source. It's not that either is
bad but there should be some limit to how much they influence any





From: wiki-research-l-boun...@lists.wikimedia.org
[mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of
Sent: Sunday, 15 December 2013 6:54 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] Existitng Research on Article


Re other dimensions or heuristics:

Very few articles are rated as Featured, and not that many as Good. If you
are going to use that rating system I'd suggest also including the lower
levels, and indeed whether an article has been assessed and typically how
long it takes for a new article to be assessed. Uganda, for example, has 1
Featured article, 3 Good Articles and nearly 400 unassessed on the English
language Wikipedia.

For a crowd sourced project like Wikipedia the size of the crowd is crucial
and varies hugely per article. So I'd suggest counting the number of
different editors other than bots who have contributed to the article. It
might also be worth getting some measure of local internet speed or usage
level as context. There was a big upgrade to East Africa's Internet
connection a few years ago. For Wikipedia the crucial metric is the size of
the Internet-comfortable population with some free time and ready access to
PCs. I'm not sure we've yet measured how long it takes from people getting
internet access to their being sufficiently confident to edit Wikipedia
articles. I suspect the answer is age related, but it would be worth
checking the various editor surveys to see if this has been collected yet.
My understanding is that in much of Africa many people are bypassing the
whole PC thing and going straight to smartphones, and of course for
mobile phone users Wikipedia is essentially a queryable medium rather than
an interactive editable one.
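Counting the distinct non-bot editors mentioned above could look roughly like this in Python. It assumes you already have the list of usernames from the article's revision history. The "name ends in bot" test is a crude assumption of mine (it would wrongly drop a human called "Abbot"); a list of flagged bot accounts passed as known_bots would be more reliable.

```python
def count_human_editors(revision_users, known_bots=frozenset()):
    """Number of distinct editors of an article, excluding likely bots.

    revision_users: one username per revision, repeats allowed.
    known_bots: optional set of usernames known to be bots.
    """
    editors = set()
    for user in revision_users:
        # Crude heuristic (an assumption): treat names ending in "bot" as bots.
        if user in known_bots or user.lower().endswith("bot"):
            continue
        editors.add(user)
    return len(editors)
```

For example, a history of Alice, Bob, ClueBot, Alice, SineBot counts as two human editors.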

Whether or not a Wikipedia article has references is a quality dimension you
might want to look at. At least on EN it is widely assumed to be a measure
of quality, though I don't recall ever seeing a study of the relative
accuracy of cited and uncited Wikipedia information.

Thankfully the Article Feedback tool has been almost eradicated from the
English language Wikipedia; I don't know if it is still on French or
Swahili. I don't see it as being connected to the quality of the article,
though it should be an interesting measure of how loved or hated a given
celebrity was during the time the tool was deployed. So I'd suggest ignoring
it in your research on article quality.

Hope that helps



On 15 December 2013 06:15, Klein,Max <kle...@oclc.org> wrote:

Wiki Research Junkies,

I am investigating the comparative quality of articles about Cote d'Ivoire
and Uganda versus other countries. I want to answer the question: what
makes a high-quality article? Can anyone point me to any existing research
on heuristics of Article Quality? That is, determining an article's quality
by the wikitext properties, without human rating? I would also consider
using data from the Article Feedback Tools, if there were dumps available
for each Article in English, French, and Swahili Wikipedias. This is all
the raw data I can seem to find: http://toolserver.org/~dartar/aft5/dumps/

The heuristic technique that I am currently using is training a naive
Bayesian filter based on:

*   Per section:

    o   Text length in each section

    o   Infoboxes in each section

        *   Filled parameters in each infobox

    o   Images in each section

*   Good Article, Featured Article?

*   Then normalize on page views per population / speakers of
    native language
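As a rough illustration of the per-section features in the list above, here is a Python sketch that splits wikitext on "== Heading ==" lines and counts text length, infoboxes and images in each section. It covers only feature extraction, not the Bayesian filter or the normalization step, and it does not count filled infobox parameters. The function name and the regular expressions are assumptions for illustration; real wikitext has nested templates and other cases that a proper parser would handle better.

```python
import re

# Matches heading lines such as "== History ==" or "=== Early years ===".
HEADING = re.compile(r"^==+\s*(.*?)\s*==+[ \t]*$", re.MULTILINE)
INFOBOX = re.compile(r"\{\{\s*Infobox", re.IGNORECASE)
IMAGE = re.compile(r"\[\[\s*(?:File|Image)\s*:", re.IGNORECASE)

def section_features(wikitext):
    """Return one feature dict per section, the lead section first."""
    # With one capturing group, split() yields
    # [lead, title1, body1, title2, body2, ...].
    parts = HEADING.split(wikitext)
    titles = ["(lead)"] + parts[1::2]
    bodies = [parts[0]] + parts[2::2]
    features = []
    for title, body in zip(titles, bodies):
        features.append({
            "section": title,
            "text_length": len(body.strip()),
            "infoboxes": len(INFOBOX.findall(body)),
            "images": len(IMAGE.findall(body)),
        })
    return features
```

Each dict could then be turned into whatever feature vector the classifier expects.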

Can you also think of any other dimensions or heuristics to programmatically



Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

Wiki-research-l mailing list

