I think you are reading my comments too negatively. I’m not saying to ignore 
pageviews or incoming links. I’m saying that a naïve look at their stats may 
not be as useful as some of the variations I mention. I think it is worth 
looking at pageviews relative to those articles in the same WikiProject. I 
think it is worth looking at inbound links but considering two groups: those 
coming from the same WikiProject(s) and those from other WikiProjects. I think the 
position of the incoming links within their source articles is also 
significant, either first sentence, first para, whole of lede, or 
absolute/relative position of the link in the article (e.g. 2000 bytes from 
start, or 40% from start).
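
As a rough illustration of the absolute/relative link position idea above, here is a minimal Python sketch. It assumes you already have the raw wikitext of the source article as a string; the function name `link_positions` is hypothetical, it measures character offsets rather than bytes, and it ignores redirects and first-letter case differences.

```python
import re


def link_positions(wikitext: str, target: str) -> list[dict]:
    """Locate every [[target]] or [[target|label]] link in the wikitext.

    Returns the absolute position (character offset from the start) and the
    relative position (offset divided by the total length) of each link.
    """
    pattern = re.compile(
        r"\[\[\s*" + re.escape(target) + r"\s*(?:\|[^\]]*)?\]\]"
    )
    total = len(wikitext)
    return [
        {"offset": m.start(), "relative": m.start() / total}
        for m in pattern.finditer(wikitext)
    ]


# Toy wikitext (invented): 40 filler characters, one link, 42 more characters.
sample = "A" * 40 + "[[Pasteurization]]" + "B" * 42
positions = link_positions(sample, "Pasteurization")
```

A lede or first-sentence test could be layered on top by comparing the offset with the position of the first section heading or first full stop, but that needs more careful wikitext parsing than this sketch attempts.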

 

The big difference between machine-assessment of article quality and article 
importance is that quality is a metric on the article but importance is a 
metric on the topic. Also, my informal observation is that article quality does 
improve and degrade over time and hence is much more dynamic than topic 
importance, which seems to me to be much more stable. So I think there is less 
scope for dramatically improving the situation by being able to determine topic 
importance than the benefits likely to be achieved from automated quality 
assessment, but there may be benefit if there are heuristics to spot the 
relatively few articles which do need their importance re-assessed due to “current 
events”. In which case “editor activity” may be a metric, particularly “editor 
activity” on the lede para or other more critical areas of the article.

 

I am not too worried about the 22nd century. I think we should look more at the 
next decade. Who would have predicted the demise of Usenet? It seemed pretty 
sexy at the time, etc. Wikipedia, like many things, will pass. It’s not to say 
it will pass into oblivion but it may morph into something very different to 
what it is today. Being CC-BY-SA improves the chances that any successor can 
build on it, but maybe we should put into WMF’s constitution, “if WMF shuts 
down, we release the contents of the projects as CC0” (to increase the 
likelihood that the content has a future). Having had to shut down a number of 
research institutes when the funding ran out, I know the utter stupidity that occurs 
when they retain a skeleton staff to “sell off all our valuable IP”, which 
every closing-down institution seems to want to do, and the result is that the 
IP gets wasted because it isn’t sold or it’s sold to one of those companies who 
buy IP for tuppence on the off-chance they can potentially engage in patent 
litigation (or other IP litigation) downstream. We waste so much IP with this 
kind of “make a buck” thinking. <end of rant>

 

Kerry

 

From: Jane Darnell [mailto:jane...@gmail.com] 
Sent: Wednesday, 26 April 2017 5:51 PM
To: kerry.raym...@gmail.com; Research into Wikimedia content and communities 
<wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification of 
article importance

 

Yes I totally agree that "importance is a relative metric rather than 
absolute." I also agree that incoming links and pageviews are not accurate 
measurements of "importance" for all of the reasons you mention. However, we 
are still a project that is actively exploring the universe of knowledge, and 
leaning heavily on academia and other established sources we must "boldly go 
where no man has gone before" (and please feel free to insert "white, 
euro-centric" before the man part). So do you have any suggestions what we 
could measure going forward that would cough up some interesting stats to 
monitor? Pagewatching is useful, but problematic because these are only 
assigned at page-creation, while some marginal editor interest might be 
expanded to whole categories (speaking as someone who has thousands of pages 
watchlisted on multiple projects). I like your thoughts about looking for key 
articles such as those used "as the "main" article for a 
category or as the title of a navbox". I am looking for similar usages of 
paintings as a way to find popular painters or paintings rather than just those 
paintings which have articles written about them (which are often written for 
totally random reasons such as theft/sale/wikiproject).

 

On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <kerry.raym...@gmail.com> wrote:

Just a few musings on the issue of Importance and how to research it ...

I agree it is intuitive that importance is likely to be linked to pageviews and 
inbound links but, as the preliminary experiment showed, it's probably not that 
simple.

Pageviews tell us something about importance to readers of Wikipedia, while 
inbound links tell us something about importance to writers of Wikipedia, and 
I suspect that writers are not a proxy for readers as the editor surveys 
suggest that Wikipedia writers are not typical of broader society on at least 
two variables: gender and level of education (might be others, I can't 
remember).

But I think importance is a relative metric rather than an absolute one. I think 
that, by taking the mean value of importance across a number of WikiProjects, the 
preliminary experiment may have lost something because it tried (through 
averaging) to look at importance "generally". I would suspect conducting an 
experiment considering only the importance ratings wrt a single WikiProject 
would be more likely to show correlation with pageviews (wrt other articles 
in that same WikiProject) and inbound links. And I think there are two kinds of 
inbound links to be considered, those coming from other articles within the 
same WikiProject and those coming from outside that WikiProject. I suspect 
different insights will be obtained by looking at both types of inbound links 
separately rather than treating them as an aggregate. I note also that 
WikiProjects are not entirely independent of one another but have relationships 
between them. For example, WikiProject Australian Roads describes itself as 
an "intersection" (ha ha!) of WikiProject Highways and WikiProject Australia, 
so I expect that we would find greater correlation in importance between 
related WikiProjects than between unrelated WikiProjects.
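
To make the per-WikiProject view above concrete, here is a minimal Python sketch. It assumes you already hold three pieces of data for one WikiProject: the set of article titles tagged by the project, a list of source articles linking to the article of interest, and a pageview count per article. The function names and the toy data are hypothetical, not part of any existing tool.

```python
def split_inbound_links(article, inbound_sources, project_members):
    """Count inbound links from inside and outside one WikiProject."""
    same = [s for s in inbound_sources if s in project_members and s != article]
    other = [s for s in inbound_sources if s not in project_members]
    return len(same), len(other)


def pageview_rank_in_project(article, pageviews, project_members):
    """Fraction of the project's other articles with strictly fewer pageviews.

    0.0 means the least viewed article in the project, 1.0 the most viewed.
    """
    peers = [a for a in project_members if a != article and a in pageviews]
    if not peers:
        return 0.0
    below = sum(1 for a in peers if pageviews[a] < pageviews[article])
    return below / len(peers)


# Toy data (invented): articles A to D belong to the project, X and Y do not.
members = {"A", "B", "C", "D"}
views = {"A": 100, "B": 50, "C": 10, "D": 200, "X": 9999}
same_count, other_count = split_inbound_links("A", ["B", "X", "Y", "C"], members)
rank = pageview_rank_in_project("A", views, members)
```

Note that article X, with far more pageviews than anything in the project, has no effect on the rank of A, which is the point of comparing only within the same WikiProject.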

When thinking about readers and pageviews, I think we have to ask ourselves 
whether there is a difference between popularity and importance, or whether 
popularity *is* importance. I sense that, as a group of educated people, those of us 
reading this research mailing list probably do think there is a difference. 
Certainly if there is no difference, then this research can stop now -- just 
judge importance by pageviews. Let's assume a difference then. When looking at 
pageviews of an article, they are not always consistent over time. Here are the 
pageviews for Drottninggatan

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan

Why so interesting on 8 April? A terrorist attack occurred there. This spike in 
pageviews occurs all the time when some topic is in the news (even peripherally 
as in this case where it is not the article about the terrorist attack but 
about the street in which it occurred). Did the street become more "important"? 
I think it became more interesting but not more important. So I think we do 
have to be careful to understand that pageviews probably reflect interest 
rather than importance. I note that The Chainsmokers (a music group with a 
number of songs in the current USA music charts) gets many more Wikipedia 
article pageviews than the Wikipedia article on Pasteurization but The 
Chainsmokers are not rated as being of high importance by the relevant 
WikiProjects while Pasteurization is very important in WikiProject Food and 
Drink. Since pasteurisation prevents a lot of deaths, I think we might agree 
that in the real world pasteurisation is more important than a music group 
regardless of what pageviews tell us.

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization

Of course it matters for Wikipedia's success that our *popular* articles are 
of high quality, but I think we have to be cautious about pageviews being a proxy 
for importance.
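
One way to keep news-driven spikes from dominating a pageview measure is to compare the peak day with a median baseline. The Python sketch below assumes a plain list of daily view counts for one article. The function name `spike_ratio` and the numbers are invented for illustration and are not real data for any article.

```python
from statistics import median


def spike_ratio(daily_views):
    """Ratio of the busiest day to the median day.

    A ratio near 1 suggests steady interest; a large ratio suggests a
    short-lived burst of attention, such as a topic being in the news.
    """
    baseline = median(daily_views)
    if baseline <= 0:
        return float("inf")
    return max(daily_views) / baseline


# Toy series (invented): one steady article, one with a single-day burst.
steady = [10, 10, 10]
bursty = [10, 12, 11, 500, 13, 9, 10]
steady_ratio = spike_ratio(steady)
bursty_ratio = spike_ratio(bursty)
```

Using the median itself (rather than the mean) as the pageview figure would similarly measure sustained interest instead of a one-off peak, although neither turns interest into importance.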

When we look at Wikipedia writers' decisions in tagging the importance of 
articles to WikiProjects, what do we find? As we know, project tags are often 
placed on new articles (and often not subsequently reviewed). So while I find 
that quality tags are often out-of-date, the importance seems to be pretty 
accurate even on new stub articles. This is because it is the importance of 
the *topic* that is being assessed which is independent of the Wikipedia 
article itself. Provided the article is clear enough about what it is about and 
why it matters (which is the traditional content of that first paragraph or two 
and failing to provide it will likely result in speedy deletion of the new 
article), assessment of the topic's importance can be made even at new stub 
level. This tells us that importance for Wikipedia writers is determined by 
something outside of Wikipedia (probably their real-world knowledge of that 
topic space -- one assumes that project taggers are quite interested in the 
topic space of that project). While article quality hopefully improves over 
time, I would be surprised if article importance greatly changed over time. 
Obviously there are counter-examples.  I am guessing Donald Trump's article may 
have grown in importance over time but that's probably because his lede para 
changed. Adding President of the USA into the lede paragraph makes him much 
more important than he was before in the real world and internal to Wikipedia 
he has acquired an inbound link from the presumably high-importance President 
of the USA article. So I think it might be interesting to study those articles 
whose importance does change over time to see if there are any strong 
correlations with what is happening to the article inside Wikipedia. I think 
this set of importance-changing articles may be where we really learn what 
Wikipedia article characteristics are strongly correlated to "importance" given 
that importance itself appears to be pretty stable for most articles.

Although not stated explicitly, I imagine we believe that generally less 
important articles tend to link to more important articles but more important 
articles don't link to less important articles. Hence in-bound links are 
likely to matter in assessing importance, and in-bound links from 
"important" articles are more valuable than in-bound links from less important 
articles (which creates something of a bootstrapping problem, similar to the 
issue faced by Google's PageRank algorithm). But I think we do have some information 
that Google doesn't have. The average webpage does not have a lede paragraph 
that situates the topic relative to other topics; a Wikipedia article does. If 
I have to choose to define Thing X in terms of Thing Y, it tends to suggest 
that Y is more important than X. If Y also defines itself in terms of X, then 
it tends to suggest they are equivalent in importance in some way. Indeed I 
suspect when we get to the VERY IMPORTANT topics we will see this kind of 
circular definition (e.g. you see circular definitions in Wikipedia around 
Philosophy and Knowledge). Aside, if you have never done this before, try this 
experiment. Choose a random article (left hand tool bar in Desktop Wikipedia), 
then click the first link in the article that matters (i.e. ignore links in 
hatnotes or links inside parentheses). Repeat this first link clicking and 
sooner or later you will reach articles like Knowledge and Philosophy, which 
all sit inside circular definition groups.
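
The first-link experiment described above amounts to walking a graph until a node repeats. Here is a minimal Python sketch, assuming the first meaningful link of each article has already been extracted into a dictionary. The mapping below uses invented placeholder titles and does not reflect real Wikipedia links.

```python
def follow_first_links(start, first_link):
    """Follow first links from `start` until an article repeats or has no link.

    Returns (path, cycle): the articles visited in order, and the circular
    definition group reached at the end (empty if the walk hit a dead end).
    """
    path = []
    seen = set()
    current = start
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = first_link.get(current)
    if current is None:
        return path, []
    # `current` was seen before, so everything from its first visit is a loop.
    return path, path[path.index(current):]


# Toy mapping (invented titles): article -> target of its first link.
toy_links = {
    "Stub": "Topic",
    "Topic": "Field",
    "Field": "Concept",
    "Concept": "Field",
}
path, cycle = follow_first_links("Stub", toy_links)
```

Running this from many random starting articles and counting how often each cycle is reached would give a crude picture of which circular definition groups sit at the "top".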

If we look at the Donald Trump article, his first sentence contains only two 
links, one to List of Presidents of the USA and the other to President of the 
USA. If we look at those two articles, we find that both of them mention 
Donald Trump in their lede paras (although not as early as the first sentence) 
and before mentions of any other US President elsewhere in the article. Which 
is consistent with what we know about the real world, the role of the President 
is more important than its officeholders and that the current officeholder has 
more importance than a past officeholder. So topic importance does seem to be 
skewed towards the "present day".

So I suspect the links in the lede paras are of greater relevance to the 
assessment of importance than links further down in the article, which are 
more likely to relate to details of a topic and may include examples and 
counter-examples (this is a way in which a high-importance article may mention 
much lower-importance articles). However, we do have to be a little bit careful 
here because of the MoS practice of not linking very common terms. For example, 
an Australian article will often refer to Australia in the lede para but it 
will almost certainly not be linked to the Australia article (and any attempt 
to add such a link will likely see it removed with an edit summary that 
mentions [[WP:Overlinking]]) whereas there is no problem if you link to an 
Australian state article, e.g. New South Wales. So we might find that some very 
important topics that often appear in ledes might get fewer links than you 
might expect because of the MoS policies on overlinking, which may be a problem 
when working with inbound links. It may be that for "very common topics" the 
presence of the article title (or its synonyms) in the lede may have to be 
considered as if it were an in-bound link for statistical research purposes.
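
Treating an unlinked mention in the lede as if it were an in-bound link could be sketched as follows. This is a minimal Python illustration under stated assumptions: the lede is available as raw wikitext, the function name `lede_reference` is hypothetical, and the matching is a simple case-insensitive whole-word search that ignores templates, redirects and inflected forms.

```python
import re

# Matches [[Target]] and [[Target|label]], capturing the target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def lede_reference(lede, title, synonyms=()):
    """Classify how a lede refers to a title: 'linked', 'unlinked' or 'absent'."""
    names = {name.lower() for name in (title, *synonyms)}
    targets = {m.group(1).strip().lower() for m in LINK.finditer(lede)}
    if names & targets:
        return "linked"
    # Remove all links so only plain (unlinked) text is searched.
    plain = LINK.sub(" ", lede).lower()
    for name in names:
        if re.search(r"\b" + re.escape(name) + r"\b", plain):
            return "unlinked"
    return "absent"


# Toy lede (invented wikitext) in the style of an Australian place article.
lede = "Dubbo is a city in [[New South Wales]], Australia."
state_ref = lede_reference(lede, "New South Wales")
country_ref = lede_reference(lede, "Australia")
```

For statistical purposes, "unlinked" results for very common topics could then be counted alongside real in-bound links, perhaps with a lower weight. Which topics count as "very common" is a separate decision this sketch does not address.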

Given all of the above, perhaps the most interesting group of articles to study 
in Wikipedia are those articles whose manually-assessed importance has changed 
over the life of the article AND which were NOT current topics in the lifetime 
of Wikipedia (given the influence of "current" on importance). But having said 
that, I wonder if that group of articles actually exists. Recently a newish 
Australian contributor expressed disappointment that all the new articles they 
had created were tagged (by others) as of Low Importance. My instinctive reply 
was "that's normal, I think of the thousands of articles I have started only a 
couple are even rated as Mid importance; this is because the really important 
articles were all started long ago precisely because they were important". I 
suspect topics that are very important (for reasons other than 
short-lived importance due to being "current" in the lifetime of Wikipedia) 
will generally show up as having started early in Wikipedia's life and that 
those that become more/less important over time will be largely linked to 
becoming or ceasing to be "current" topics. E.g. article Pasteurization 
started in May 2001 saying nothing more than "Pasteurization is the process of 
killing off bacteria in milk by quickly heating it to a near boiling 
temperature, then quickly cooling it again before the taste and other desirable 
properties are affected. The process was named after its inventor, French 
scientist Louis Pasteur. See also dairy products." The links in this very first 
version are still present in its lede paragraph today, suggesting our 
understanding of "non-current" topics is stable and hence initial importance 
determinations can probably be accurately made. For Pasteurization the Talk 
page shows it was not project-tagged until 2007 when it was assigned High 
Importance as its first assessment.

I suspect we will find that initial manual assessment of article importance 
will be pretty accurate for most articles. And I suspect if we plot initial 
importance assessments against time of assessment, we will find the higher 
importance articles commenced life on Wikipedia earlier than the lower 
importance articles. If I am correct, then there isn't a lot of value in 
machine-assessment of importance of topics because it relates to factors 
external to Wikipedia and often does not change over time and therefore can 
often be correctly assessed manually even on new stub articles (and any 
unassessed articles can probably be rated as Low Importance as statistically 
that's almost certainly going to be correct). If a topic becomes more important 
due to "current" events, then invariably that article will be updated by many 
people and one of them will sooner or later manually adjust its importance. 
What is less likely to happen is re-assessing downwards of Importance when an 
important "current" topic loses its importance when it is no longer current, 
e.g. are former American presidents like Barack Obama or George W Bush or 
further back less important now? These articles will not be updated frequently 
once the topic is no longer in the news and therefore it is less likely an 
editor will notice and manually downgrade the importance, so there may be a 
greater role for machine-assessment in downgrading importance rather than 
upgrading importance.

Another area where there might be a role for machine-assessed importance is 
in regard to POV-pushing, where a POV-motivated editor might change the 
manually-assessed importance of articles to be higher or lower based on their 
POV (e.g. my political party is Top Importance, other parties are of Low 
Importance). I suspect that often a page watcher would correct or at least 
question that kind of re-assessment. However, on articles with few active 
pagewatchers you might get away with POV-pushing the article's importance tag 
because nobody notices. In this situation, a machine assessment could be useful 
in spotting this kind of thing.

This suggests that another metric of interest to importance might be number of 
pagewatchers, although I suspect that pagewatching may relate more to caring 
about the article than to caring about the topic. And one has to be careful to 
distinguish active pagewatchers (those who actually do review changes on their 
watchlists) from those who don't, as that may make a difference (although I am 
not sure we can really tell which pagewatchers are truly actively reviewing as 
a "satisfactory review" doesn't leave a trace whereas an "unsatisfactory" 
review is likely to lead to a revert relatively soon or some other change to 
the article, the article Talk or the User Talk of the reviewed contributor, which 
may be detectable).

The other aspect of articles that occurs to me as being possibly linked to 
importance of the topic would be use of the article as the "main" article for a 
category or as the title of a navbox (as it suggests that the articles in the 
category or navbox are in some way subordinate to the main/title article). 
Similarly for list articles, the "type" of the list is often more important 
than its instances.

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of Morten Wang
Sent: Friday, 21 April 2017 6:04 AM
To: Research into Wikimedia content and communities 
<wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification of 
article importance

Hi Pine,

These are great pointers to existing practices on enwiki, some of which I've 
been looking for and/or missed, thanks!


Cheers,
Morten

On 19 April 2017 at 22:35, Pine W <wiki.p...@gmail.com> wrote:

> Hi Nettrom,
>
> A few resources from English Wikipedia regarding article importance as
> ranked by humans:
>
> https://en.wikipedia.org/wiki/Wikipedia:Vital_articles
>
> https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic
>
> https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics
>
> I infer from the ENWP Wikicup's scoring protocol that for purposes of
> the competition, an article's "importance" is loosely inferred from
> the number of language editions of Wikipedia in which the article appears:
> https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.
>
> HTH,
>
> Pine
>
>
> On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <nett...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > I am currently working with Aaron Halfaker and Dario Taraborelli at
> > the Wikimedia Foundation on a project exploring automated
> > classification of article importance. Our goal is to characterize
> > the importance of an article within a given context and design a
> > system to predict a relative importance rank. We have a project page
> > on meta[1] and welcome comments or
> > thoughts on our talk page. You can of course also respond here on
> > wiki-research-l, or send me an email.
> >
> > Before moving on to model-building I did a fairly thorough
> > literature review, finding a myriad of papers spanning several
> > disciplines. We have a
> > draft literature review also up on meta[2], which should give you a
> > reasonable introduction to the topic. Again, comments or thoughts (e.g.
> > papers we’ve missed) on the talk page, mailing list, or through
> > email are welcome.
> >
> > Links:
> >
> >    1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
> >    2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance
> >
> > Regards,
> > Morten
> > [[User:Nettrom]] aka [[User:SuggestBot]]
> > _______________________________________________
> > Wiki-research-l mailing list
> > Wiki-research-l@lists.wikimedia.org 
> > <mailto:Wiki-research-l@lists.wikimedia.org> 
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >
