Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

Morten Wang Tue, 20 Sep 2016 10:03:06 -0700

I don't know of a clean, language-independent way of grabbing all stubs.
Stuart's suggestion is quite sensible, at least for English Wikipedia. When
I last checked a few years ago, the mean length of an English-language stub
(on a log scale) was around 1kB (including all markup), and stubs were
much smaller than articles in any other class.
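
As a rough illustration of what a mean "on a log scale" amounts to, here is
a minimal Python sketch. It averages the logarithms of the article lengths
and exponentiates the result (a geometric mean). The function name and the
sample lengths are made up for illustration; they are not the data from the
check described above.

```python
import math


def log_scale_mean(lengths_in_bytes):
    """Average the logs of the lengths, then exponentiate (geometric mean)."""
    logs = [math.log(n) for n in lengths_in_bytes if n > 0]
    if not logs:
        raise ValueError("need at least one positive length")
    return math.exp(sum(logs) / len(logs))


# Hypothetical article sizes in bytes, including markup.
sample_lengths = [500, 1000, 2000]
print(round(log_scale_mean(sample_lengths)))  # prints 1000
```

A log-scale mean is less sensitive to a few very long pages than a plain
arithmetic mean, which is why it suits skewed length distributions.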


I'd also see if the category system allows for some straightforward
retrieval. English has
https://en.wikipedia.org/wiki/Category:Stub_categories and
https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
other languages, which could be a good starting point. In some of the
research we've done on quality, exploiting regularities in the category
system through database access (in other words, LIKE queries) is a quick
way to grab most articles.
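
A minimal sketch of such a LIKE query, in Python against an in-memory
SQLite database. The tables are a toy stand-in for MediaWiki's page and
categorylinks tables, with only a few columns; the sample rows, the
function name and the "%_stubs" naming pattern are assumptions for
illustration, and other languages will need their own pattern.

```python
import sqlite3

# Toy replica of two MediaWiki tables, reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT
);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES
    (1, 0, 'Example_village'),
    (2, 0, 'Example_city'),
    (3, 1, 'Example_village');
INSERT INTO categorylinks VALUES
    (1, 'Norway_geography_stubs'),
    (1, 'Villages_in_Norway'),
    (2, 'Cities_in_Norway'),
    (3, 'Stub-Class_Norway_articles');
""")


def find_stub_titles(conn, pattern="%\\_stubs"):
    """Return main-namespace titles in any category matching the pattern.

    The underscore is a LIKE wildcard, so it is escaped with a backslash.
    """
    rows = conn.execute(
        """
        SELECT DISTINCT page_title
        FROM page
        JOIN categorylinks ON cl_from = page_id
        WHERE page_namespace = 0
          AND cl_to LIKE ? ESCAPE '\\'
        ORDER BY page_title
        """,
        (pattern,),
    )
    return [title for (title,) in rows]


print(find_stub_titles(conn))  # prints ['Example_village']
```

Passing the pattern as a parameter keeps the per-language difference down
to one string per wiki.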

Combining both approaches might work well. If you're looking for an even
more thorough classification, grabbing a set and training a classifier
might be the way to go.


Cheers,
Morten


On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:

> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
> cutoff. There is weaponised javascript to measure that at en:WP:Did you
> know/DYKcheck
>
> Probably doesn't translate to CJK languages which have radically different
> information content per character.
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>
>> Hi everyone,
>>
>> Does anyone know if there's a straightforward (ideally
>> language-independent) way of identifying stub articles in Wikipedia?
>>
>> Whatever works is ok, whether it's publicly available data or data
>> accessible only on the WMF cluster.
>>
>> I've found lists for various languages (e.g., Italian
>> <https://it.wikipedia.org/wiki/Categoria:Stub> or English
>> <https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the
>> lists are in different formats, so separate code is required for each
>> language, which doesn't scale.
>>
>> I guess in the worst case, I'll have to grep for the respective stub
>> templates in the respective wikitext dumps, but even this requires knowing,
>> for each language, what the respective template is. So if anyone could point
>> me to a list of stub templates in different languages, that would also be
>> appreciated.
>>
>> Thanks!
>> Bob
>>
>> --
>> Up for a little language game? -- http://www.unfun.me
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
