I don't know of a clean, language-independent way of grabbing all stubs. Stuart's suggestion is quite sensible, at least for English Wikipedia. When I last checked a few years ago, the mean length of an English-language stub (on a log scale) was around 1 kB (including all markup), and stubs were much smaller than articles in any other class.
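To make the length heuristic concrete, here is a minimal Python sketch of what I mean by a mean "on a log scale" (that is, a geometric mean of page sizes) and of a simple size cutoff. The function names and the 2,000-byte cutoff are made up for illustration; an actual cutoff would need to be tuned per language against pages already labelled as stubs.

```python
import math


def size_in_bytes(wikitext: str) -> int:
    """Page size in bytes, counting all markup (UTF-8 encoded)."""
    return len(wikitext.encode("utf-8"))


def log_mean_size(pages: list[str]) -> float:
    """Mean page size computed on a log scale (the geometric mean), in bytes."""
    # Clamp to 1 byte so that an empty page does not trigger log(0).
    logs = [math.log(max(size_in_bytes(p), 1)) for p in pages]
    return math.exp(sum(logs) / len(logs))


def looks_like_stub(wikitext: str, cutoff_bytes: int = 2000) -> bool:
    """Flag a page as a likely stub if it is at most cutoff_bytes long.

    The default cutoff is a hypothetical placeholder, not a measured value.
    """
    return size_in_bytes(wikitext) <= cutoff_bytes
```

Running log_mean_size over the pages in a language's stub category would give a rough reference point for choosing cutoff_bytes. Counting bytes rather than characters also changes the picture for scripts that need several bytes per character.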
I'd also see if the category system allows for some straightforward retrieval. English has https://en.wikipedia.org/wiki/Category:Stub_categories and https://en.wikipedia.org/wiki/Category:Stubs, with quite a lot of links to other languages, which could be a good starting point. For some of the research we've done on quality, exploiting regularities in the category system using database access (in other words, LIKE queries) is a quick way to grab most articles.

A combination of both approaches might work well. If you're looking for even more thorough classification, grabbing a set and training a classifier might be the way to go.

Cheers,
Morten

On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:

> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
> cutoff. There is weaponised javascript to measure that at en:WP:Did you
> know/DYKcheck
>
> Probably doesn't translate to CJK languages which have radically different
> information content per character.
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>
>> Hi everyone,
>>
>> Does anyone know if there's a straightforward (ideally
>> language-independent) way of identifying stub articles in Wikipedia?
>>
>> Whatever works is ok, whether it's publicly available data or data
>> accessible only on the WMF cluster.
>>
>> I've found lists for various languages (e.g., Italian
>> <https://it.wikipedia.org/wiki/Categoria:Stub> or English
>> <https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the
>> lists are in different formats, so separate code is required for each
>> language, which doesn't scale.
>>
>> I guess in the worst case, I'll have to grep for the respective stub
>> templates in the respective wikitext dumps, but even this requires knowing
>> for each language what the respective template is. So if anyone could point
>> me to a list of stub templates in different languages, that would also be
>> appreciated.
>>
>> Thanks!
>> Bob
>>
>> --
>> Up for a little language game? -- http://www.unfun.me
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l