If I were going to do this analysis[1], I'd use mediawiki-utilities to build an XML reader script that uses mwparserfromhell to parse a random sample of articles (1/1000 or so) and extract all headers by level, giving a dataset with the columns <page_id>, <header_level>, <heading>, <normal_header>.
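For anyone picking this up, here is a minimal sketch of the extraction step only, not a tested pipeline. It assumes you already have a page's ID and wikitext in hand; reading the XML and drawing the 1/1000 sample are left out. `HeaderRow`, `iter_headings`, `header_rows` and `normalize` are hypothetical names, the normalization is a guess at what the <normal_header> column would hold, and the regex fallback (used only when mwparserfromhell is not installed) is a rough stand-in that recognizes only "== Title ==" style lines.

```python
import re
from typing import Iterator, NamedTuple, Tuple

try:
    import mwparserfromhell  # third-party wikitext parser named in the message
except ImportError:  # assumption: fall back to a crude regex if it is missing
    mwparserfromhell = None


class HeaderRow(NamedTuple):
    """One dataset row; field names follow the columns listed in the message."""
    page_id: int
    header_level: int
    heading: str
    normal_header: str


# Fallback only: a whole line of the form "== Title ==" with 2 to 6 equals signs.
HEADING_LINE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def normalize(heading: str) -> str:
    """Lower-case, drop punctuation, collapse whitespace runs to one space."""
    no_punct = re.sub(r"[^\w\s]", "", heading.lower())
    return " ".join(no_punct.split())


def iter_headings(wikitext: str) -> Iterator[Tuple[int, str]]:
    """Yield (level, heading text) for each section heading in the wikitext."""
    if mwparserfromhell is not None:
        for node in mwparserfromhell.parse(wikitext).filter_headings():
            # strip_code() drops markup such as links or bold inside the title.
            yield node.level, node.title.strip_code().strip()
    else:
        for match in HEADING_LINE.finditer(wikitext):
            yield len(match.group(1)), match.group(2)


def header_rows(page_id: int, wikitext: str) -> Iterator[HeaderRow]:
    """Build one HeaderRow per heading found on the page."""
    for level, heading in iter_headings(wikitext):
        yield HeaderRow(page_id, level, heading, normalize(heading))


if __name__ == "__main__":
    sample = "Intro.\n\n== Early life ==\nText.\n\n=== Education ===\nMore.\n"
    for row in header_rows(1, sample):
        print(row)
```

Keeping both <heading> and <normal_header> in each row means the normalization rule can be changed later without re-parsing the sample.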
I'd do some simple normalization: convert to lower case, remove punctuation, and reduce all contiguous whitespace to a single space char between "words". Then I'd run an aggregation over that dataset to get your answer. If anyone wants to pick this up, I'm happy to advise.

1. Which I might, but I'm unlikely to find time soon.

-Aaron

On Mon, Jul 13, 2015 at 4:39 PM, Jonathan Morgan <jmor...@wikimedia.org> wrote:

> You can get section titles (and hierarchy) directly from the API, though I
> don't know if this approach scales the way you need it to:
> https://en.wikipedia.org/w/api.php?action=parse&page=Albania&prop=sections&format=jsonfm
>
> On Mon, Jul 13, 2015 at 1:52 PM, Amir E. Aharoni <
> amir.ahar...@mail.huji.ac.il> wrote:
>
>> Yes, that's the idea more or less, but I'm not sure that our search
>> engine is able to search for headings, though I might be wrong. I suspect,
>> however, that it will be required to process dumps article by article (or
>> at least a random sample), and in big projects this could be extremely time
>> consuming. But maybe there's a faster way of which I am not aware?
>>
>>
>> --
>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>> http://aharoni.wordpress.com
>> “We're living in pieces,
>> I want to live in peace.” – T. Moore
>>
>> 2015-07-13 23:41 GMT+03:00 Pine W <wiki.p...@gmail.com>:
>>
>>> Would it be possible to run a search on the full text of Wikipedias for
>>> lines that start and end with "==", "===", "====", and lines that start
>>> with ";", then make a list of those strings, and count the number of times
>>> that each title appears in the list?
>>>
>>> Pine
>>> On Jul 13, 2015 10:29 AM, "Jonathan Morgan" <jmor...@wikimedia.org>
>>> wrote:
>>>
>>>> Cross-posting this request to wiki-research-l. Anyone have data on
>>>> frequently used section titles in articles (any language), or know of
>>>> datasets/publications that examined this?
>>>>
>>>> I'm not aware of any off the top of my head, Amir.
>>>>
>>>> - Jonathan
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Amir E. Aharoni <amir.ahar...@mail.huji.ac.il>
>>>> Date: Sat, Jul 11, 2015 at 3:29 AM
>>>> Subject: [Wikitech-l] statistics about frequent section titles
>>>> To: Wikimedia developers <wikitec...@lists.wikimedia.org>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Did anybody ever try to collect statistics about frequent section
>>>> titles in Wikimedia projects?
>>>>
>>>> For Wikipedia, for example, titles such as "Biography", "Early life",
>>>> "Bibliography", "External links", "References", "History", etc., appear
>>>> in a lot of articles, and their counterparts appear in a lot of languages.
>>>>
>>>> There are probably similar things in Wikivoyage, Wiktionary and possibly
>>>> other projects.
>>>>
>>>> Did anybody ever try to collect statistics of the most frequent section
>>>> titles in each language and project?
>>>>
>>>> --
>>>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>>>> http://aharoni.wordpress.com
>>>> “We're living in pieces,
>>>> I want to live in peace.” – T. Moore
>>>> _______________________________________________
>>>> Wikitech-l mailing list
>>>> wikitec...@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan T. Morgan
>>>> Senior Design Researcher
>>>> Wikimedia Foundation
>>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>
>>>>
>>>> _______________________________________________
>>>> Wiki-research-l mailing list
>>>> Wiki-research-l@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>>
>
> --
> Jonathan T. Morgan
> Senior Design Researcher
> Wikimedia Foundation
> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l