You can get section titles (and hierarchy) directly from the API, though I don't know if this approach scales the way you need it to: https://en.wikipedia.org/w/api.php?action=parse&page=Albania&prop=sections&format=jsonfm
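For what it's worth, here is a minimal sketch of how that call could be scripted, assuming the JSON layout the parse module returns for prop=sections (a "parse" object holding a "sections" list whose entries carry "toclevel" and "line"). It uses format=json rather than jsonfm, since jsonfm is the pretty-printed HTML view meant for reading in a browser. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

API_URL = "https://en.wikipedia.org/w/api.php"


def section_titles(response):
    """Return (toclevel, title) pairs from a decoded action=parse response.

    Assumes each section entry has "toclevel" and "line" keys; "line" can
    contain HTML markup (for example italics), which is left untouched here.
    """
    sections = response.get("parse", {}).get("sections", [])
    return [(int(s["toclevel"]), s["line"]) for s in sections]


def fetch_sections(page):
    """Ask the API for the sections of one page (one HTTP request per page)."""
    params = urllib.parse.urlencode(
        {"action": "parse", "page": page, "prop": "sections", "format": "json"}
    )
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        # Hypothetical User-Agent; replace with something that identifies you.
        headers={"User-Agent": "section-title-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def count_titles(pages):
    """Tally section titles over an iterable of page names."""
    counts = Counter()
    for page in pages:
        counts.update(title for _, title in section_titles(fetch_sections(page)))
    return counts
```

Calling count_titles(["Albania"]).most_common(10) would list the ten most frequent titles in that one article. Since this makes one request per page, it is probably only practical for a sample, not for a whole wiki.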
On Mon, Jul 13, 2015 at 1:52 PM, Amir E. Aharoni <amir.ahar...@mail.huji.ac.il> wrote:

> Yes, that's the idea more or less, but I'm not sure that our search engine
> is able to search for headings, though I might be wrong. I suspect,
> however, that it will be required to process dumps article by article (or
> at least a random sample), and in big projects this could be extremely time
> consuming. But maybe there's a faster way of which I am not aware?
>
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> “We're living in pieces,
> I want to live in peace.” – T. Moore
>
> 2015-07-13 23:41 GMT+03:00 Pine W <wiki.p...@gmail.com>:
>
>> Would it be possible to run a search on the full text of Wikipedias for
>> lines that start and end with "==", "===", "====", and lines that start
>> with ";", then make a list of those strings, and count the number of times
>> that each title appears in the list?
>>
>> Pine
>> On Jul 13, 2015 10:29 AM, "Jonathan Morgan" <jmor...@wikimedia.org>
>> wrote:
>>
>>> Cross-posting this request to wiki-research-l. Anyone have data on
>>> frequently used section titles in articles (any language), or know of
>>> datasets/publications that examined this?
>>>
>>> I'm not aware of any off the top of my head, Amir.
>>>
>>> - Jonathan
>>>
>>> ---------- Forwarded message ----------
>>> From: Amir E. Aharoni <amir.ahar...@mail.huji.ac.il>
>>> Date: Sat, Jul 11, 2015 at 3:29 AM
>>> Subject: [Wikitech-l] statistics about frequent section titles
>>> To: Wikimedia developers <wikitec...@lists.wikimedia.org>
>>>
>>>
>>> Hi,
>>>
>>> Did anybody ever try to collect statistics about frequent section titles
>>> in Wikimedia projects?
>>>
>>> For Wikipedia, for example, titles such as "Biography", "Early life",
>>> "Bibliography", "External links", "References", "History", etc., appear
>>> in a lot of articles, and their counterparts appear in a lot of languages.
>>>
>>> There are probably similar things in Wikivoyage, Wiktionary and possibly
>>> other projects.
>>>
>>> Did anybody ever try to collect statistics of the most frequent section
>>> titles in each language and project?
>>>
>>> --
>>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>>> http://aharoni.wordpress.com
>>> “We're living in pieces,
>>> I want to live in peace.” – T. Moore
>>> _______________________________________________
>>> Wikitech-l mailing list
>>> wikitec...@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>>
>>> --
>>> Jonathan T. Morgan
>>> Senior Design Researcher
>>> Wikimedia Foundation
>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>

--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
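Pine's suggestion quoted above (collect lines that start and end with "==", "===", "====", plus lines that start with ";", then count them) could be sketched roughly as follows. This assumes you already have the raw wikitext of each article as a string, for example from a dump; how the dump is read is left out, and the function names are made up for illustration.

```python
import re
from collections import Counter

# A heading line: the same run of 2 to 6 "=" on both sides of the title.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# A ";" line: everything after the semicolon is kept as the title.
# Any ": definition" part on the same line is not split off here.
DEFINITION = re.compile(r"^;\s*(.+?)\s*$")


def heading_titles(wikitext):
    """Return the titles of heading lines and ';' lines, in document order."""
    titles = []
    for line in wikitext.splitlines():
        match = HEADING.match(line) or DEFINITION.match(line)
        if match:
            # The title is the last capture group in either pattern.
            titles.append(match.group(match.lastindex))
    return titles


def count_titles(texts):
    """Tally titles over an iterable of wikitext strings (one per article)."""
    counts = Counter()
    for text in texts:
        counts.update(heading_titles(text))
    return counts
```

Because this works line by line, it can be run over a random sample of articles as easily as over a full dump, which may help with the time concern Amir raises. It does not expand templates, so headings produced by templates would be missed.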
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l