If I were going to do this analysis[1], I'd use mediawiki-utilities to
build an XML dump reader script that uses mwparserfromhell to parse a
random sample of articles (1/1000 or so) and extract all headers by level.
That gives a dataset with the columns <page_id>, <header_level>, <heading>,
<normal_header>.

To fill in <normal_header>, I'd do some simple normalization: convert to
lower case, remove punctuation, and reduce all contiguous whitespace to a
single space char between "words". Then I'd run an aggregation over that
dataset (counting rows per normalized header) to get your answer.
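
For anyone picking this up, here is a rough Python sketch of the
normalization and aggregation steps. It is only an illustration under some
assumptions: it uses a plain regular expression as a stand-in for
mwparserfromhell (which would handle comments, <nowiki> and templates more
reliably), it skips the dump reading and sampling entirely, and the function
names and sample pages are made up.

```python
import re
from collections import Counter

# Stand-in for a real wikitext parser: matches heading lines such as
# "== History ==" (levels 1-6). This is an assumption for illustration,
# not what mwparserfromhell does internally.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def normalize_heading(heading):
    """Lowercase, drop punctuation, collapse whitespace runs to one space."""
    lowered = heading.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())


def extract_headings(page_id, wikitext):
    """Return (page_id, header_level, heading, normal_header) rows."""
    rows = []
    for match in HEADING_RE.finditer(wikitext):
        level = len(match.group(1))
        heading = match.group(2)
        rows.append((page_id, level, heading, normalize_heading(heading)))
    return rows


def count_headings(rows):
    """Aggregate: how often each (level, normalized heading) pair occurs."""
    return Counter((level, normal) for _, level, _, normal in rows)


# Hypothetical sample pages standing in for a 1/1000 sample of a dump.
sample_pages = {
    1: "Intro.\n== Early life ==\nText.\n=== Education ===\n== External links ==\n",
    2: "== Early Life ==\nText.\n== External links: ==\n",
}

rows = [
    row
    for page_id, text in sample_pages.items()
    for row in extract_headings(page_id, text)
]
counts = count_headings(rows)

for (level, heading), n in counts.most_common():
    print(level, heading, n)
```

In a real run, the rows would come from the sampled dump pages rather than
the hard-coded dictionary, and non-Latin scripts may need a gentler
punctuation rule than the one used here.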

If anyone wants to pick this up, I'm happy to advise.

1. which I might, but I'm unlikely to find time soon

-Aaron

On Mon, Jul 13, 2015 at 4:39 PM, Jonathan Morgan <jmor...@wikimedia.org>
wrote:

> You can get section titles (and hierarchy) directly from the API, though I
> don't know if this approach scales the way you need it to:
> https://en.wikipedia.org/w/api.php?action=parse&page=Albania&prop=sections&format=jsonfm
>
> On Mon, Jul 13, 2015 at 1:52 PM, Amir E. Aharoni <
> amir.ahar...@mail.huji.ac.il> wrote:
>
>> Yes, that's the idea more or less, but I'm not sure that our search
>> engine is able to search for headings, though I might be wrong. I suspect,
>> however, that it will be required to process dumps article by article (or
>> at least a random sample), and in big projects this could be extremely
>> time-consuming. But maybe there's a faster way of which I am not aware?
>>
>>
>> --
>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>> http://aharoni.wordpress.com
>> ‪“We're living in pieces,
>> I want to live in peace.” – T. Moore‬
>>
>> 2015-07-13 23:41 GMT+03:00 Pine W <wiki.p...@gmail.com>:
>>
>>> Would it be possible to run a search on the full text of Wikipedias for
>>> lines that start and end with "==", "===", "====", and lines that start
>>> with ";", then make a list of those strings, and count the number of times
>>> that each title appears in the list?
>>>
>>> Pine
>>> On Jul 13, 2015 10:29 AM, "Jonathan Morgan" <jmor...@wikimedia.org>
>>> wrote:
>>>
>>>> Cross-posting this request to wiki-research-l. Anyone have data on
>>>> frequently used section titles in articles (any language), or know of
>>>> datasets/publications that examined this?
>>>>
>>>> I'm not aware of any off the top of my head, Amir.
>>>>
>>>> - Jonathan
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Amir E. Aharoni <amir.ahar...@mail.huji.ac.il>
>>>> Date: Sat, Jul 11, 2015 at 3:29 AM
>>>> Subject: [Wikitech-l] statistics about frequent section titles
>>>> To: Wikimedia developers <wikitec...@lists.wikimedia.org>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Did anybody ever try to collect statistics about frequent section
>>>> titles in Wikimedia projects?
>>>>
>>>> For Wikipedia, for example, titles such as "Biography", "Early life",
>>>> "Bibliography", "External links", "References", "History", etc., appear
>>>> in a lot of articles, and their counterparts appear in a lot of
>>>> languages.
>>>>
>>>> There are probably similar things in Wikivoyage, Wiktionary and possibly
>>>> other projects.
>>>>
>>>> Did anybody ever try to collect statistics of the most frequent section
>>>> titles in each language and project?
>>>>
>>>> --
>>>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>>>> http://aharoni.wordpress.com
>>>> ‪“We're living in pieces,
>>>> I want to live in peace.” – T. Moore‬
>>>> _______________________________________________
>>>> Wikitech-l mailing list
>>>> wikitec...@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan T. Morgan
>>>> Senior Design Researcher
>>>> Wikimedia Foundation
>>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>
>>>>
>>>> _______________________________________________
>>>> Wiki-research-l mailing list
>>>> Wiki-research-l@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Jonathan T. Morgan
> Senior Design Researcher
> Wikimedia Foundation
> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>
>
>
>
