Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2016-04-06 Thread Aaron Halfaker
Ran into some issues with downloading large files and forgot to post this
earlier.

http://paws-public.wmflabs.org/paws-public/6877667/projects/headings/datasets/enwiki_20160204_headings.tsv.bz2

Columns:

   - "page_id" : int
      - The identifier of the article
   - "page_title"
      - The title of the article
   - "heading_level"
      - The level of the heading in question
   - "heading_text"
      - The text of the heading
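
For anyone who wants to poke at a file with these four columns, here is a minimal sketch in Python. It assumes the file is bz2-compressed, tab-separated, unquoted, and starts with a header row naming the columns; none of that is stated above, so check the first few lines before trusting it. The function names and the local path in the usage note are hypothetical.

```python
import bz2
import csv
from collections import Counter


def read_headings(path):
    """Yield one dict per heading row from a bz2-compressed TSV file.

    Assumption: the first line is a header row naming the columns
    page_id, page_title, heading_level and heading_text.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters in headings as ordinary text
        # (assumes the file was written without CSV-style quoting).
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def count_heading_texts(rows):
    """Count how many rows carry each heading_text value."""
    return Counter(row["heading_text"] for row in rows)
```

Something like `count_heading_texts(read_headings("enwiki_20160204_headings.tsv.bz2")).most_common(20)` would then list the most frequent raw heading strings, once the file has been downloaded locally.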

Enjoy!

-Aaron

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2016-03-07 Thread Yuvi Panda
Just also wanted to note that these paws-public URLs will break in the
near-to-mid future :)




-- 
Yuvi Panda T
http://yuvi.in/blog



Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2016-03-07 Thread Aaron Halfaker
Got some work done here.  I'm using this as an opportunity to test out
PAWS.

See
http://paws-public.wmflabs.org/paws-public/EpochFail/projects/headings/extract_headings.ipynb

It's still running right now, but I should have an output file that we can
download and/or load into MySQL soon.

-Aaron


Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2016-03-02 Thread Tilman Bayer
Bumping this thread - has anyone made progress on this, for example to
determine the percentage of enwiki articles that contain one of these
standard sections?

(I'm also curious how Danny B - BCCed - generates the lists at
https://cs.wiktionary.org/wiki/User:Danny_B./Datamining/Nadpisy that
Petr mentioned earlier in this thread.)


Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2015-07-22 Thread Aaron Halfaker
If I were going to do this analysis[1], I'd use mediawiki-utilities to
build an XML reader script that would use mwparserfromhell to parse a
random sample of articles (1/1000 or so) and extract all headers by level
to get a dataset with page ID, page title, header level, and header text.

I'd do some simple normalization to lower case, remove punctuation and
reduce all contiguous whitespace to a single space char between "words".
Then I'd run an aggregation over that dataset to get your answer.
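
The normalization and aggregation steps could be sketched roughly as follows. This is only an illustration, not a tested pipeline: it covers the normalization and counting, not the dump parsing, and it treats "punctuation" as ASCII punctuation only, which is an assumption that will not hold for every language. The function names are hypothetical.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_heading(text):
    """Lower-case, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCTUATION)
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so joining with " " leaves exactly one
    # space between words.
    return " ".join(text.split())


def aggregate(headings):
    """Count occurrences of each normalized heading."""
    return Counter(normalize_heading(h) for h in headings)
```

One caveat with this choice: deleting punctuation outright merges hyphenated words ("Early-life" becomes "earlylife"), so replacing punctuation with a space may be preferable depending on the question being asked.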

If anyone wants to pick this up, I'm happy to advise.

1. which I might, but I'm unlikely to find time soon

-Aaron



Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2015-07-13 Thread Jonathan Morgan
You can get section titles (and hierarchy) directly from the API, though I
don't know if this approach scales the way you need it to:
https://en.wikipedia.org/w/api.php?action=parse&page=Albania&prop=sections&format=jsonfm
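
A rough sketch of calling that endpoint from a script, using format=json instead of the human-readable jsonfm. The response layout used here (a "parse" object holding a "sections" list whose entries carry "level" and "line") is what I would expect from action=parse, but treat it as an assumption and check it against a real response. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def sections_url(page_title):
    """Build the action=parse URL that asks only for the section list."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "sections",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_sections(response):
    """Return (level, heading text) pairs from a decoded API response.

    Assumes the layout {"parse": {"sections": [{"level": "2",
    "line": "History", ...}, ...]}}; "line" may contain HTML markup.
    """
    return [(int(s["level"]), s["line"]) for s in response["parse"]["sections"]]


def fetch_sections(page_title):
    """Fetch and decode the section list for one page (one HTTP request)."""
    request = urllib.request.Request(
        sections_url(page_title),
        headers={"User-Agent": "heading-stats-sketch/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request) as resp:
        return extract_sections(json.load(resp))
```

This costs one request per page, which is why it probably does not scale to a whole wiki; it seems better suited to spot checks or small samples.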



-- 
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) 


Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2015-07-13 Thread Amir E. Aharoni
Yes, that's the idea more or less, but I'm not sure that our search engine
is able to search for headings, though I might be wrong. I suspect,
however, that it will be required to process dumps article by article (or
at least a random sample), and in big projects this could be extremely time
consuming. But maybe there's a faster way of which I am not aware?


--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬



Re: [Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2015-07-13 Thread Pine W
Would it be possible to run a search on the full text of Wikipedias for
lines that start and end with "==", "===", "====", and lines that start
with ";", then make a list of those strings, and count the number of times
that each title appears in the list?
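
As a rough illustration of that line-matching idea applied to raw wikitext (not to the search engine), here is a Python sketch. The regular expressions are my own guess at what "start and end with equals signs" should mean, and they ignore comments, nowiki blocks and headings produced by templates, so the counts would only be approximate. All names are hypothetical.

```python
import re
from collections import Counter

# A heading line: the same run of 2 to 6 "=" on both sides of the text.
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# A line starting with ";" (a definition-list term used as a pseudo-heading).
SEMICOLON_RE = re.compile(r"^;\s*(.+?)\s*$")


def extract_titles(wikitext):
    """Return the heading-like strings found in one page of wikitext."""
    titles = []
    for line in wikitext.splitlines():
        m = HEADING_RE.match(line) or SEMICOLON_RE.match(line)
        if m:
            # The title is the last capture group in either pattern.
            titles.append(m.group(m.lastindex))
    return titles


def count_titles(pages):
    """Count how often each title string appears across many pages."""
    counts = Counter()
    for wikitext in pages:
        counts.update(extract_titles(wikitext))
    return counts
```

Feeding `count_titles` the text of every page in a dump (or a random sample) would produce the list-and-count described above.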

Pine


[Wiki-research-l] Fwd: [Wikitech-l] statistics about frequent section titles

2015-07-13 Thread Jonathan Morgan
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?

I'm not aware of any off the top of my head, Amir.

- Jonathan

-- Forwarded message --
From: Amir E. Aharoni 
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers 


Hi,

Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?

For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.

There are probably similar things in Wikivoyage, Wiktionary and possibly
other projects.

Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬
___
Wikitech-l mailing list
wikitec...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) 