Hi all, and thanks for your suggestions. I will have a look at all of them; I started with 'mwparserfromhell' - helluva name!
I found my way with:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    import mwparserfromhell

    API_URL = "https://en.wikipedia.org/w/api.php"

    def parse(title):
        data = {"action": "query", "prop": "revisions", "rvlimit": 1,
                "rvprop": "content", "format": "json", "titles": title}
        raw = urlopen(API_URL, urlencode(data).encode()).read()
        res = json.loads(raw)
        text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"]
        return mwparserfromhell.parse(text)

    test = parse('DNA')
    # and then test.filter_wikilinks()

Some links are like [[Gunther Stent|Stent, Gunther Siegmund]], i.e. with a '|' in the middle. The first part is the canonical form, but what does the second token represent?

E.g. if I try:

    parse('Stent, Gunther Siegmund')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in parse
    KeyError: 'revisions'

What does this error mean? Is it a redirect?

A few more questions about these tools:

1. Where can I find documentation about the use of methods in mwparserfromhell? E.g. the filter_wikilinks() method takes arguments - which ones can I use? I could not find much here:
http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html?highlight=filter_templates#mwparserfromhell.wikicode.Wikicode.filter_templates

2. I would like to use it with a generator over a dump: I understand this module would do the job of fetching pages and piping them to mwh [aka mwparserfromhell :)]:
https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
Correct? Any docs around for this as well?

3. How to handle redirects AND/OR curid? As an example, dbpedia analysed recursive redirects; they call it a "transitive redirect". Is there any module to handle transitive redirects / to handle redirects (I would do it recursively)? E.g.
if I use the example above:

    parse('dna')
    # u'#REDIRECT [[DNA]] {{R from other capitalisation}}'

I would like to obtain 'DNA' directly, or even better a module returning the _ID of the page (so far I build it myself with a dictionary; I'd like to ask the MW team if they can already suggest some tool to handle recursive redirects more efficiently).

On Wed, Jan 20, 2016 at 8:27 AM, Morten Wang <[email protected]> wrote:

> Luigi,
>
> Here's an example where I use mwparserfromhell to extract links, see the
> analyse() method, particularly lines 24 and 36–44:
> https://github.com/nettrom/Wiki-Class/blob/master/wikiclass/features/metrics/wikitext.py
>
> You can download the dumps and use Aaron's mwxml library example to
> process them, for instance by modifying the code so that it uses
> mwparserfromhell to parse the revision text (although that requires far
> more processing time) instead of regular expressions.
>
> Cheers,
> Morten
>
> On 18 January 2016 at 16:23, Luigi Assom <[email protected]> wrote:
>
>> hi, thank you.
>>
>> Where can I find documentation for an example to extract links -
>> https://github.com/earwig/mwparserfromhell
>> or
>> https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py
>> ?
>>
>> I'd be very grateful if you can point me to an example for link
>> extraction and redirects.
>> Shall I use them against the XML dump or as a bot against api.wikimedia?
>> I would like to work offline, but mwparserfromhell seems to work online
>> against api.wikipedia..
>>
>> Where is the documentation for scripts on mediawiki.org?
>> https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3ASearch&go=Go
>>
>> thank you!
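For what it's worth: in wikitext, [[Target|Label]] links to the page "Target" while displaying "Label", so the second token is just the display text. If I read the mwparserfromhell docs right, its Wikilink nodes expose these as the `title` and `text` attributes. A stdlib-only sketch of splitting the pipe, in case the library isn't at hand:

```python
import re

# Split a raw [[target|label]] wikilink into its target page and
# optional display label (label is None when there is no pipe).
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def split_wikilink(link):
    m = WIKILINK.match(link)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_wikilink("[[Gunther Stent|Stent, Gunther Siegmund]]"))
# -> ('Gunther Stent', 'Stent, Gunther Siegmund')
print(split_wikilink("[[DNA]]"))
# -> ('DNA', None)
```

So for your example, 'Gunther Stent' is the page to fetch and 'Stent, Gunther Siegmund' is only what the article showed inline.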
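On the KeyError: if I'm reading the query API right, a title that does not exist comes back as a page entry carrying a "missing" flag and no "revisions" key at all, which is what parse() above trips over - so it is a missing page, not a redirect. A defensive lookup might look like this (the response shapes are stubs based on my reading of the API, so treat them as an assumption):

```python
def extract_text(res):
    """Pull revision wikitext out of an API 'query' response,
    returning None for missing pages instead of raising KeyError."""
    page = list(res["query"]["pages"].values())[0]
    if "missing" in page or "revisions" not in page:
        return None
    return page["revisions"][0]["*"]

# Stubbed responses shaped like the API's answers:
missing = {"query": {"pages": {"-1": {"title": "Stent, Gunther Siegmund",
                                      "missing": ""}}}}
present = {"query": {"pages": {"167": {"title": "Gunther Stent",
                                       "revisions": [{"*": "'''Gunther Stent''' was..."}]}}}}

print(extract_text(missing))   # -> None
print(extract_text(present))   # -> "'''Gunther Stent''' was..."
```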
>> On Mon, Jan 18, 2016 at 8:05 PM, Morten Wang <[email protected]> wrote:
>>
>>> An alternative is Aaron Halfaker's mediawiki-utilities
>>> (https://pypi.python.org/pypi/mediawiki-utilities) and mwparserfromhell
>>> (https://github.com/earwig/mwparserfromhell) to parse the wikitext and
>>> extract the links; the latter is already a part of pywikibot, though.
>>>
>>> Cheers,
>>> Morten
>>>
>>> On 18 January 2016 at 10:45, Amir Ladsgroup <[email protected]> wrote:
>>>
>>>> Hey,
>>>> There is a really good module implemented in pywikibot called xmlreader.py
>>>> <https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py>.
>>>> There is also help built from the source code:
>>>> <https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader>
>>>> You can read the source code and write your own script. Some scripts also
>>>> support xmlreader; read the manual for them on mediawiki.org.
>>>>
>>>> Best
>>>>
>>>> On Mon, Jan 18, 2016 at 10:00 PM Luigi Assom <[email protected]> wrote:
>>>>
>>>>> hello hello!
>>>>> About the use of pywikibot:
>>>>> is it possible to use it to parse the XML dump?
>>>>>
>>>>> I am interested in extracting links from pages (internal, external,
>>>>> with a distinction for the ones belonging to a category).
>>>>> I also would like to handle transitive redirects.
>>>>> I would like to process the dump without accessing the wiki, or else
>>>>> access the wiki with proper limits, in batches.
>>>>>
>>>>> Is there maybe something in the package already taking care of this?
>>>>> I've seen in https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts
>>>>> there is a "ghost" extracting_links.py script.
>>>>> I wanted to ask before re-inventing the wheel, and whether pywikibot is
>>>>> a suitable tool for the purpose.
>>>>>
>>>>> Thank you,
>>>>> L.
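The recursive redirect handling asked about above can be sketched with nothing but a text fetcher and a visited set to guard against redirect loops; `get_wikitext` here is a hypothetical stand-in for whatever maps a title to its wikitext (the API call earlier in the thread, or a dump index):

```python
import re

# Matches "#REDIRECT [[Target]]" at the start of a page's wikitext.
REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^|\]]+)\]\]", re.IGNORECASE)

def resolve_redirect(title, get_wikitext):
    """Follow #REDIRECT pages until a non-redirect title is reached.

    get_wikitext is any callable mapping a title to its wikitext
    (API call, dump index, ...). Raises ValueError on redirect cycles.
    """
    seen = set()
    while True:
        if title in seen:
            raise ValueError("redirect cycle through %r" % title)
        seen.add(title)
        m = REDIRECT.match(get_wikitext(title))
        if m is None:
            return title
        title = m.group(1)

# A toy wiki with a transitive redirect: 'dna' -> 'Dna' -> 'DNA'.
pages = {"dna": "#REDIRECT [[Dna]]",
         "Dna": "#REDIRECT [[DNA]]",
         "DNA": "'''DNA''' is a molecule..."}
print(resolve_redirect("dna", pages.get))
# -> 'DNA'
```

(If I recall correctly, the query API also accepts a "redirects" parameter that resolves redirects server-side, which may save doing this client-side at all.)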
_______________________________________________
pywikibot mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot
