Hi all, and thanks for your suggestions. I will have a look at all of them; I started with 'mwparserfromhell' - helluva name!
I found my way with:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    import mwparserfromhell

    API_URL = "https://en.wikipedia.org/w/api.php"

    def parse(title):
        data = {"action": "query", "prop": "revisions", "rvlimit": 1,
                "rvprop": "content", "format": "json", "titles": title}
        raw = urlopen(API_URL, urlencode(data).encode()).read()
        res = json.loads(raw)
        text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"]
        return mwparserfromhell.parse(text)

    test = parse('DNA')
    # and then test.filter_wikilinks()

Some links are like [[Gunther Stent|Stent, Gunther Siegmund]], i.e. with a '|' in the middle. The first part is the canonical form, but what does the second token represent?

E.g. if I try:

    parse('Stent, Gunther Siegmund')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in parse
    KeyError: 'revisions'

What does this error mean? Is it a redirect?

A few more questions about these tools:

1. Where can I find documentation about the use of methods in mwparserfromhell? E.g. the filter_wikilinks() method takes arguments - which ones can I use? I could not find much here:
http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html?highlight=filter_templates#mwparserfromhell.wikicode.Wikicode.filter_templates

2. I would like to use it with a generator over a dump: I understand this module would do the job of fetching pages and piping them to mwh [aka mwparserfromhell :)]:
https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
Correct? Any docs around for this as well?

3. How to handle redirects AND/OR curid? As an example, dbpedia analysed recursive redirects; they call it a "transitive redirect". Is there any module to handle transitive redirects / to handle redirects (I would do it recursively)? E.g.
if I use the example above:

    parse('dna')
    # u'#REDIRECT [[DNA]] {{R from other capitalisation}}'

I would like to obtain 'DNA' directly, or even better a module returning the _ID of the page (so far I build it myself with a dictionary; I'd like to ask the MW team if they can already suggest some tool to handle recursive redirects more efficiently).

On Wed, Jan 20, 2016 at 8:27 AM, Morten Wang <[email protected]> wrote:

> Luigi,
>
> Here's an example where I use mwparserfromhell to extract links, see the
> analyse() method, particularly lines 24 and 36–44:
> https://github.com/nettrom/Wiki-Class/blob/master/wikiclass/features/metrics/wikitext.py
>
> You can download the dumps and use Aaron's mwxml library example to
> process them, for instance by modifying the code so that it uses
> mwparserfromhell to parse the revision text (although that requires far
> more processing time) instead of regular expressions.
>
> Cheers,
> Morten
>
> On 18 January 2016 at 16:23, Luigi Assom <[email protected]> wrote:
>
>> hi, thank you.
>>
>> Where can I find documentation for an example to extract links -
>> https://github.com/earwig/mwparserfromhell
>> or
>> https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py
>> ?
>>
>> I'd be very grateful if you can point me to an example for link
>> extraction and redirects.
>> Shall I use them against the XML dump or as a bot against api.wikimedia?
>> I would like to work offline, but mwparserfromhell seems to work online
>> against api.wikipedia..
>>
>> Where is the documentation for scripts on mediawiki.org?
>> https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3ASearch&go=Go
>>
>> thank you!
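For what it's worth: in wikitext, [[Target|Label]] links to the page "Target" while displaying "Label", so the second token is just the display text. If I read the mwparserfromhell docs right, its Wikilink nodes expose these as the `title` and `text` attributes. A stdlib-only sketch of splitting the pipe, in case the library isn't at hand:

```python
import re

# Split a raw [[target|label]] wikilink into its target page and
# optional display label (label is None when there is no pipe).
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def split_wikilink(link):
    m = WIKILINK.match(link)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_wikilink("[[Gunther Stent|Stent, Gunther Siegmund]]"))
# -> ('Gunther Stent', 'Stent, Gunther Siegmund')
print(split_wikilink("[[DNA]]"))
# -> ('DNA', None)
```

So for your example, 'Gunther Stent' is the page to fetch and 'Stent, Gunther Siegmund' is only what the article showed inline.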
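On the KeyError: if I'm reading the query API right, a title that does not exist comes back as a page entry carrying a "missing" flag and no "revisions" key at all, which is what parse() above trips over - so it is a missing page, not a redirect. A defensive lookup might look like this (the response shapes are stubs based on my reading of the API, so treat them as an assumption):

```python
def extract_text(res):
    """Pull revision wikitext out of an API 'query' response,
    returning None for missing pages instead of raising KeyError."""
    page = list(res["query"]["pages"].values())[0]
    if "missing" in page or "revisions" not in page:
        return None
    return page["revisions"][0]["*"]

# Stubbed responses shaped like the API's answers:
missing = {"query": {"pages": {"-1": {"title": "Stent, Gunther Siegmund",
                                      "missing": ""}}}}
present = {"query": {"pages": {"167": {"title": "Gunther Stent",
                                       "revisions": [{"*": "'''Gunther Stent''' was..."}]}}}}

print(extract_text(missing))   # -> None
print(extract_text(present))   # -> "'''Gunther Stent''' was..."
```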
>> On Mon, Jan 18, 2016 at 8:05 PM, Morten Wang <[email protected]> wrote:
>>
>>> An alternative is Aaron Halfaker's mediawiki-utilities
>>> (https://pypi.python.org/pypi/mediawiki-utilities) and mwparserfromhell
>>> (https://github.com/earwig/mwparserfromhell) to parse the wikitext and
>>> extract the links; the latter is already a part of pywikibot, though.
>>>
>>> Cheers,
>>> Morten
>>>
>>> On 18 January 2016 at 10:45, Amir Ladsgroup <[email protected]> wrote:
>>>
>>>> Hey,
>>>> There is a really good module implemented in pywikibot called xmlreader.py
>>>> <https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py>.
>>>> There is also help built from the source code:
>>>> <https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader>
>>>> You can read the source code and write your own script. Some scripts also
>>>> support xmlreader; read the manual for them on mediawiki.org.
>>>>
>>>> Best
>>>>
>>>> On Mon, Jan 18, 2016 at 10:00 PM Luigi Assom <[email protected]> wrote:
>>>>
>>>>> hello hello!
>>>>> About the use of pywikibot:
>>>>> is it possible to use it to parse the XML dump?
>>>>>
>>>>> I am interested in extracting links from pages (internal, external,
>>>>> with a distinction for the ones belonging to a category).
>>>>> I also would like to handle transitive redirects.
>>>>> I would like to process the dump without accessing the wiki, or else
>>>>> access the wiki with proper limits, in batches.
>>>>>
>>>>> Is there maybe something in the package already taking care of this?
>>>>> I've seen in https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts
>>>>> there is a "ghost" extracting_links.py script.
>>>>> I wanted to ask before re-inventing the wheel, and whether pywikibot is
>>>>> a suitable tool for the purpose.
>>>>>
>>>>> Thank you,
>>>>> L.
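The recursive redirect handling asked about above can be sketched with nothing but a text fetcher and a visited set to guard against redirect loops; `get_wikitext` here is a hypothetical stand-in for whatever maps a title to its wikitext (the API call earlier in the thread, or a dump index):

```python
import re

# Matches "#REDIRECT [[Target]]" at the start of a page's wikitext.
REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^|\]]+)\]\]", re.IGNORECASE)

def resolve_redirect(title, get_wikitext):
    """Follow #REDIRECT pages until a non-redirect title is reached.

    get_wikitext is any callable mapping a title to its wikitext
    (API call, dump index, ...). Raises ValueError on redirect cycles.
    """
    seen = set()
    while True:
        if title in seen:
            raise ValueError("redirect cycle through %r" % title)
        seen.add(title)
        m = REDIRECT.match(get_wikitext(title))
        if m is None:
            return title
        title = m.group(1)

# A toy wiki with a transitive redirect: 'dna' -> 'Dna' -> 'DNA'.
pages = {"dna": "#REDIRECT [[Dna]]",
         "Dna": "#REDIRECT [[DNA]]",
         "DNA": "'''DNA''' is a molecule..."}
print(resolve_redirect("dna", pages.get))
# -> 'DNA'
```

(If I recall correctly, the query API also accepts a "redirects" parameter that resolves redirects server-side, which may save doing this client-side at all.)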
_______________________________________________
pywikibot mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot
