Here's an example using regular expressions and `mwxml` (a new offshoot of the mediawiki-utilities package referenced above): https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
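[A minimal sketch of the same idea: a regex-based extractor of internal link targets that can be applied to each revision's wikitext. The sample string and the dump filename are invented, and the `mwxml` loop in the trailing comment is not run here.]

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the link target.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")

def extract_wikilinks(text):
    """Return the internal link targets found in a chunk of wikitext."""
    return [t.strip() for t in WIKILINK_RE.findall(text or "")]

print(extract_wikilinks(
    "See [[Python (programming language)|Python]] and [[Category:Snakes]]."
))
# → ['Python (programming language)', 'Category:Snakes']

# With mwxml, the same function can be applied to every revision in a dump
# (not run here; the filename is a placeholder):
#
#   import mwxml
#   dump = mwxml.Dump.from_file(open("enwiki-pages-articles.xml"))
#   for page in dump:
#       for revision in page:
#           links = extract_wikilinks(revision.text)
```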
The example extracts image links from English Wikipedia, but I imagine it would work for you with little modification.

-Aaron

On Mon, Jan 18, 2016 at 6:23 PM, Luigi Assom <[email protected]> wrote:

> Hi, thank you.
>
> Where can I find documentation with an example of extracting links:
> https://github.com/earwig/mwparserfromhell
> or
> https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py ?
>
> I'd be very grateful if you could point me to an example of link
> extraction and redirect handling.
> Should I use them against the XML dump, or as a bot against api.wikimedia?
> I would like to work offline, but mwparserfromhell seems to work online
> against api.wikipedia.
>
> Where is the documentation for the scripts on mediawiki.org?
> https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3ASearch&go=Go
>
> Thank you!
>
> On Mon, Jan 18, 2016 at 8:05 PM, Morten Wang <[email protected]> wrote:
>
>> An alternative is Aaron Halfaker's mediawiki-utilities
>> (https://pypi.python.org/pypi/mediawiki-utilities) together with
>> mwparserfromhell (https://github.com/earwig/mwparserfromhell) to parse
>> the wikitext and extract the links; the latter is already part of
>> pywikibot, though.
>>
>> Cheers,
>> Morten
>>
>> On 18 January 2016 at 10:45, Amir Ladsgroup <[email protected]> wrote:
>>
>>> Hey,
>>> There is a really good module implemented in pywikibot called xmlreader.py
>>> <https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py>.
>>> There is also API documentation generated from the source code:
>>> <https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader>
>>> You can read the source code and write your own script. Some scripts
>>> also support xmlreader; read their manuals on mediawiki.org.
>>>
>>> Best
>>>
>>> On Mon, Jan 18, 2016 at 10:00 PM Luigi Assom <[email protected]> wrote:
>>>
>>>> Hello hello!
>>>> About the use of pywikibot:
>>>> is it possible to use it to parse the XML dump?
>>>>
>>>> I am interested in extracting links from pages (internal and external,
>>>> distinguishing those that belong to a category).
>>>> I would also like to handle transitive redirects.
>>>> I would like to process the dump without accessing the wiki, or else
>>>> access the wiki in batches with proper rate limits.
>>>>
>>>> Is there maybe something in the package that already takes care of this?
>>>> In https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts I've seen
>>>> that there is a "ghost" extracting_links.py script.
>>>> I wanted to ask before re-inventing the wheel, and whether pywikibot
>>>> is a suitable tool for the purpose.
>>>>
>>>> Thank you,
>>>> L.
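[For the archives: since the question was specifically about offline dump processing, here is a dependency-free sketch using only the Python standard library. The element layout follows the MediaWiki XML export format (the schema version in the namespace URI varies by dump); the embedded sample page is invented, and the pywikibot xmlreader variant in the trailing comment is untested.]

```python
import io
import re
import xml.etree.ElementTree as ET

# Namespace used by MediaWiki XML dumps (version varies by dump release).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# A tiny stand-in for a real dump file, mimicking the export layout.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>See [[Python (programming language)|Python]] and [[Category:Snakes]].</text>
    </revision>
  </page>
</mediawiki>"""

# Matches [[Target]] and [[Target|label]]; captures only the link target.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")

def iter_page_links(xml_file):
    """Stream pages from a dump, yielding (title, [link targets]) pairs."""
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, WIKILINK_RE.findall(text)
            elem.clear()  # free memory as we go; real dumps are huge

for title, links in iter_page_links(io.StringIO(SAMPLE_DUMP)):
    print(title, links)

# The equivalent loop with pywikibot's xmlreader (not run here; needs
# pywikibot installed and the filename is a placeholder):
#
#   from pywikibot import xmlreader
#   for entry in xmlreader.XmlDump("enwiki-pages-articles.xml").parse():
#       links = WIKILINK_RE.findall(entry.text)
```

For a real multi-gigabyte dump, pass an open file object instead of the `StringIO` wrapper; `iterparse` never loads the whole tree into memory.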
_______________________________________________
pywikibot mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot
