Here's an example using regular expressions and `mwxml` (a new offshoot of the mediawiki-utilities package referenced above): https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
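[A minimal sketch of the same idea: a regex-based extractor of internal link targets that can be applied to each revision's wikitext. The sample string and the dump filename are invented, and the `mwxml` loop in the trailing comment is not run here.]

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the link target.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")

def extract_wikilinks(text):
    """Return the internal link targets found in a chunk of wikitext."""
    return [t.strip() for t in WIKILINK_RE.findall(text or "")]

print(extract_wikilinks(
    "See [[Python (programming language)|Python]] and [[Category:Snakes]]."
))
# → ['Python (programming language)', 'Category:Snakes']

# With mwxml, the same function can be applied to every revision in a dump
# (not run here; the filename is a placeholder):
#
#   import mwxml
#   dump = mwxml.Dump.from_file(open("enwiki-pages-articles.xml"))
#   for page in dump:
#       for revision in page:
#           links = extract_wikilinks(revision.text)
```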
The example extracts image links from English Wikipedia, but I imagine it would work for you with little modification.

-Aaron

On Mon, Jan 18, 2016 at 6:23 PM, Luigi Assom <[email protected]> wrote:

> Hi, thank you.
>
> Where can I find documentation with an example of extracting links:
> https://github.com/earwig/mwparserfromhell
> or
> https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py ?
>
> I'd be very grateful if you could point me to an example of link
> extraction and redirect handling.
> Should I use them against the XML dump, or as a bot against api.wikimedia?
> I would like to work offline, but mwparserfromhell seems to work online
> against api.wikipedia.
>
> Where is the documentation for the scripts on mediawiki.org?
> https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3ASearch&go=Go
>
> Thank you!
>
> On Mon, Jan 18, 2016 at 8:05 PM, Morten Wang <[email protected]> wrote:
>
>> An alternative is Aaron Halfaker's mediawiki-utilities
>> (https://pypi.python.org/pypi/mediawiki-utilities) together with
>> mwparserfromhell (https://github.com/earwig/mwparserfromhell) to parse
>> the wikitext and extract the links; the latter is already part of
>> pywikibot, though.
>>
>> Cheers,
>> Morten
>>
>> On 18 January 2016 at 10:45, Amir Ladsgroup <[email protected]> wrote:
>>
>>> Hey,
>>> There is a really good module implemented in pywikibot called xmlreader.py
>>> <https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py>.
>>> There is also API documentation generated from the source code:
>>> <https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader>
>>> You can read the source code and write your own script. Some scripts
>>> also support xmlreader; read their manuals on mediawiki.org.
>>>
>>> Best
>>>
>>> On Mon, Jan 18, 2016 at 10:00 PM Luigi Assom <[email protected]> wrote:
>>>
>>>> Hello hello!
>>>> About the use of pywikibot:
>>>> is it possible to use it to parse the XML dump?
>>>>
>>>> I am interested in extracting links from pages (internal and external,
>>>> distinguishing those that belong to a category).
>>>> I would also like to handle transitive redirects.
>>>> I would like to process the dump without accessing the wiki, or else
>>>> access the wiki in batches with proper rate limits.
>>>>
>>>> Is there maybe something in the package that already takes care of this?
>>>> In https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts I've seen
>>>> that there is a "ghost" extracting_links.py script.
>>>> I wanted to ask before re-inventing the wheel, and whether pywikibot
>>>> is a suitable tool for the purpose.
>>>>
>>>> Thank you,
>>>> L.
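[For the archives: since the question was specifically about offline dump processing, here is a dependency-free sketch using only the Python standard library. The element layout follows the MediaWiki XML export format (the schema version in the namespace URI varies by dump); the embedded sample page is invented, and the pywikibot xmlreader variant in the trailing comment is untested.]

```python
import io
import re
import xml.etree.ElementTree as ET

# Namespace used by MediaWiki XML dumps (version varies by dump release).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# A tiny stand-in for a real dump file, mimicking the export layout.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>See [[Python (programming language)|Python]] and [[Category:Snakes]].</text>
    </revision>
  </page>
</mediawiki>"""

# Matches [[Target]] and [[Target|label]]; captures only the link target.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")

def iter_page_links(xml_file):
    """Stream pages from a dump, yielding (title, [link targets]) pairs."""
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, WIKILINK_RE.findall(text)
            elem.clear()  # free memory as we go; real dumps are huge

for title, links in iter_page_links(io.StringIO(SAMPLE_DUMP)):
    print(title, links)

# The equivalent loop with pywikibot's xmlreader (not run here; needs
# pywikibot installed and the filename is a placeholder):
#
#   from pywikibot import xmlreader
#   for entry in xmlreader.XmlDump("enwiki-pages-articles.xml").parse():
#       links = WIKILINK_RE.findall(entry.text)
```

For a real multi-gigabyte dump, pass an open file object instead of the `StringIO` wrapper; `iterparse` never loads the whole tree into memory.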
_______________________________________________
pywikibot mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot
