I am writing a PyGTK application.  I would like to download only the text 
(with formatting) of a Wikipedia page and display it in my application.  I 
think that I am close to a solution, but I have reached an impasse because I 
am unfamiliar with most of the MediaWiki API.

My plan has been to use GtkMozembed in my application to render the page, so I 
need to retrieve HTML.  What is close to working is to request index.php 
with action=render and title=<search string for the Wikipedia page>.  The 
data that I retrieve does display in my browser, but it has the following 
undesired characteristics:

1. All images appear (I want none).
2. There are sections at the end that I don't want (Further reading, External 
links, Notes, See also, References).
3. Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ 
amaˈdeus ˈmoːtsart]).
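
For reference, the request described above can be sketched roughly as follows. This is only an illustration in Python 3: the host (en.wikipedia.org), the /w/index.php path, the User-Agent string, and the helper names build_render_url and fetch_rendered_html are all assumptions made for the example, not something the API requires in this exact form.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia's index.php entry point.
INDEX_PHP = "https://en.wikipedia.org/w/index.php"


def build_render_url(title):
    """Return the index.php URL that asks for action=render on one page."""
    query = urlencode({"title": title, "action": "render"})
    return INDEX_PHP + "?" + query


def fetch_rendered_html(title, timeout=10):
    """Fetch the rendered page body and return it as raw bytes."""
    # Hypothetical User-Agent; replace it with one identifying your application.
    request = Request(build_render_url(title),
                      headers={"User-Agent": "ExamplePyGtkApp/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

The second function returns undecoded bytes on purpose, so that the character encoding can be handled as a separate step.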

To fix 1 and 2, I could perhaps use an HTML parser and delete the offending 
items, but I wonder whether there is a proper solution using the MediaWiki 
API (such as a prop parameter with which I could at least specify that I 
don't want any images).
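
The HTML-parser fallback mentioned above might look something like the sketch below, which uses only Python 3's standard html.parser module. It assumes simplified markup in which every top-level section starts with an <h2> whose text is exactly the section title; real Wikipedia output wraps headings and images in extra elements, so treat this as a starting point. The names UNWANTED, ArticleFilter, and strip_article are made up for the example.

```python
from html.parser import HTMLParser

# Section titles to drop, compared case-insensitively (taken from item 2).
UNWANTED = {"further reading", "external links", "notes", "see also",
            "references"}


class ArticleFilter(HTMLParser):
    """Re-emit HTML, dropping <img> tags and whole unwanted <h2> sections."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []            # markup that is kept
        self.heading = None      # buffered markup of the <h2> being read
        self.heading_text = []   # text seen inside that <h2>
        self.skipping = False    # True while inside an unwanted section

    def _emit(self, markup):
        if self.heading is not None:
            self.heading.append(markup)   # decide later, at </h2>
        elif not self.skipping:
            self.out.append(markup)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            return                        # drop every image
        if tag == "h2":
            self.heading, self.heading_text = [], []
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "img":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit("</%s>" % tag)
        if tag == "h2" and self.heading is not None:
            title = "".join(self.heading_text).strip().lower()
            self.skipping = title in UNWANTED
            heading, self.heading = self.heading, None
            if not self.skipping:
                self.out.extend(heading)

    def handle_data(self, data):
        if self.heading is not None:
            self.heading_text.append(data)
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)         # pass entities through unchanged

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def strip_article(html_text):
    """Return html_text without images and without the unwanted sections."""
    parser = ArticleFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Comments in the input are silently discarded, and removing an <img> can leave an empty link or caption behind, so some further cleanup would probably still be needed.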

I assume that 3 is a Unicode problem, but I don't know what to do to fix it.
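
One possible cause, which is only a guess, is that the retrieved data is a bare HTML fragment with no charset declaration, so the renderer has to guess the encoding. A minimal sketch of a workaround, assuming the bytes are UTF-8, is to decode them explicitly and wrap the fragment in a document that declares its encoding. The helper name wrap_fragment and its title argument are invented for the example.

```python
import html


def wrap_fragment(raw_bytes, title="Article"):
    """Decode a UTF-8 HTML fragment and wrap it in a full HTML document."""
    # Assumption: the server sent UTF-8; this raises UnicodeDecodeError if not.
    fragment = raw_bytes.decode("utf-8")
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "<title>%s</title>"
        "</head><body>%s</body></html>" % (html.escape(title), fragment)
    )
```

When handing the result to the rendering widget, the string would need to be encoded back to UTF-8 bytes so that it matches the declared charset.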
-- 
Jeffrey Barish

_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
