[mwlib] mwlib for NLP, and cleaning up the API

Joel Nothman Sun, 09 Aug 2009 23:26:40 -0700

Hi,


I am trying to promote the use of mwlib in applications that process
Wikipedia, Wiktionary, etc. for natural language processing purposes (for
example, I use it in the papers listed at http://www.joelnothman.com/research/).

Python is a popular language for NLP (e.g. http://www.nltk.org/), and
Wikipedia has lately become a very important source of language and world
knowledge. mwlib can provide a fairly accurate parse tree of MW markup,
while quality assurance is left to PediaPress, so we researchers can focus
on language technology (and occasionally push fixes back to the mwlib tip).

NLP researchers would mostly want to use mwlib as a parser, and then  
process/extract elements in the parse tree, or convert it to cleaned  
paragraphs of text.

The fact that people want to use mwlib for things other than publishing  
books means that the API needs to be kept clean and fairly stable. It  
would be nice to get occasional changelogs so that we know when to update  
our working copies.

Unfortunately, things in the API seem to be getting messier. In an old  
checkout of mwlib, I can do:

>>> from mwlib import uparser
>>> dir(uparser.simpleparse(''))
  Article 'unknown': 0 children
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',  
'__getattribute__', '__hash__', '__init__', '__iter__', '__module__',  
'__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',  
'__setattr__', '__str__', '__weakref__', '_asText', 'allchildren',  
'append', 'asText', 'caption', 'children', 'filter', 'find', 'hasContent',  
'show']

At the tip, I now get:
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',  
'__getattribute__', '__hash__', '__init__', '__iter__', '__module__',  
'__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',  
'__setattr__', '__str__', '__weakref__', '_asText', '_get_text',  
'_set_text', '_text', 'align', 'allchildren', 'asText', 'blocknode',  
'caption', 'children', 'filter', 'find', 'frame', 'interwiki',  
'join_as_text', 'langlink', 'len', 'level', 'lineprefix', 'namespace',  
'ns', 'rawtagname', 'show', 'source', 'start', 't_2box_close',  
't_2box_open', 't_begin_table', 't_begintable', 't_break', 't_colon',  
't_column', 't_comment', 't_complex_article', 't_complex_caption',  
't_complex_compat', 't_complex_indent', 't_complex_line',  
't_complex_link', 't_complex_named_url', 't_complex_node',  
't_complex_preformatted', 't_complex_section', 't_complex_style',  
't_complex_table', 't_complex_table_cell', 't_complex_table_row',  
't_complex_tag', 't_end', 't_end_table', 't_endsection', 't_endtable',  
't_entity', 't_hrule', 't_html_tag', 't_html_tag_end', 't_http_url',  
't_item', 't_magicword', 't_newline', 't_pre', 't_row', 't_section',  
't_section_end', 't_semicolon', 't_singlequote', 't_special',  
't_tablecaption', 't_text', 't_uniq', 't_urllink', 't_vlist', 'tagname',  
'target', 'text', 'thumb', 'token2name', 'type', 'vlist']
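
The two listings can be compared mechanically. The sketch below is plain
Python and does not need mwlib: the OLD and NEW lists are copied from the
dir() output above (with the double-underscore names left out, since they are
identical in both), and the helper names `public_names` and `compare` are made
up for this example.

```python
def public_names(names):
    """Drop __dunder__ names so raw dir() output can be passed in as well."""
    return {n for n in names if not (n.startswith("__") and n.endswith("__"))}


def compare(old, new):
    """Return (removed, added) attribute names as sorted lists."""
    old, new = public_names(old), public_names(new)
    return sorted(old - new), sorted(new - old)


# Non-dunder names from the old checkout.
OLD = """
_asText allchildren append asText caption children filter find hasContent
show
""".split()

# Non-dunder names from the tip.
NEW = """
_asText _get_text _set_text _text align allchildren asText blocknode
caption children filter find frame interwiki join_as_text langlink len
level lineprefix namespace ns rawtagname show source start t_2box_close
t_2box_open t_begin_table t_begintable t_break t_colon t_column t_comment
t_complex_article t_complex_caption t_complex_compat t_complex_indent
t_complex_line t_complex_link t_complex_named_url t_complex_node
t_complex_preformatted t_complex_section t_complex_style t_complex_table
t_complex_table_cell t_complex_table_row t_complex_tag t_end t_end_table
t_endsection t_endtable t_entity t_hrule t_html_tag t_html_tag_end
t_http_url t_item t_magicword t_newline t_pre t_row t_section
t_section_end t_semicolon t_singlequote t_special t_tablecaption t_text
t_uniq t_urllink t_vlist tagname target text thumb token2name type vlist
""".split()

removed, added = compare(OLD, NEW)

# Split the additions by the "t_" prefix, which accounts for most of them.
t_prefixed = [n for n in added if n.startswith("t_")]
other = [n for n in added if not n.startswith("t_")]

print("removed:", removed)  # ['append', 'hasContent']
print("added with t_ prefix:", len(t_prefixed))
print("other added names:", other)
```

Besides the additions, the comparison shows that `append` and `hasContent`
are no longer present at the tip.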

Why does every parse tree node need all these attributes? Can we clean  
this up a little to make mwlib parse trees simpler to work with?

- Joel
