Hi,
I am trying to promote the use of mwlib in applications which process Wikipedia, Wiktionary, etc. for natural language processing purposes (e.g. it is used in my papers at http://www.joelnothman.com/research/). Python is a popular language for NLP (e.g. http://www.nltk.org/); Wikipedia has lately become a very important source of language and world knowledge; and mwlib can provide a fairly accurate parse tree of MW markup, while quality assurance is left to PediaPress, so we researchers can focus on language technology (and occasionally push back to the mwlib tip).

NLP researchers would mostly want to use mwlib as a parser, and then process/extract elements in the parse tree, or convert it to cleaned paragraphs of text. The fact that people want to use mwlib for things other than publishing books means that the API needs to be kept clean and fairly stable. It would also be nice to get occasional changelogs so that we know when to update our working copies.

Unfortunately, things in the API seem to be getting messier.
In an old checkout of mwlib, I can do:

    >>> from mwlib import uparser
    >>> dir(uparser.simpleparse(''))
    Article 'unknown': 0 children
    ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',
     '__getattribute__', '__hash__', '__init__', '__iter__', '__module__',
     '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
     '__setattr__', '__str__', '__weakref__', '_asText', 'allchildren',
     'append', 'asText', 'caption', 'children', 'filter', 'find',
     'hasContent', 'show']

At the tip, I now get:

    ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',
     '__getattribute__', '__hash__', '__init__', '__iter__', '__module__',
     '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
     '__setattr__', '__str__', '__weakref__', '_asText', '_get_text',
     '_set_text', '_text', 'align', 'allchildren', 'asText', 'blocknode',
     'caption', 'children', 'filter', 'find', 'frame', 'interwiki',
     'join_as_text', 'langlink', 'len', 'level', 'lineprefix', 'namespace',
     'ns', 'rawtagname', 'show', 'source', 'start', 't_2box_close',
     't_2box_open', 't_begin_table', 't_begintable', 't_break', 't_colon',
     't_column', 't_comment', 't_complex_article', 't_complex_caption',
     't_complex_compat', 't_complex_indent', 't_complex_line',
     't_complex_link', 't_complex_named_url', 't_complex_node',
     't_complex_preformatted', 't_complex_section', 't_complex_style',
     't_complex_table', 't_complex_table_cell', 't_complex_table_row',
     't_complex_tag', 't_end', 't_end_table', 't_endsection', 't_endtable',
     't_entity', 't_hrule', 't_html_tag', 't_html_tag_end', 't_http_url',
     't_item', 't_magicword', 't_newline', 't_pre', 't_row', 't_section',
     't_section_end', 't_semicolon', 't_singlequote', 't_special',
     't_tablecaption', 't_text', 't_uniq', 't_urllink', 't_vlist',
     'tagname', 'target', 'text', 'thumb', 'token2name', 'type', 'vlist']

Why does every parse tree node need all these attributes? Can we clean this up a little to make mwlib parse trees simpler to work with?
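For anyone who wants to compare two such dir() listings without eyeballing them, here is a rough sketch. The helper names (public_names, compare_listings) are my own and not part of mwlib, and the two lists are short excerpts copied from the output above rather than the full listings.

```python
def public_names(names):
    """Drop Python's double-underscore names, which every object carries."""
    return {n for n in names if not (n.startswith("__") and n.endswith("__"))}


def compare_listings(old, new):
    """Return (added, removed) attribute names between two dir() listings."""
    old_names = public_names(old)
    new_names = public_names(new)
    return sorted(new_names - old_names), sorted(old_names - new_names)


# Excerpts only (assumption: chosen by hand for illustration).
old_excerpt = ["__class__", "_asText", "allchildren", "append", "asText",
               "caption", "children", "filter", "find", "hasContent", "show"]
new_excerpt = ["__class__", "_asText", "_get_text", "allchildren", "asText",
               "caption", "children", "filter", "find", "show",
               "t_section", "t_text", "tagname"]

added, removed = compare_listings(old_excerpt, new_excerpt)
print("added:  ", added)
print("removed:", removed)
```

Besides the many new names, this also shows that 'append' and 'hasContent' are in the old listing but not in the one from the tip, which is the kind of change a changelog would help us track.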
- Joel
