Re: [mwlib] Using mwlib for parsing wikitext (but not PDF generation)

Travis Briggs Thu, 26 Apr 2012 10:19:56 -0700

Hi Joel,

My needs are pretty simple. The basic 'algorithm' of what I want to do is to
identify section headers by their names:


# pseudocode: node.name is a placeholder for however the section title is exposed
if isinstance(node, Section) and node.name == "External Links":
    finish_node = node

Then, given the location in the document of a section header with a given
name, I want to take all the data in the document up to that point as plain
text. So, for:

"""
{{About|the rock band|their debut album|Foo Fighters (album)|the aerial
phenomenon|foo fighter}}
{{pp-move-indef}}
{{pp-semi|small=yes}}
{{Infobox musical artist
| name                = Foo Fighters
| image               = Foo Fighters 2007.jpg
| [...]}}

'''Foo Fighters''' is an<!--Awards don't belong here--> American
[[alternative rock]] band
"""

It becomes:

"Foo Fighters is an American alternative rock band"
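For illustration, here is a minimal sketch of that transformation using only the Python standard library. It does not use mwlib at all, and the function names (strip_templates, wikitext_to_plain) are made up for this example. It assumes templates can simply be dropped and that links, comments and quote markup are well formed; real wikitext has many more cases (tables, refs, files, nested links) that this ignores.

```python
import re


def strip_templates(text):
    """Drop {{...}} spans, including nested ones, by tracking brace depth."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i) and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:          # keep only characters outside any template
                out.append(text[i])
            i += 1
    return "".join(out)


def wikitext_to_plain(text):
    """Very rough wikitext-to-plain-text conversion (hypothetical helper)."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)  # HTML comments
    text = strip_templates(text)                             # {{templates}}
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                        # ''italic'' / '''bold'''
    return " ".join(text.split())                            # collapse whitespace


SAMPLE = """
{{About|the rock band|their debut album|Foo Fighters (album)|the aerial
phenomenon|foo fighter}}
{{pp-move-indef}}
{{Infobox musical artist
| name                = Foo Fighters
| image               = Foo Fighters 2007.jpg
}}

'''Foo Fighters''' is an<!--Awards don't belong here--> American
[[alternative rock]] band
"""

print(wikitext_to_plain(SAMPLE))
# prints: Foo Fighters is an American alternative rock band
```

Comments are removed before templates and quote markup so that an apostrophe or brace inside a comment cannot confuse the later steps.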

I tried using uparser.simple_parse, but the results of the Article.asText()
calls were very disappointing. mwlib.refine.compat.parse_text seems to give
much better results, but the infobox and other templates are still stuck in
the text.

And of course, my pseudocode is wrong: I still need to figure out how to
identify Sections with a certain name, and then collect the nodes between
the head of the document and that node.
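The traversal itself could look roughly like the sketch below. The Node, Text and Section classes here are stand-ins written for this example, not mwlib's own classes, and the attribute names (children, caption, title) are assumptions; mapping them onto the tree that mwlib actually returns is untested. The idea is a depth-first walk in document order that gathers text and stops at the first section with the wanted title.

```python
class Node:
    """Stand-in tree node (hypothetical, not mwlib's class)."""

    def __init__(self, children=None):
        self.children = list(children or [])


class Text(Node):
    """Leaf node holding a run of plain text in `caption`."""

    def __init__(self, caption):
        super().__init__()
        self.caption = caption


class Section(Node):
    """Section node; `title` is an assumed attribute for the heading text."""

    def __init__(self, title, children=None):
        super().__init__(children)
        self.title = title


def text_before_section(root, section_title):
    """Join Text captions in document order, stopping at the first
    Section whose title equals section_title."""
    pieces = []

    def walk(node):
        # Returns True once the target section is reached, so that every
        # enclosing call stops descending as well.
        if isinstance(node, Section) and node.title == section_title:
            return True
        if isinstance(node, Text):
            pieces.append(node.caption)
        for child in node.children:
            if walk(child):
                return True
        return False

    walk(root)
    return "".join(pieces)


article = Node([
    Text("Foo Fighters is an American alternative rock band. "),
    Section("History", [Text("Some history. ")]),
    Section("External Links", [Text("Official website")]),
])

print(text_before_section(article, "External Links"))
```

Section titles are deliberately left out of the collected text here; if the target section is never found, the whole document's text is returned.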

All help is greatly appreciated, thanks,
-Travis

On 25 April 2012 19:43, Joel Nothman <jnoth...@student.usyd.edu.au> wrote:

>
> I have been using mwlib for exactly that since 2008, but I haven't checked
> if my scripts work with a more recent version of mwlib. (I mostly use
> mwlib.refine.compat.parse_text.)
>
> I and others may be able to help you with more detail if you give us some
> idea what you would like to get out of the parse. For instance I needed
> standard structured Wikipedia features (category links, template
> information, etc.) as well as tokenised sentences with outgoing links as
> standoff annotations.
>
> - Joel
>
>
> On Thu, 26 Apr 2012 02:35:44 +1000, Travis Briggs <tra...@echonest.com>
> wrote:
>
>> Hello,
>>
>> Is there a way to get an abstract syntax tree from wikitext input
>> using mwlib? The documentation seems to only cover creating PDF or
>> some other documents.
>>
>> Thanks,
>> -Travis
>>
>

