Re: Parsing (well, lexing, really) wikipedia markup

Brian Clapper Thu, 09 Jun 2011 04:39:18 -0700

On 06/09/2011 03:58 AM, A.T.Hofkamp wrote:

On 09/06/11 00:20, Sam Denton wrote:

I'm wanting to parse some Wikipedia pages.
Wikipedia template data looks like this: {{my template|arg one|arg
two|keyword=value}}
In a template definition, you can use variable expansion, like this:
{{{1|default for arg one}}}
I defined my lexer to grab runs of '{' and '}' and return different tokens
depending on the length of the run.
My problem is, I'm hitting cases where a template's name is a variable
expansion, resulting in: {{{{{keyword}}}|arg one}}


If this is the only way they can be nested, you can use scanner states: define
a scanner state 'outside template' that matches only {{. On encountering {{,
switch to an 'inside template' scanner state that matches only {{{. On
encountering }}, switch back to the 'outside template' scanner state.
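
The two-state idea can be sketched in plain Python with the standard `re`
module. This is only an illustration, not PLY code: the token names, the
`tokenize` function, and the extra `}}}` rule in the inside state are
assumptions made for the example, and it handles just the single level of
nesting described above. PLY has its own mechanism for this (a `states`
declaration plus `lexer.begin()`), and the same structure should carry over.

```python
import re

# Patterns for each scanner state. Token names are illustrative only.
# Outside a template, only '{{' is special; a lone '{' is ordinary text.
OUTSIDE = re.compile(r"(?P<TEMPLATE_OPEN>\{\{)|(?P<TEXT>[^{]+|\{)")

# Inside a template, '{{{' opens a variable expansion. '}}}' is tried
# before '}}' so that a variable closes before its enclosing template.
INSIDE = re.compile(
    r"(?P<VAR_OPEN>\{\{\{)"
    r"|(?P<VAR_CLOSE>\}\}\})"
    r"|(?P<TEMPLATE_CLOSE>\}\})"
    r"|(?P<PIPE>\|)"
    r"|(?P<TEXT>[^{}|]+|[{}])"
)


def tokenize(text):
    """Return (token_name, lexeme) pairs, switching state on {{ and }}."""
    state, pos, tokens = "outside", 0, []
    while pos < len(text):
        pattern = OUTSIDE if state == "outside" else INSIDE
        match = pattern.match(text, pos)  # every character matches some rule
        kind = match.lastgroup
        tokens.append((kind, match.group()))
        if kind == "TEMPLATE_OPEN":
            state = "inside"
        elif kind == "TEMPLATE_CLOSE":
            state = "outside"
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("{{{{{keyword}}}|arg one}}"):
        print(token)
```

With this split, the leading run of five braces is read as {{ followed by
{{{, rather than as a single run whose length has to be interpreted.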

An alternative solution would be to use a scannerless parser. I am, however,
not sure whether any exist for Python.


Also, have you investigated the tools listed on this page?

http://www.mediawiki.org/wiki/Alternative_parsers

There are several Python solutions listed.
--
-Brian

Brian Clapper, http://www.clapper.org/bmc/
Weiler's Law:
        Nothing is impossible for the man who doesn't have to do it himself.
