On 26/11/2010 03:28, Joe Goldthwaite wrote:
> I’m attempting to parse some basic tagged markup. The TinyMCE editor
> returns a string that looks something like this:
>
> <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
> it</p><p>It can be made up of multiple lines separated by paragraph
> tags.</p>
>
> I’m trying to render the paragraph into a bitmapped image. I need
> to parse it out into the various paragraph and bold/italic pieces.
> I’m not sure of the best way to approach it. ElementTree and lxml seem
> to want a fully formatted page, not a small segment like this one.
> When I tried to feed a line similar to the above to lxml I got an
> error: “XMLSyntaxError: Extra content at the end of the document”.
>
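That error usually means the parser found more than one top-level
element: an XML document may have only one root, and your string has
two sibling <p> elements. If you do want a tree, one workaround is to
wrap the fragment in a dummy root before parsing. This is only a
sketch using the standard library's ElementTree; the "root" wrapper
name is my own invention, and it assumes the editor output is
well-formed XML (an HTML entity such as &nbsp; would make it fail).

```python
import xml.etree.ElementTree as ET

fragment = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
            "elements in it</p><p>It can be made up of multiple lines "
            "separated by paragraph tags.</p>")

# Wrap the fragment in a made-up <root> element so that the parser
# sees a single top-level element instead of two sibling <p> tags.
root = ET.fromstring("<root>" + fragment + "</root>")

paragraphs = root.findall("p")
for p in paragraphs:
    # p.text is the text before the first child element; each child's
    # .tail is the text that follows that child's closing tag.
    print(repr(p.text))
    for child in p:
        print("   ", child.tag, repr(child.text), repr(child.tail))
```

For markup this simple, though, a full parser may be more than you need.
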
I'd probably use a regex:
>>> import re
>>> text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
...         "elements in it</p><p>It can be made up of multiple lines "
...         "separated by paragraph tags.</p>")
>>> re.findall(r"</?\w+>|[^<>]+", text)
['<p>', 'This is a paragraph with ', '<b>', 'bold', '</b>', ' and ',
 '<i>', 'italic', '</i>', ' elements in it', '</p>', '<p>',
 'It can be made up of multiple lines separated by paragraph tags.',
 '</p>']
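
Since you want to render the pieces, you could then walk that token
list and keep track of which styles are currently switched on. The
following is a rough sketch rather than a finished parser: the
parse_runs name and the (text, bold, italic) tuple layout are just
my own choices for illustration, and it assumes the tags are plain
<p>, <b> and <i> with no attributes. A tag such as <p class="x"> or
<br /> would not match the pattern, and its contents would come
through as text.

```python
import re

TOKEN_RE = re.compile(r"</?\w+>|[^<>]+")

def parse_runs(text):
    """Split simple markup into paragraphs of (text, bold, italic) runs."""
    paragraphs = []
    current = []
    bold = italic = False
    for token in TOKEN_RE.findall(text):
        if token in ("<p>", "</p>"):
            # A paragraph boundary: store whatever has been collected.
            if current:
                paragraphs.append(current)
            current = []
        elif token == "<b>":
            bold = True
        elif token == "</b>":
            bold = False
        elif token == "<i>":
            italic = True
        elif token == "</i>":
            italic = False
        elif token.startswith("<"):
            # Some other simple tag; this sketch just ignores it.
            pass
        else:
            current.append((token, bold, italic))
    if current:
        paragraphs.append(current)
    return paragraphs

sample = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
          "elements in it</p><p>It can be made up of multiple lines "
          "separated by paragraph tags.</p>")

for number, runs in enumerate(parse_runs(sample), 1):
    print("paragraph", number)
    for run_text, is_bold, is_italic in runs:
        print("   ", repr(run_text), is_bold, is_italic)
```

Each run then tells you which font variant to use when you draw that
piece of text into the image.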