Re: Parsing markup.

MRAB Thu, 25 Nov 2010 20:06:54 -0800

On 26/11/2010 03:28, Joe Goldthwaite wrote:
> I’m attempting to parse some basic tagged markup.  The TinyMCE editor
> returns a string that looks something like this:
>
> <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
> it</p><p>It can be made up of multiple lines separated by pagagraph
> tags.</p>
>
> I’m trying to render the paragraph into a bitmapped image.  I need
> to parse it out into the various paragraph and bold/italic pieces.
> I’m not sure of the best way to approach it.  ElementTree and lxml seem
> to want a fully formatted page, not a small segment like this one.
> When I tried to feed a line similar to the above to lxml I got an
> error; “XMLSyntaxError: Extra content at the end of the document”.
>
I'd probably use a regex:


>>> import re
>>> text = "<p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in it</p><p>It can be made up of multiple lines separated by pagagraph tags.</p>"
>>> re.findall(r"</?\w+>|[^<>]+", text)
['<p>', 'This is a paragraph with ', '<b>', 'bold', '</b>', ' and ', '<i>', 'italic', '</i>', ' elements in it', '</p>', '<p>', 'It can be made up of multiple lines separated by pagagraph tags.', '</p>']
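For rendering, the flat token list probably needs one more step: grouping the text into paragraphs and remembering which pieces are bold or italic. Here is a rough sketch of that, not a tested renderer. The names TOKEN_RE and parse_runs are made up for illustration, and it assumes the markup only uses well-nested <p>, <b> and <i> tags without attributes; any other tag is simply ignored.

```python
import re

# Same pattern as above: either a simple tag or a run of plain text.
TOKEN_RE = re.compile(r"</?\w+>|[^<>]+")

def parse_runs(markup):
    """Return a list of paragraphs; each is a list of (text, bold, italic)."""
    paragraphs = []
    current = None            # runs of the paragraph being built, if any
    bold = italic = False
    for token in TOKEN_RE.findall(markup):
        if token == "<p>":
            current = []
        elif token == "</p>":
            if current is not None:
                paragraphs.append(current)
            current = None
        elif token in ("<b>", "</b>"):
            bold = token == "<b>"
        elif token in ("<i>", "</i>"):
            italic = token == "<i>"
        elif not token.startswith("<"):
            # Plain text; open an implicit paragraph if none is active.
            if current is None:
                current = []
            current.append((token, bold, italic))
        # Unknown tags fall through and are ignored (an assumption).
    if current:
        paragraphs.append(current)
    return paragraphs

if __name__ == "__main__":
    sample = "<p>Some <b>bold</b> and <i>italic</i> text</p><p>Second one.</p>"
    for number, runs in enumerate(parse_runs(sample), 1):
        print(number, runs)
```

Each run can then be drawn with the matching font face. If the editor ever emits attributes (such as <p style="...">) or entities like &amp;, the regex would need extending, and a real HTML parser would likely be the safer choice.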

--
http://mail.python.org/mailman/listinfo/python-list
