On Nov 26, 4:03 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 26/11/2010 03:28, Joe Goldthwaite wrote:
> > I’m attempting to parse some basic tagged markup. The TinyMCE
> > editor returns a string that looks something like this:
> >
> > <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
> > it</p><p>It can be made up of multiple lines separated by paragraph
> > tags.</p>
> >
> > I’m trying to render the paragraph into a bitmapped image. I need
> > to parse it out into the various paragraph and bold/italic pieces.
> > I’m not sure of the best way to approach it. ElementTree and lxml seem
> > to want a fully formed page, not a small segment like this one.
> > When I tried to feed a line similar to the above to lxml I got an
> > error: “XMLSyntaxError: Extra content at the end of the document”.
> >
lxml works fine for me. Have you tried its HTML parser rather than the
XML one?

    from lxml import html

    text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
            "elements in it</p><p>It can be made up of multiple lines "
            "separated by paragraph tags.</p>")
    tree = html.fromstring(text)
    print(tree.findall('p'))
    # should print something like
    # [<Element p at 2b7b458>, <Element p at 2b7b3e8>]

hth

Jon
-- 
http://mail.python.org/mailman/listinfo/python-list
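
For what it's worth, the "Extra content at the end of the document" error is what an XML parser typically reports when a string has more than one top-level element, as the two <p> tags here do. A minimal sketch of one workaround, using the standard library's xml.etree.ElementTree instead of lxml: wrap the fragment in a dummy root element, then walk each paragraph to collect its plain, bold and italic pieces. The names runs and paragraphs and the <root> wrapper are made up for illustration, and this assumes the editor output is well-formed XML (an HTML entity such as &nbsp; would make the strict parser fail).

```python
import xml.etree.ElementTree as ET

text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
        "elements in it</p><p>It can be made up of multiple lines "
        "separated by paragraph tags.</p>")


def runs(elem, styles=()):
    """Yield (text, styles) pairs for elem, in document order.

    styles is a tuple of the enclosing 'b'/'i' tag names (hypothetical
    representation, chosen only for this example).
    """
    if elem.tag in ("b", "i"):
        styles = styles + (elem.tag,)
    if elem.text:
        yield elem.text, styles
    for child in elem:
        for run in runs(child, styles):
            yield run
        # The tail is the text after the child's closing tag; it is
        # styled like the parent, not like the child.
        if child.tail:
            yield child.tail, styles


# A dummy root gives the parser the single top-level element it expects.
root = ET.fromstring("<root>" + text + "</root>")
paragraphs = [list(runs(p)) for p in root.findall("p")]

for para in paragraphs:
    print(para)
```

Each entry in paragraphs is then a list of text runs with their styles, which should be close to what a renderer needs when drawing the text piece by piece.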