Re: Parsing markup.

Stefan Behnel Mon, 29 Nov 2010 08:37:38 -0800

Jon Clements, 26.11.2010 13:58:

On Nov 26, 4:03 am, MRAB <[email protected]> wrote:

On 26/11/2010 03:28, Joe Goldthwaite wrote:
  >  I’m attempting to parse some basic tagged markup.  The output of the
  >  TinyMCE editor returns a string that looks something like this:
  >
  >  <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
  >  it</p><p>It can be made up of multiple lines separated by paragraph
  >  tags.</p>
  >
  >  I’m trying to render the paragraph into a bitmapped image.  I need
  >  to parse it out into the various paragraph and bold/italic pieces.
  >  I’m not sure of the best way to approach it.  ElementTree and lxml seem
  >  to want a fully formatted page, not a small segment like this one.
  >  When I tried to feed a line similar to the above to lxml I got an
  >  error: “XMLSyntaxError: Extra content at the end of the document”.


This exception indicates that the OP is using the XML parser, which accepts only one root element and so rejects the second <p> as extra content.
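
To illustrate the single-root rule, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, purely so that it runs without extra packages; lxml's XML parser enforces the same rule and reports it with the XMLSyntaxError quoted above. The sample fragment and the <root> wrapper name are made up for the example.

```python
import xml.etree.ElementTree as ET

fragment = "<p>first</p><p>second</p>"

# An XML document may have only one root element, so the parser
# rejects everything after the first </p>.
try:
    ET.fromstring(fragment)
    parsed_as_xml = True
except ET.ParseError:
    parsed_as_xml = False

# Wrapping the fragment in a (hypothetical) root element makes it
# well-formed XML, and both paragraphs become children of that root.
wrapped = ET.fromstring("<root>" + fragment + "</root>")
paragraphs = [p.text for p in wrapped.findall("p")]

print(parsed_as_xml, paragraphs)
```

The wrapper trick only helps while the editor output happens to be well-formed XML. Something like an unclosed <br> or an &nbsp; entity would still fail, which is why an HTML parser is the safer choice for this kind of input.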

lxml works fine for me - have you tried:

from lxml import html

# Adjacent string literals are joined, so the markup stays one string.
text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
        "elements in it</p><p>It can be made up of multiple lines "
        "separated by paragraph tags.</p>")
tree = html.fromstring(text)  # the HTML parser accepts a fragment
print(tree.findall('p'))
# should print something like [<Element p at 2b7b458>, <Element p at 2b7b3e8>]
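
Once the fragment is parsed, the paragraph and bold/italic pieces the OP asked about can be read off each element's text, children and tail. What follows is only a sketch under some assumptions: styled_runs is a made-up helper name, the second paragraph is shortened sample text, and it handles just one level of <b>/<i> nesting.

```python
from lxml import html

text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
        "elements in it</p><p>Second paragraph.</p>")


def styled_runs(paragraph):
    """Return (text, style) pairs for one <p>; style is None, 'b' or 'i'."""
    runs = []
    if paragraph.text:
        # Text before the first child element.
        runs.append((paragraph.text, None))
    for child in paragraph:
        style = child.tag if child.tag in ("b", "i") else None
        content = child.text_content()
        if content:
            runs.append((content, style))
        if child.tail:
            # Text that follows the child, up to the next element.
            runs.append((child.tail, None))
    return runs


tree = html.fromstring(text)
# iter() also matches the root element itself, so this still works
# if the parsed result turns out to be a single <p>.
pieces = [styled_runs(p) for p in tree.iter("p")]

for paragraph_runs in pieces:
    print(paragraph_runs)
```

Each paragraph comes back as a list of runs that a renderer could draw in order, switching font for the 'b' and 'i' entries. Nested markup such as <b><i>...</i></b> would need a recursive walk instead.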


Yep, either use lxml.etree's HTML parser or lxml.html.
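
For the lxml.etree route, a minimal sketch might look like this. The sample text is made up. Note that the HTML parser builds the surrounding <html> and <body> elements itself, so the paragraphs are found with a descendant search rather than as direct children.

```python
from lxml import etree

text = "<p>one <b>bold</b></p><p>two</p>"

# Pass an HTML parser explicitly instead of the default XML parser.
parser = etree.HTMLParser()
root = etree.fromstring(text, parser)

# The root is the <html> element added by the parser, so search
# the whole tree for the paragraphs.
paragraphs = root.findall(".//p")

print(root.tag, len(paragraphs))
print(paragraphs[0].text, paragraphs[0][0].tag, paragraphs[0][0].text)
```

lxml.html is a thin layer over the same parser that, among other conveniences, tries to hand back just the fragment instead of a full document.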

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
