BeautifulSoup vs. loose & chars

John Nagle Mon, 25 Dec 2006 16:36:00 -0800

   I've been parsing existing HTML with BeautifulSoup, and occasionally
hit content which has something like "Design & Advertising", that is,
an "&" instead of an "&amp;".  Is there some way I can get BeautifulSoup
to clean those up?  There are various parsing options related to "&"
handling, but none of them seem to do quite the right thing.


   If I write the BeautifulSoup parse tree back out with "prettify",
the loose "&" is still in there, so the output is rejected by XML
parsers.  That is the problem: I need valid XML out, even if what
went in wasn't quite valid.
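
   As a possible stopgap, the text could be pre-processed before it
reaches the parser, escaping any "&" that does not begin a character
reference.  The sketch below uses only the standard "re" module; the
function name and the regular expression are assumptions made for
illustration, not anything BeautifulSoup provides.  It treats any
"&name;" as an entity without checking that the name is a known one,
and it would also rewrite ampersands inside script blocks.

```python
import re

# An "&" is treated as loose unless it starts a named (&amp;),
# decimal (&#169;), or hexadecimal (&#xA9;) character reference.
_LOOSE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_loose_ampersands(markup):
    """Replace each bare "&" with "&amp;", leaving existing references alone."""
    return _LOOSE_AMP.sub("&amp;", markup)


print(escape_loose_ampersands("Design & Advertising"))
# prints: Design &amp; Advertising
```

The cleaned string would then be handed to BeautifulSoup in place of
the raw page text.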

                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
