Re: BeautifulSoup vs. loose & chars

placid Tue, 26 Dec 2006 04:26:02 -0800

John Nagle wrote:
> I've been parsing existing HTML with BeautifulSoup, and occasionally
> hit content which has something like "Design & Advertising", that is,
> an "&" instead of an "&amp;".  Is there some way I can get BeautifulSoup
> to clean those up?  There are various parsing options related to "&"
> handling, but none of them seem to do quite the right thing.
>
>    If I write the BeautifulSoup parse tree back out with "prettify",
> the loose "&" is still in there.  So the output is
> rejected by XML parsers.  Which is why this is a problem.
> I need valid XML out, even if what went in wasn't quite valid.
>
>                               John Nagle



So do you want to remove the "&" characters or replace them with
"&amp;"? If you want to replace them, try the following:

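import re, sys, urllib

url = "http://www.example.com/"  # placeholder: the page you want to fetch

try:
  location = urllib.urlopen(url)
except IOError, (errno, strerror):
  sys.exit("I/O error(%s): %s" % (errno, strerror))

content = location.read()
# Escape only loose "&" characters. A plain replace("&", "&amp;") would
# also turn an existing "&amp;" into "&amp;amp;".
content = re.sub(r"&(?!#?\w+;)", "&amp;", content)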
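
One thing to watch with a blanket replace is that it also rewrites the "&" in entities that are already correct. Here is a minimal sketch of a more careful version, assuming a regular expression is acceptable for the job. The names BARE_AMP and escape_bare_ampersands are made up for illustration, and the sample string is invented.

```python
import re

# Matches "&" unless it already starts an entity such as
# "&amp;", "&#38;" or "&#x26;".
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Replace each loose "&" with "&amp;", leaving existing entities alone."""
    return BARE_AMP.sub("&amp;", text)

sample = "Design & Advertising, R&amp;D, caf&#233;"
naive = sample.replace("&", "&amp;")       # double-escapes the entities
careful = escape_bare_ampersands(sample)   # touches only the loose "&"
print(naive)
print(careful)
```

The pattern only checks that what follows the "&" looks like an entity, so text such as "AT&T;" would be left alone by mistake. Also, an XML parser only knows amp, lt, gt, quot and apos unless a DTD declares more, so HTML entities like "&nbsp;" would still need separate handling.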


To do this with BeautifulSoup, I think you would need to go through
every Tag, get its contents, check whether they contain an "&", and
then replace the Tag with the same Tag whose contents use "&amp;".
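
The same walk-every-node idea can be sketched roughly as follows. To keep the example runnable without depending on a particular BeautifulSoup release, it swaps in the standard library's html.parser (Python 3) for BeautifulSoup's tree. AmpersandFixer and fix_ampersands are hypothetical names, and this is only an illustration of the approach, not a complete HTML-to-XML converter.

```python
from html import escape
from html.parser import HTMLParser

class AmpersandFixer(HTMLParser):
    """Re-emit markup, escaping the text of every node as it streams past."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive decoded, so escape them once on the way out.
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Entities arrive already decoded, so escaping once cannot double-escape.
        self.out.append(escape(data, quote=False))

def fix_ampersands(markup):
    fixer = AmpersandFixer()
    fixer.feed(markup)
    fixer.close()
    return "".join(fixer.out)

print(fix_ampersands('<p class="x">Design & Advertising, R&amp;D</p>'))
```

This prints <p class="x">Design &amp; Advertising, R&amp;D</p>. The sketch drops comments and doctype declarations, does not close void tags such as <br>, and would also escape the contents of script and style elements, so real pages would need more care before the result is valid XML.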

Hope this helps.
Cheers

-- 
http://mail.python.org/mailman/listinfo/python-list
