John Nagle wrote:
> I've been parsing existing HTML with BeautifulSoup, and occasionally
> hit content which has something like "Design & Advertising", that is,
> an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup
> to clean those up? There are various parsing options related to "&"
> handling, but none of them seem to do quite the right thing.
>
> If I write the BeautifulSoup parse tree back out with "prettify",
> the loose "&" is still in there. So the output is
> rejected by XML parsers. Which is why this is a problem.
> I need valid XML out, even if what went in wasn't quite valid.
>
> John Nagle

So do you want to remove the "&" or replace it with "&amp;"? If you want to replace it, try the following:

    import urllib, sys

    # url is assumed to already hold the address of the page to fetch
    try:
        location = urllib.urlopen(url)
    except IOError, (errno, strerror):
        sys.exit("I/O error(%s): %s" % (errno, strerror))

    content = location.read()
    # Note: this also rewrites the "&" of entities that are already escaped
    content = content.replace("&", "&amp;")

To do this with BeautifulSoup, I think you need to go through every Tag, get its content, see if it contains an "&", and then replace the Tag with the same Tag whose content has "&amp;" in its place.

Hope this helps.

Cheers
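
P.S. One caveat with a plain replace("&", "&amp;"): it also hits the "&" in entities that are already written correctly, so an existing "&amp;" becomes "&amp;amp;". A rough sketch of a more selective version is below. It is not a BeautifulSoup feature, just a regular expression run over the raw text, and escape_bare_ampersands is a name I made up for the example. It assumes that anything shaped like "&name;", "&#123;" or "&#x1F;" is an intentional entity and leaves it alone.

```python
import re

# Match "&" only when it does NOT begin something that already looks like
# an entity: named (&amp;), decimal (&#38;) or hexadecimal (&#x26;).
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Replace each loose "&" with "&amp;", leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", text)
```

With this, "Design & Advertising" comes out as "Design &amp; Advertising", while text that already contains "&amp;" is left unchanged. Two limits to keep in mind: the check is purely about shape, so something like "AT&T;" would be taken for an entity and skipped, and XML itself predefines only amp, lt, gt, quot and apos, so HTML entities such as "&nbsp;" that pass through untouched may still be rejected by an XML parser unless they are declared or converted separately.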