on 05.02.2007 03:49 John Nagle said the following:
> (Was previously posted as a followup to something else by accident.)
>
> I'm running a website page through BeautifulSoup. It parses OK
> with Python 2.4, but Python 2.5 fails with an exception:
>
> Traceback (most recent call last):
>   File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
>     self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
>   File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
>     BeautifulStoneSoup.__init__(self, *args, **kwargs)
>   File "./sitetruth/BeautifulSoup.py", line 973, in __init__
>     self._feed()
>   File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>     SGMLParser.feed(self, markup or "")
>   File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>     k = self.parse_starttag(i)
>   File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
>     self.finish_starttag(tag, attrs)
>   File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
>     self.handle_starttag(tag, method, attrs)
>   File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
>     method(attrs)
>   File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
>     self._feed(self.declaredHTMLEncoding)
>   File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>     SGMLParser.feed(self, markup or "")
>   File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>     k = self.parse_starttag(i)
>   File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
>     self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
>
> The code that's failing is in "_convert_ref", which is new in Python 2.5.
> That function wasn't present in 2.4. I think the code is trying to
> handle single quotes inside of double quotes, or something like that.
>
> To replicate, run
>
>     http://www.bankofamerica.com
> or
>     http://www.gm.com
>
> through BeautifulSoup.
>
> Something about this code doesn't like big companies. Web sites of smaller
> companies are going through OK.
>
> Also reported as a bug:
>
> [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
>
> John Nagle

Hi,

I had a similar problem recently and did not have time to file a bug
report. Thanks for doing that.

The problem is in the code that handles entity and character references
in SGMLParser.parse_starttag. It does not seem to be careful about mixing
unicode and str values.

My quick'n'dirty workaround was to remove the offending character entity
from the page text before feeding it to BeautifulSoup::

    text = text.replace('&#174;', '')  # remove rights reserved sign entity

cheers,
stefan

--
http://mail.python.org/mailman/listinfo/python-list
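
The one-off replace above only covers a single entity. A slightly more general version of the same workaround might look like the sketch below. It assumes that the references that trigger the error are numeric character references for code points 128 to 255, which is what the 0xa7 byte in the traceback suggests but which I have not verified against sgmllib itself. The function name `strip_high_charrefs` is made up for this example, and only semicolon-terminated references are handled.

```python
import re

# Matches decimal (&#174;) and hexadecimal (&#xAE;) numeric character
# references. Only semicolon-terminated references are matched.
_CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def strip_high_charrefs(text):
    """Remove numeric character references for code points 128..255.

    Hypothetical helper: the 128..255 range is an assumption based on
    the byte value shown in the traceback, not on the sgmllib source.
    """
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 128 <= code <= 255:
            return ''            # drop the suspect reference
        return match.group(0)    # leave every other reference untouched

    return _CHARREF.sub(_replace, text)
```

It would be called on the page text just before parsing, e.g. `sitetext = strip_high_charrefs(sitetext)`. Like the original workaround it throws the characters away, so it is only suitable where losing signs such as the registered-trademark mark does not matter.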