Bugs item #1651995, was opened at 2007-02-04 22:34 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Python Library Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Nagle (nagle) Assigned to: Nobody/Anonymous (nobody) Summary: sgmllib _convert_ref UnicodeDecodeError exception, new in 2. Initial Comment: I'm running a website page through BeautifulSoup. It parses OK with Python 2.4, but Python 2.5 fails with an exception: Traceback (most recent call last): File "./sitetruth/InfoSitePage.py", line 268, in httpfetch self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form File "./sitetruth/BeautifulSoup.py", line 1326, in __init__ BeautifulStoneSoup.__init__(self, *args, **kwargs) File "./sitetruth/BeautifulSoup.py", line 973, in __init__ self._feed() File "./sitetruth/BeautifulSoup.py", line 998, in _feed SGMLParser.feed(self, markup or "") File "/usr/lib/python2.5/sgmllib.py", line 99, in feed self.goahead(0) File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag self.finish_starttag(tag, attrs) File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag self.handle_starttag(tag, method, attrs) File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag method(attrs) File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta self._feed(self.declaredHTMLEncoding) File "./sitetruth/BeautifulSoup.py", line 998, in _feed SGMLParser.feed(self, markup or "") File "/usr/lib/python2.5/sgmllib.py", line 99, in feed self.goahead(0) File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag self._convert_ref, attrvalue) UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128) The code that's failing is in "_convert_ref", which is new in Python 2.5. That function wasn't present in 2.4. I think the code is trying to handle single quotes inside of double quotes in HTML attributes, or something like that. To replicate, run http://www.bankofamerica.com or http://www.gm.com through BeautifulSoup. Something about this code doesn't like big companies. Web sites of smaller companies are going through OK. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com