Bugs item #1144533, was opened at 2005-02-19 21:02
Message generated for change (Comment added) made by leogah
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1144533&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Allan Hoeltje (ahoeltje)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmllib quote parse error within a <script>

Initial Comment:
I am using the htmllib to parse web pages for plain text content. I came across a web page that contained a script construct similar to the example below. Note that the script is itself writing a script. The htmllib appears to be confused by the use of single and double quotes used within the real <script> and </script> tags.

I am using "Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin" on a PowerBook G4 running OSX 10.3.8.

<html>
<body>
<h1>
This is a test
</h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' + rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>

Here is the Python trace:

Traceback (most recent call last):
  File "cleanFeed.py", line 26, in ?
    clean = stripHtml.strip( feed )
  File "/Users/allan/Desktop/Mood for Today/stripHtml.py", line 144, in strip
    parser.feed(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 327, in parse_endtag
    self.error("bad end tag: %s" % `rawdata[i:j]`)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 1, column 309

----------------------------------------------------------------------

Comment By: Richard Brodie (leogah)
Date: 2005-03-09 00:51

Message:
Logged In: YES
user_id=356893

Generally speaking, you are better off conditioning random junk pulled off the web (with uTidylib or similar) before feeding it to HTMLParser, which tends to report errors when it finds them.

See: http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2.1 for an explanation of why the error message is strictly correct.

Someone may step in with a patch to make HTMLParser more tolerant in this case; there will always be something else though.

----------------------------------------------------------------------
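
For readers following the HTML 4 reference above, here is a minimal sketch of the rule it describes and of one crude way to condition input before parsing. It assumes the reading that script content ends at the first "</" followed by a letter, which is why the parser stops at "</scr" in the reported page. The names strict_cdata_end and drop_scripts are hypothetical helpers, not part of HTMLParser or uTidylib, and the regular-expression filter is a swapped-in stand-in for a real tidying step that will miss badly malformed markup.

```python
import re

# HTML 4.01 appendix B.3.2: inside <script> the content is CDATA, and a
# strict parser treats the first "</" followed by a letter as its end.
ETAGO = re.compile(r"</[a-zA-Z]")

# Hypothetical pre-filter (an assumption, not an HTMLParser feature):
# matches a whole <script>...</script> element, case-insensitively.
SCRIPT_ELEMENT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)


def strict_cdata_end(script_body):
    """Return the offset where a strict parser would end the script data, or -1."""
    match = ETAGO.search(script_body)
    return match.start() if match else -1


def drop_scripts(html):
    """Remove <script> elements so a strict parser never sees their content."""
    return SCRIPT_ELEMENT.sub("", html)


# The script body and page from the report, joined onto one line.
SCRIPT_BODY = (
    "rnum = Math.round( Math.random() * 100000 ); "
    "document.write( '<scr' + 'ipt src=\"http://www.a.org/' + rnum + "
    "'/\"></scr' + 'ipt>' );"
)
PAGE = (
    "<html><body><h1>This is a test</h1><blockquote>"
    "<script language=\"JavaScript\">" + SCRIPT_BODY + "</script>"
    "</blockquote></body></html>"
)

if __name__ == "__main__":
    end = strict_cdata_end(SCRIPT_BODY)
    # Shows the text the strict rule would try to read as an end tag.
    print(repr(SCRIPT_BODY[end:end + 14]))
    # Shows the page with the script element removed before parsing.
    print(drop_scripts(PAGE))
```

Feeding the output of drop_scripts to the parser avoids the "bad end tag" error for this particular page, but as noted above, other malformed input can still trip a strict parser.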