I'm having problems parsing an HTML file with the following syntax:

    <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
    <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
    <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%' BGCOLOR='#c0c0c0'>Date</TH>

and so on.

Whenever I feed the parser such a file, I get this error:

    Traceback (most recent call last):
      File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 91, in <module>
        p.parse(thechange)
      File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 16, in parse
        self.feed(s)
      File "C:\Python25\lib\HTMLParser.py", line 110, in feed
        self.goahead(0)
      File "C:\Python25\lib\HTMLParser.py", line 152, in goahead
        k = self.parse_endtag(i)
      File "C:\Python25\lib\HTMLParser.py", line 316, in parse_endtag
        self.error("bad end tag: %r" % (rawdata[i:j],))
      File "C:\Python25\lib\HTMLParser.py", line 117, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at line 515, column 45

Googling around, I found the same solution to a similar situation over and over again:

http://64.233.169.104/search?q=cache:zOmjwM_sGBcJ:coding.derkeiler.com/pdf/Archive/Python/comp.lang.python/2006-02/msg00026.pdf+CDATA_CONTENT_ELEMENTS&hl=pt-BR&ct=clnk&cd=5&gl=br&client=firefox-a

The suggestion there was:

> you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
> attribute on the parser instance, before you start parsing. by default,
> it is set to
>
>     CDATA_CONTENT_ELEMENTS = ("script", "style")
>
> setting it to an empty tuple disables HTML-compliant handling for these
> elements:
>
>     p = HTMLParser()
>     p.CDATA_CONTENT_ELEMENTS = ()
>     p.feed(f.read())

but coding it that way didn't solve my problem. Instead, I made a small modification to HTMLParser.py that did solve it, as follows:

original:

    endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*) \s*>')

my version:

    endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*) \s*>')

It worked for all the files I needed, and also for a different file that I parse using the same library. I know it might sound stupid, but I was wondering if there's a better way of solving this problem than modifying the standard library. Any clue?

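One possible way to avoid touching the standard library, sketched here under assumptions rather than offered as the definitive fix: clean the malformed end tags out of the input before handing it to the parser. The names strip_end_tag_attributes, END_TAG_WITH_JUNK and TagCollector are hypothetical, made up for this illustration, and the sample string is a shortened stand-in for the real file. The pattern is deliberately naive: it rewrites anything that looks like an end tag, so it would also alter such text inside scripts or comments.

```python
import re

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as in the traceback

# Hypothetical helper pattern: matches an end tag followed by stray text
# before the closing '>', e.g. </TH BGCOLOR='#c0c0c0'>, and captures
# only the tag name.
END_TAG_WITH_JUNK = re.compile(r"</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)[^>]*>")


def strip_end_tag_attributes(html):
    """Rewrite every end tag as a bare </name>, dropping anything after the name."""
    return END_TAG_WITH_JUNK.sub(r"</\1>", html)


class TagCollector(HTMLParser):
    """Minimal parser subclass that records the end tags it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.end_tags = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


# Shortened stand-in for the problematic markup.
sample = "<TH BGCOLOR='#c0c0c0'>User ID</TH BGCOLOR='#c0c0c0'><TH>Name</TH>"

cleaned = strip_end_tag_attributes(sample)
parser = TagCollector()
parser.feed(cleaned)
parser.close()
```

The cleaning step runs on the whole document as a string, so it works the same way regardless of which parser version reads the result, and the installed HTMLParser.py stays unmodified.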
thx in advance,
Felipe.