On Oct 28, 3:18 pm, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> Felipe De Bene wrote:
> > I'm having problems parsing an HTML file with the following syntax :
> >
> > <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
> > <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
> > <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> > BGCOLOR='#c0c0c0'>Date</TH>
> > and so on....
> >
> > whenever I feed the parser with such file I get the error :
> >
> > HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> > line 515, column 45
>
> Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
> for parsing broken HTML. However, you can use the parser of lxml.html to fix up
> your HTML for you.
>
> http://codespeak.net/lxml/
>
> Stefan

It doesn't just choke on bad HTML, it also chokes on JavaScript that writes
HTML. For example,

    document.write('<scr'+'ipt language="javascript1.1" src="http:/...

will also result in an error. However, when I did:

    parser = aqparser()  # An implementation of HTMLParser
    parser.CDATA_CONTENT_ELEMENTS = ()

it worked. Strange...

-Peter
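
For anyone wondering what that attribute actually controls, here is a minimal sketch using Python 3's html.parser (the module is named HTMLParser in Python 2, and current versions no longer raise HTMLParseError, so this does not reproduce the original error). TagCollector, collect_start_tags and SAMPLE are made-up names standing in for aqparser and the real page. As far as I can tell, CDATA_CONTENT_ELEMENTS lists the elements whose content is handed over as raw text; emptying it makes the parser treat script content as ordinary markup.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical stand-in for aqparser: records every start tag it sees."""

    def __init__(self, cdata_elements=None):
        super().__init__()
        self.start_tags = []
        if cdata_elements is not None:
            # Same override as in the post, applied to this instance only.
            self.CDATA_CONTENT_ELEMENTS = cdata_elements

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def collect_start_tags(html, cdata_elements=None):
    """Parse html and return the start tags in the order they were reported."""
    parser = TagCollector(cdata_elements)
    parser.feed(html)
    parser.close()
    return parser.start_tags


# Illustrative input: a script that writes markup, as in the case above.
SAMPLE = "<script>document.write('<b>hi</b>');</script>"

# Default: script content is treated as raw text, so <b> is not reported.
default_tags = collect_start_tags(SAMPLE)

# Override: script content is parsed as markup, so <b> shows up as a tag.
overridden_tags = collect_start_tags(SAMPLE, cdata_elements=())
```

With the default setting only the script tag is reported; with the empty tuple the b tag inside the string literal is reported too. So the override avoids the special script handling, at the cost of the parser seeing "tags" that are really just JavaScript string contents.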