Jan Eden wrote:
>Hi,
>
>I use the following loop to parse some HTML code:
>
>for record in data:
>    try:
>        parser.feed(record['content'])
>    except HTMLParseError, (msg):
>        print "!!!Parsing error in", record['page_id'], ": ", msg
>
>Now after HTMLParser encounters a parse error in one record, it repeats to
>execute the except statement for all following records - why is that?
>
>!!!Parsing error in 8832 : bad end tag: '</em b>', at line 56568, column 1647999
>!!!Parsing error in 8833 : bad end tag: '</em b>', at line 56568, column 1651394
>!!!Parsing error in 8834 : bad end tag: '</em b>', at line 56568, column 1654789
>!!!Parsing error in 8835 : bad end tag: '</em b>', at line 56568, column 1658184
>

The parser processes up to the error. It never recovers from the error. HTMLParser has an internal buffer and buffer pointer that is never advanced when an error is detected; each time you call feed() it tries to parse the remaining data and gets the same error again. Take a look at HTMLParser.goahead() in Lib/HTMLParser.py if you are interested in the details.

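One way around the stale buffer is to give every record its own parser, so a failure in one record cannot carry over into the next. Here is a rough sketch of that idea, not a drop-in fix. It is written for Python 3, where the module is html.parser; recent Python 3 versions no longer raise HTMLParseError, so the TagCollector class, its rejection of a <bad> tag, and the ValueError are all made-up stand-ins for a real parse error.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical handler: records start tags and rejects <bad>
    to simulate a parse error."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "bad":
            raise ValueError("simulated parse error at <bad>")
        self.tags.append(tag)


def parse_records(data):
    """Parse each record with a fresh parser and collect failures separately."""
    results, failures = {}, []
    for record in data:
        parser = TagCollector()  # new instance, so the buffer starts empty
        try:
            parser.feed(record["content"])
            parser.close()  # process anything still buffered
            results[record["page_id"]] = parser.tags
        except ValueError as msg:
            failures.append((record["page_id"], str(msg)))
    return results, failures


# Illustrative data: the first record fails, the second parses cleanly.
data = [
    {"page_id": 8832, "content": "<p>ok</p><bad>"},
    {"page_id": 8833, "content": "<em>fine</em>"},
]
results, failures = parse_records(data)
```

If you would rather keep a single parser, calling parser.reset() in the except block should have a similar effect, since reset() discards all unprocessed data.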
IIRC HTMLParser is not noted for handling badly formed HTML. Beautiful Soup, ElementTidy, or HTML Scraper might be a better choice depending on what you are trying to do.

Kent