I've been using HTMLParser to scrape web sites. The trouble is that there's a lot of malformed HTML out there. Real browsers have to cope gracefully with it, but HTMLParser does not: not only does it raise an exception, the parser object is then left in a confused state, so you cannot continue using it.
My current workaround is to do a pre-parsing run with a dummy (non-subclassed) HTMLParser object. Every time I hit HTMLParseError, I note the line number in a set of lines to skip, then create a new HTMLParser object and restart the scan from the beginning, skipping every line I've noted so far. Only when I reach the end without further errors do I do the proper parse with all my appropriate actions. -- http://mail.python.org/mailman/listinfo/python-list
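The restart-and-skip loop described above can be sketched roughly as follows. Note that HTMLParseError only existed in Python 2's HTMLParser module (it was removed in Python 3.5, where html.parser recovers silently instead of raising), so this sketch uses a stand-in parse_line function that raises ValueError on malformed input purely to illustrate the control flow; that function and its error condition are illustrative assumptions, not the poster's actual code.

```python
def parse_line(line):
    # Stand-in for feeding one line to a parser; raises on malformed
    # markup, the way Python 2's HTMLParser raised HTMLParseError.
    if "<bad" in line:
        raise ValueError("malformed tag")

def find_bad_lines(lines):
    """Pre-parse pass: after each error, note the failing line number
    and restart from the top, skipping every line noted so far, until
    one clean run completes."""
    skip = set()
    while True:
        try:
            for lineno, line in enumerate(lines, 1):
                if lineno in skip:
                    continue
                parse_line(line)
        except ValueError:
            skip.add(lineno)   # note the line that failed, then restart
            continue
        return skip            # clean run: these are the lines to skip

html = ["<p>ok</p>", "<bad>", "<div>fine</div>", "<bad again>"]
bad = find_bad_lines(html)
# bad == {2, 4}; the real parse then feeds only the remaining lines
```

Each error costs a full rescan from the beginning, so this is quadratic in the worst case, but it matches the described approach of only trusting a run that finishes without further errors.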