Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser object then gets into a
> confused state after that so you cannot continue using it.
>
> The way I'm currently working around this is to do a dummy pre-parsing
> run with a dummy (non-subclassed) HTMLParser object. Every time I hit
> HTMLParseError, I note the line number in a set of lines to skip, then
> create a new HTMLParser object and restart the scan from the beginning,
> skipping all the lines I've noted so far. Only when I get to the end
> without further errors do I do the proper parse with all my appropriate
> actions.

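For anyone following along, the skip-and-restart loop described above can be sketched roughly as below. This is only an illustration under assumptions, not the original poster's code: `ParseError` and `StrictParser` are hypothetical stand-ins for HTMLParseError and a plain HTMLParser (the toy parser simply rejects any line containing `<<`), and `find_bad_lines` is a name made up for this sketch.

```python
class ParseError(Exception):
    """Hypothetical stand-in for HTMLParseError; carries a 1-based line number."""

    def __init__(self, msg, lineno):
        super().__init__(msg)
        self.lineno = lineno


class StrictParser:
    """Hypothetical strict parser: fails on any line containing '<<'."""

    def __init__(self):
        self.lineno = 0

    def feed(self, line):
        self.lineno += 1
        if "<<" in line:
            raise ParseError("malformed tag", self.lineno)


def find_bad_lines(lines, make_parser=StrictParser):
    """Return the set of 1-based line numbers to skip for an error-free parse."""
    skip = set()
    while True:
        # A parser is unusable after an error, so every pass starts with a new one.
        parser = make_parser()
        try:
            for number, line in enumerate(lines, start=1):
                if number in skip:
                    # Feed a blank line so the parser's line count stays aligned.
                    parser.feed("\n")
                else:
                    parser.feed(line)
        except ParseError as err:
            if err.lineno in skip:
                raise  # no progress is possible; give up rather than loop forever
            skip.add(err.lineno)
            continue  # restart the scan from the beginning
        return skip


page = ["<html>\n", "<<b>bad\n", "<p>ok</p>\n", "<a <<href>\n", "</html>\n"]
bad_lines = find_bad_lines(page)
```

Once the set of bad lines is known, the real parse (with the subclassed parser and its handlers) can run once over the remaining lines. Note that each error costs a full rescan, so a page with many bad lines is parsed many times.
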
You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html) as a
first step to get well-formed HTML.

Daniel