Jan Eden wrote:

>Hi,
>
>I use the following loop to parse some HTML code:
>
>for record in data:
>    try:
>        parser.feed(record['content'])
>    except HTMLParseError, (msg):
>        print "!!!Parsing error in", record['page_id'], ": ", msg
>
>Now after HTMLParser encounters a parse error in one record, it keeps 
>executing the except clause for all following records - why is that?
>
>!!!Parsing error in 8832 :  bad end tag: '</em b>', at line 56568, column 1647999
>!!!Parsing error in 8833 :  bad end tag: '</em b>', at line 56568, column 1651394
>!!!Parsing error in 8834 :  bad end tag: '</em b>', at line 56568, column 1654789
>!!!Parsing error in 8835 :  bad end tag: '</em b>', at line 56568, column 1658184
>
The parser processes the input up to the error and never recovers from it. 
HTMLParser keeps an internal buffer and a pointer into it, and the pointer 
is not advanced when an error is detected. Each call to feed() appends the 
new data to that buffer and tries to parse the remaining data again, so it 
hits the same error every time. Take a look at HTMLParser.goahead() in 
Lib/HTMLParser.py if you are interested in the details.
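
One way around the stuck buffer is to call reset() on the parser after a 
failure (or to create a new parser for each record). Here is a minimal 
sketch, assuming Python 3's html.parser, where HTMLParseError no longer 
exists. StrictParser, its <bad> tag, and parse_records are hypothetical 
names made up for illustration, and a ValueError raised from a handler 
stands in for the parse error:

```python
from html.parser import HTMLParser


class StrictParser(HTMLParser):
    """Hypothetical parser: raises on a <bad> tag to simulate a parse error."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "bad":
            raise ValueError("bad tag")
        self.tags.append(tag)


def parse_records(data):
    """Feed every record to one shared parser, resetting it after a failure."""
    parser = StrictParser()
    failed = []
    for record in data:
        try:
            parser.feed(record["content"])
        except ValueError as msg:
            print("!!!Parsing error in", record["page_id"], ":", msg)
            failed.append(record["page_id"])
            # reset() discards the unparsed buffer, so the next record
            # is not parsed together with the leftover broken input.
            parser.reset()
    return failed, parser.tags


# Made-up sample data: only the second record triggers the simulated error.
data = [
    {"page_id": 1, "content": "<p>ok</p>"},
    {"page_id": 2, "content": "<bad>oops"},
    {"page_id": 3, "content": "<em>fine</em>"},
]
failed, tags = parse_records(data)
```

With the reset() call only record 2 is reported. Without it, record 3 
should fail as well, because the leftover "<bad>oops" is still at the front 
of the buffer when the next feed() runs. Note that reset() throws away all 
unprocessed data, so whatever followed the error in the failing record is 
lost.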

IIRC HTMLParser is not noted for handling badly formed HTML. Beautiful 
Soup, ElementTidy, or HTML Scraper might be a better choice depending on 
what you are trying to do.

Kent

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
