Re: HTMLParser fragility

Rene Pijlman Wed, 05 Apr 2006 03:50:44 -0700

Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this 
>is, there's a lot of malformed HTML out there. Real browsers have to be 
>written to cope gracefully with this, but HTMLParser does not.


There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/
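
To give a rough idea of what "forgiving" parsing looks like in practice, here is a minimal sketch that pulls links out of deliberately broken markup. It is not BeautifulSoup itself: it assumes a current Python 3, where the standard library's html.parser.HTMLParser no longer raises on malformed input, so it can run without third-party packages. The names LinkCollector, extract_links and messy are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return every non-empty href found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.links


# Sloppy on purpose: unclosed <p>, <b> and <a>, an unquoted attribute,
# and a stray </div> that was never opened.
messy = '<p>First <b>bold <a href=one.html>one</a></div><p>Second <a href="two.html">two'
print(extract_links(messy))  # ['one.html', 'two.html']
```

The collector only reacts to start tags, so missing or mismatched end tags do not matter to it. If you need an actual document tree rather than a stream of events, that is where a library like BeautifulSoup earns its keep.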

-- 
René Pijlman
-- 
http://mail.python.org/mailman/listinfo/python-list
