Re: html parser?

Paul Boddie Tue, 18 Oct 2005 10:35:54 -0700

Thorsten Kampe wrote:
> For simple things like that "BeautifulSoup" might be overkill.


[HTMLParser example]

I've used SGMLParser with some success before, although the SAX-style
processing is objectionable to many people. One alternative is to use
libxml2dom [1] and to parse documents as HTML:

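import urllib
import libxml2dom

url = 'http://www.python.org'
# html=1 requests HTML parsing instead of the default XML parsing
doc = libxml2dom.parse(urllib.urlopen(url), html=1)
# XPath query selecting every <a> element in the document
anchors = doc.xpath("//a")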

Currently, the parseURI function in libxml2dom doesn't do HTML parsing,
mostly because I haven't yet figured out which combination of parsing
options has to be set to make it happen, but a combination of urllib
and libxml2dom should perform adequately. In the above example, you'd
process the nodes in the anchors list to get the desired results.
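
As a rough sketch of that last step, something like the following could
pull the link targets out of a list of anchor nodes. The helper name
collect_hrefs is made up for illustration, and the sketch assumes the
nodes offer the standard DOM getAttribute method. To keep it runnable
without libxml2dom installed, the sample input is built with the standard
library's xml.dom.minidom as a stand-in, so it needs well-formed markup
and is not the lenient HTML parsing shown above.

```python
from xml.dom import minidom


def collect_hrefs(anchors):
    """Return the href value of every anchor node that has a non-empty one."""
    hrefs = []
    for anchor in anchors:
        # getAttribute is the standard DOM accessor. Assumption: a missing
        # attribute comes back as "" or None, both of which are skipped here.
        href = anchor.getAttribute("href")
        if href:
            hrefs.append(href)
    return hrefs


# Stand-in for the anchors list, built from a small well-formed sample.
sample = minidom.parseString(
    '<html><body>'
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="http://www.python.org/psf">PSF</a>'
    '</body></html>'
)
sample_hrefs = collect_hrefs(sample.getElementsByTagName("a"))
print(sample_hrefs)
```

With the earlier example, the call would presumably be
collect_hrefs(anchors). Anchors that only carry a name attribute are
skipped, since they don't point anywhere.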

Paul

[1] http://www.python.org/pypi/libxml2dom

-- 
http://mail.python.org/mailman/listinfo/python-list
