"Fredrik Lundh" <[EMAIL PROTECTED]> writes:
[...]
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
> 
>     from HTMLParser import HTMLParser
> 
>     class MyHTMLParser(HTMLParser):
> 
>         def handle_starttag(self, tag, attrs):
>             if tag == "a":
>                 for key, value in attrs:
>                     if key == "href":
>                         print value
> 
>     p = MyHTMLParser()
>     p.feed(text)
>     p.close()
> 
> see:
> 
>     http://docs.python.org/lib/module-HTMLParser.html
>     http://docs.python.org/lib/htmlparser-example.html
>     http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html

It's worth noting that module HTMLParser is less tolerant of the bad
HTML you find in the real world than module sgmllib, which has a
similar interface.  There are also third-party libraries like
BeautifulSoup and mxTidy that you may find useful for parsing "HTML as
deployed" (i.e. often bad HTML).

Also, htmllib is an extension to sgmllib, and will do your link
parsing with even less effort:

import htmllib, formatter, urllib2
pp = htmllib.HTMLParser(formatter.NullFormatter())
pp.feed(urllib2.urlopen("http://python.org/").read())
print pp.anchorlist


Module HTMLParser does have better support for XHTML, though.


John

-- 
http://mail.python.org/mailman/listinfo/python-list