"Fredrik Lundh" <[EMAIL PROTECTED]> writes:

[...]

> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>
>     from HTMLParser import HTMLParser
>
>     class MyHTMLParser(HTMLParser):
>
>         def handle_starttag(self, tag, attrs):
>             if tag == "a":
>                 for key, value in attrs:
>                     if key == "href":
>                         print value
>
>     p = MyHTMLParser()
>     p.feed(text)
>     p.close()
>
> see:
>
>     http://docs.python.org/lib/module-HTMLParser.html
>     http://docs.python.org/lib/htmlparser-example.html
>     http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
It's worth noting that module HTMLParser is less tolerant of the bad
HTML you find in the real world than is module sgmllib, which has a
similar interface.  There are also third-party libraries like
BeautifulSoup and mxTidy that you may find useful for parsing "HTML
as deployed" (i.e. often bad HTML).

Also, htmllib is an extension of sgmllib, and will do your link
parsing with even less effort:

    import htmllib, formatter, urllib2

    pp = htmllib.HTMLParser(formatter.NullFormatter())
    pp.feed(urllib2.urlopen("http://python.org/").read())
    print pp.anchorlist

Module HTMLParser does have better support for XHTML, though.


John
--
http://mail.python.org/mailman/listinfo/python-list
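As a footnote, the two snippets above can be combined: an HTMLParser subclass that collects hrefs into a list, much like the `anchorlist` attribute htmllib provides.  This is only a sketch, not from the original post; the `LinkCollector` name is mine, and the try/except import shim is an assumption to keep it running on later Pythons, where the module was renamed to `html.parser`.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # renamed in Python 3

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag into self.anchorlist."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, with names lowercased
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    self.anchorlist.append(value)

p = LinkCollector()
p.feed('<p><a href="http://python.org/">Python</a> '
       '<a href="/doc/" />text</p>')
p.close()
print(p.anchorlist)
```

Note that the second anchor is written XHTML-style (`<a ... />`); HTMLParser reports it via handle_startendtag, whose default implementation calls handle_starttag, so it still lands in the list.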