Fredrik Lundh wrote:
> ".*" gives the longest possible match (you can think of it as searching
> backwards from the right end).  if you want to search for "everything
> until a given character", searching for "[^x]*x" is often a better
> choice than ".*x".
>
> in this case, I suggest using something like
>
>     print re.findall("href=\"([^\"]+)\"", text)
>
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>
>     from HTMLParser import HTMLParser
>
>     class MyHTMLParser(HTMLParser):
>
>         def handle_starttag(self, tag, attrs):
>             if tag == "a":
>                 for key, value in attrs:
>                     if key == "href":
>                         print value
>
>     p = MyHTMLParser()
>     p.feed(text)
>     p.close()
>
> see:
>
>     http://docs.python.org/lib/module-HTMLParser.html
>     http://docs.python.org/lib/htmlparser-example.html
>     http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
>
> </F>
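To make the difference between the greedy and minimal forms concrete, here is a small sketch comparing the three patterns discussed in this thread. The sample string is made up for illustration and is not taken from any real page; the code uses the function form of print so it runs the same way on old and new Python versions.

```python
import re

# Hypothetical sample input, invented for this illustration.
text = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'

# Greedy: ".*" runs to the last double quote in the string,
# so both links end up inside one oversized match.
greedy = re.findall(r'href="(.*)"', text)

# Minimal: ".*?" stops at the first double quote that lets the match succeed.
lazy = re.findall(r'href="(.*?)"', text)

# Negated character class: "[^"]+" cannot cross a double quote at all.
negated = re.findall(r'href="([^"]+)"', text)

print(greedy)
print(lazy)
print(negated)
```

On this input the minimal and negated-class patterns each return the two URLs separately, while the greedy pattern returns a single string spanning both tags. The negated class also refuses an empty href, since "+" needs at least one character.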
Thanks for your help. I found another solution: adding a '?' after ".*" makes the match non-greedy, so it matches the shortest string that satisfies the regular expression.

As for HTMLParser, I ran into another problem. Take my code, for example:

    import urllib
    import htmllib
    import formatter

    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(urllib.urlopen(baseUrl).read())
    parser.close()

    # print only the absolute http:// links
    for url in parser.anchorlist:
        if url[0:7] == "http://":
            print url

When baseUrl = "http://www.nba.com", this raises an HTMLParseError because of the line "<! Copyright IBM Corporation, 2001, 2002 !>". I found that this line is inside <script> tags; maybe that is the cause?
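One possible way around this, sketched here under the assumption that Python 3 is available: its html.parser module (the successor to the HTMLParser module mentioned earlier in the thread) passes the contents of <script> elements through as plain data, so a stray "<! ... !>" inside a script should not be parsed as markup. This does not diagnose the htmllib error itself. The LinkCollector class name and the sample page below are invented for illustration; only the copyright line comes from the message above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http:// links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            # value is None for attributes written without a value
            if key == "href" and value and value.startswith("http://"):
                self.links.append(value)


# Hypothetical page: a script holding the problematic line, then two links.
page = (
    "<html><body>"
    "<script><! Copyright IBM Corporation, 2001, 2002 !></script>"
    '<a href="http://www.example.com/news">news</a>'
    '<a href="/relative">relative</a>'
    "</body></html>"
)

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

Collecting the links in a list instead of printing them inside the handler keeps the filtering on "http://" in one place, much like the loop over parser.anchorlist.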