Johnny Lee wrote:

> I've met a problem in match a regular expression in python. Hope
> any of you could help me. Here are the details:
>
> I have many tags like this:
>     xxx<a href="http://xxx.xxx.xxx" xxx>xxx
>     xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
>     xxx<a href="http://xxx.xxx.xxx" xxx>xxx
>     .....
> And I want to find all the "http://xxx.xxx.xxx" out, so I do it
> like this:
>     httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
>     result = httpPat.findall(data)
> I use this to observe my output:
>     for i in result:
>         print i[2]
> Surprisingly I will get some output like this:
>     http://xxx.xxx.xxx">xxx</a>xxx
> In fact it's filtered from this kind of source:
>     <a href="http://xxx.xxx.xxx">xxx</a>xxx"
> But some result are right, I wonder how can I get the all the
> answers clean like "http://xxx.xxx.xxx"? Thanks for your help.
".*" gives the longest possible match (you can think of it as searching back- wards from the right end). if you want to search for "everything until a given character", searching for "[^x]*x" is often a better choice than ".*x". in this case, I suggest using something like print re.findall("href=\"([^\"]+)\"", text) or, if you're going to parse HTML pages from many different sources, a real parser: from HTMLParser import HTMLParser class MyHTMLParser(HTMLParser): def handle_starttag(self, tag, attrs): if tag == "a": for key, value in attrs: if key == "href": print value p = MyHTMLParser() p.feed(text) p.close() see: http://docs.python.org/lib/module-HTMLParser.html http://docs.python.org/lib/htmlparser-example.html http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html </F> -- http://mail.python.org/mailman/listinfo/python-list