On Mon, Jan 21, 2013 at 3:45 PM, Marcin Mleczko <marcin.mlec...@onet.eu> wrote:
> Now I'm changing the input string to (adding an extra '<'):
>
> s = '<<html><head><title>Title</title>'
>
> and evoking the last command again:
>
> print re.match('<.*?>', s).group()
>
> I would expect to get the same result
>
> <html>
>
> as I'm using the non-greedy pattern. What I get is
>
> <<html>
>
> Did I get the concept of non-greedy wrong or is this really a bug?
>

No, this is not a bug. Note first that you are using re.match, which only tries to match at the beginning of the string. If you want to match anywhere inside the string, you should use re.search, which returns the first match found. However, even re.search will still return '<<html>', since that *is* a valid match of the regular expression '<.*?>', and re.search returns the first match it finds.

In essence, re.search first tries calling match(regex, s), then match(regex, s[1:]), then match(regex, s[2:]), and so on, moving on one character at a time until the regular expression produces a match. Since the regex produces a match starting at the first character, matching from the second character isn't even tried.

It is true that non-greedy matching will try to match the fewest characters possible. However, it only does so from the starting position the engine is currently trying. It will not make the engine give up a match that starts earlier in the string and move the start forward to see whether that produces a shorter match. If a greedy variant of a regex matches, then the non-greedy variant *will* also match, starting at the same place. The only difference is the length of the result.

More generally, regexes cannot parse HTML fully, since they simply lack the power: HTML is just not a regular language. If you want to parse arbitrary HTML documents, or even sufficiently complex ones, you should get a real HTML parser library (Python includes one, check the docs). If you just want to grab some data from HTML tags, it's probably OK to use regexes though, if you're careful.

HTH,
Hugo
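P.S. The behaviour described above can be checked with a short sketch. The helper first_match is a hypothetical name used only for illustration: it is a rough model of what re.search does, not how the re module is actually implemented. The pattern '<[^<>]*>' is one possible workaround, assuming a tag never contains '<' or '>' inside it. The print calls use a single argument, so they should run under both Python 2 and Python 3.

```python
import re

s = '<<html><head><title>Title</title>'

# Non-greedy: the shortest match from the leftmost position that can match.
lazy = re.search('<.*?>', s).group()

# Greedy: the longest match from that same starting position.
greedy = re.search('<.*>', s).group()

# Workaround (assumption: no '<' or '>' inside a tag). Excluding '<' from
# the tag body makes the attempt at index 0 fail, so the search moves on.
strict = re.search('<[^<>]*>', s)


def first_match(pattern, text):
    """Hypothetical helper: rough model of re.search.

    Tries re.match on text[0:], text[1:], text[2:], ... and returns
    (start, matched_text) for the first offset that matches, else None.
    """
    for start in range(len(text) + 1):
        m = re.match(pattern, text[start:])
        if m:
            return start, m.group()
    return None


print(lazy)                         # <<html>
print(greedy == s)                  # True: greedy runs to the last '>'
print(strict.group())               # <html>
print(first_match('<.*?>', s))      # (0, '<<html>')
print(first_match('<[^<>]*>', s))   # (1, '<html>')
```

Both the greedy and the non-greedy pattern start matching at index 0; only the stricter character class pushes the start of the match to index 1.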