On 22/01/13 01:45, Marcin Mleczko wrote:

Now I'm changing the input string to (adding an extra '<'):

s = '<<html><head><title>Title</title>'

and invoking the last command again:

print re.match('<.*?>', s).group()

I would expect to get the same result

<html>

as I'm using the non-greedy pattern. What I get is

<<html>

Did I get the concept of non-greedy wrong or is this really a bug?


Definitely not a bug.


Your regex says:

"Match from the beginning of the string: less-than sign, then everything
up to the FIRST (non-greedy) greater-than sign."

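So it matches the "<" at the beginning of the string, followed by the
"<html", followed by ">". Non-greedy only makes the match stop at the
earliest possible ">"; it does not move the starting point, which
re.match pins to the beginning of the string.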

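Here is a small sketch of that behaviour. It assumes Python 3 (so print
is a function), and the variable names are only for illustration:

```python
import re

s = '<<html><head><title>Title</title>'

# re.match is anchored at position 0, so the match has to start with
# the very first '<'. The non-greedy '.*?' then consumes as little as
# it can before reaching the first '>'.
anchored = re.match('<.*?>', s)
print(anchored.group())   # <<html>
print(anchored.span())    # (0, 7)

# For comparison, the greedy version runs to the LAST '>' instead.
# Greedy versus non-greedy changes where the match ends, not where
# it begins.
greedy = re.match('<.*>', s)
print(greedy.group() == s)   # True
```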

To get the result you are after, you could do this:

# Match two < signs, but only report from the second on
re.match('<(<.*?>)', s).group(1)


# Skip the first character
re.match('<.*?>', s[1:]).group()


# Don't match on < inside the <> tags
re.search('<[^<]*?>', s).group()


Notice that the last example must use re.search, not re.match,
because the pattern cannot match at the beginning of the string:
the second character is a "<", which [^<] refuses to consume.
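
If you want to check this for yourself, something along these lines
should do. It is only a sketch, assuming Python 3, and the variable
names are mine:

```python
import re

s = '<<html><head><title>Title</title>'

# The three workarounds should all give the same answer.
via_group = re.match('<(<.*?>)', s).group(1)
via_slice = re.match('<.*?>', s[1:]).group()
via_search = re.search('<[^<]*?>', s).group()
print(via_group, via_slice, via_search)   # <html> <html> <html>

# Anchored at position 0, the third pattern fails outright, so
# re.match returns None rather than a match object.
anchored = re.match('<[^<]*?>', s)
print(anchored)   # None

# re.search is free to try later starting positions and succeeds
# at index 1.
print(re.search('<[^<]*?>', s).start())   # 1
```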



By the way, you cannot parse general-purpose HTML with regular
expressions. You really should learn how to use Python's HTML
parsers, rather than trying to jury-rig something that will do a
dodgy job.
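
As a rough starting point, here is a minimal sketch using the standard
library parser. It assumes Python 3, where the module is html.parser
(in Python 2 it is called HTMLParser). The TagCollector class and the
well-formed sample document are made up for illustration:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and the <title> text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.title = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        self.tags.append(tag)
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text between tags; keep it only while inside <title>.
        if self._in_title:
            self.title += data


collector = TagCollector()
collector.feed('<html><head><title>Title</title></head></html>')
collector.close()
print(collector.tags)    # ['html', 'head', 'title']
print(collector.title)   # Title
```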




--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor