On 22/01/13 01:45, Marcin Mleczko wrote:
> Now I'm changing the input string to (adding an extra '<'):
> s = '<<html><head><title>Title</title>'
> and invoking the last command again:
> print re.match('<.*?>', s).group()
> I would expect to get the same result
> <html>
> as I'm using the non-greedy pattern. What I get is
> <<html>
> Did I get the concept of non-greedy wrong or is this really a bug?

Definitely not a bug.
Your regex says:
"Match from the beginning of the string: less-than sign, then everything
up to the FIRST (non-greedy) greater-than sign."
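A short sketch may make this concrete. It assumes Python 3 (so print is a function) and reuses the string s from the question; the names greedy and lazy are mine, chosen only for illustration. Non-greedy changes where the match ends, not where it starts.

```python
import re

s = '<<html><head><title>Title</title>'

# re.match anchors at position 0, so both matches must begin with the first '<'.
greedy = re.match('<.*>', s)
lazy = re.match('<.*?>', s)

print(greedy.group())  # extends to the LAST '>' in the string
print(lazy.group())    # stops at the FIRST '>' in the string
print(lazy.start())    # both matches start at index 0
```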
So it matches the "<" at the beginning of the string, followed by the
"<html", followed by ">".
To get the result you are after, you could do this:
# Match two < signs, but only report from the second on
re.match('<(<.*?>)', s).group(1)
# Skip the first character
re.match('<.*?>', s[1:]).group()
# Don't match on < inside the <> tags
re.search('<[^<]*?>', s).group()
Notice that the last example must use re.search, not re.match,
because it does not match the beginning of the string.
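If you want to try all three side by side, something like the following should work. It is only a sketch, assuming Python 3; the variable names first, second, third and anchored are made up for this example.

```python
import re

s = '<<html><head><title>Title</title>'

# 1. Match two '<' signs, but capture only from the second one on.
first = re.match('<(<.*?>)', s).group(1)

# 2. Skip the first character before matching.
second = re.match('<.*?>', s[1:]).group()

# 3. Forbid '<' inside the tag and let re.search find where the match starts.
third = re.search('<[^<]*?>', s).group()

# The same pattern with re.match returns None: nothing matches at position 0.
anchored = re.match('<[^<]*?>', s)

print(first)
print(second)
print(third)
print(anchored)
```

All three approaches give '<html>' for this particular string, while the anchored attempt gives None.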
By the way, you cannot parse general-purpose HTML with regular
expressions. You really should learn how to use Python's HTML
parsers, rather than trying to jury-rig something that will do a
dodgy job.
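As a rough starting point, here is a minimal sketch using html.parser from the Python 3 standard library (in Python 2 the module is named HTMLParser). The TagCollector class is a hypothetical helper written for this example, and the input is the original string without the extra '<'.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect start-tag names and text chunks in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, e.g. 'html', 'head', 'title'.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text found between tags.
        self.text.append(data)


collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()

print(collector.tags)
print(collector.text)
```

The parser does the tokenising for you, so there is no pattern to get subtly wrong when the markup becomes less regular.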
--
Steven
_______________________________________________
Tutor maillist - Tutor@python.org