andrew cooke wrote:
Is the third case here surprising to anyone else?  It doesn't make
sense to me...

Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
from re import compile
p1 = compile('a\x62c')

'a\x62c' is a string literal which is the same as 'abc', so re.compile
receives the characters:

    abc

as the regex, which matches the string:

    abc

p1.match('abc')
<_sre.SRE_Match object at 0x7f4e8f93d578>
p2 = compile('a\\x62c')

'a\\x62c' is a string literal which represents the characters:

    a\x62c

so re.compile receives these characters as the regex.

The re module has its own set of escape sequences, most of which are
the same as Python's string escape sequences. The re module treats \x62
just like the string escape, i.e. it represents the character 'b', so
this regex is the same as:

    abc
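
A quick way to see the two levels of escaping at work is sketched below.
The names s1, s2, m1 and m2 are only illustrative and are not part of the
original session; the snippet is assumed to run on Python 2.6+ or 3.

```python
import re

# Level 1: the Python string literal. \x62 is decoded to 'b' by Python
# before the re module ever sees the pattern.
s1 = 'a\x62c'

# Doubling the backslash keeps \x62 as four literal characters, so the
# string is: a \ x 6 2 c
s2 = 'a\\x62c'

print(s1 == 'abc')   # the literal already is 'abc'
print(len(s2))       # six characters, backslash included

# Level 2: the re module. It decodes \x62 itself, so both patterns
# end up matching the text 'abc'.
m1 = re.compile(s1).match('abc')
m2 = re.compile(s2).match('abc')
print(m1 is not None, m2 is not None)
```

Both patterns match 'abc', but for different reasons: in the first case
Python did the decoding, in the second the re module did.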

p2.match('abc')
<_sre.SRE_Match object at 0x7f4e8f93d920>
p3 = compile('a\\\x62c')

'a\\\x62c' is a string literal which is the same as 'a\\bc', so
re.compile receives the characters:

    a\bc

as the regex.

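The re module treats \b in a regex as a word boundary, unless it appears
inside a character set, e.g. [\b], where it means a backspace.

The regex therefore tries to match a word boundary sandwiched between
the two letters 'a' and 'c'. That can never happen: a word boundary
needs a word character on one side and a non-word character (or the
start or end of the string) on the other, and here both sides are
letters.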
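
As a rough check of that claim, the sketch below runs the same pattern
against a handful of candidate strings. The candidates list and the
variable names are my own additions for illustration; the snippet is
assumed to run on Python 2.6+ or 3.

```python
import re

# Same literal as p3: Python turns '\\' into one backslash and \x62
# into 'b', so the regex text is the four characters a \ b c.
pattern_text = 'a\\\x62c'
p3 = re.compile(pattern_text)

# Illustrative candidates, including a literal backslash and a real
# backspace character ('\b' in a string literal is chr(8)).
candidates = ['abc', 'a\\bc', 'ac', 'a c', 'a\bc']
results = [p3.match(s) for s in candidates]
print(results)

# \b does work as a word boundary when a boundary can actually exist.
boundary = re.search('a\\b', 'a c')

# Inside a character set, \b means backspace instead.
backspace = re.compile('a[\\b]c').match('a\x08c')
print(boundary is not None, backspace is not None)
```

None of the candidates can match, because the pattern demands 'a'
immediately followed by 'c' with a boundary between them.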

p3.match('a\\bc')
p3.match('abc')
p3.match('a\\\x62c')
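
If the intent was to match the literal text a\bc (with a real
backslash), the regex itself has to contain an escaped backslash. The
sketch below shows three equivalent ways, assuming that was the goal;
the variable names are illustrative only.

```python
import re

# The target text: four characters a, backslash, b, c.
target = 'a\\bc'

# The regex needs two backslashes (an escaped backslash). In a plain
# string literal each of those must be doubled again.
p_plain = re.compile('a\\\\bc')

# A raw string avoids the first level of escaping.
p_raw = re.compile(r'a\\bc')

# re.escape builds a pattern that matches the given text literally.
p_escaped = re.compile(re.escape(target))

matches = [p.match(target) for p in (p_plain, p_raw, p_escaped)]
print([m is not None for m in matches])
```

Raw strings are usually the least error-prone choice for regexes, since
only the re module's escaping rules then apply.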

