goog cheng wrote:
> Hi, I got this problem :
>
> #!python
> # -*- coding: utf-8 -*-
> import re
>
> p = re.compile(ur'\bc123\b')
> print '**',p.search('no class c123 at all').group()
>
> p = re.compile(ur'\b\u7a0b\u6770\b')
> print ur'\u7a0b\u6770'
> print '****',p.search(' 程杰 abc'.decode('utf8'))
>
> why the \b boundary can't match the word '程杰'

You need to provide the UNICODE flag:

>>> re.compile(ur"\b程杰\b").search(u" 程杰 abc")
>>> re.compile(ur"\b程杰\b", re.UNICODE).search(u" 程杰 abc")
<_sre.SRE_Match object at 0x7f0beb325f38>

See http://docs.python.org/library/re.html

"""
Note that formally, \b is defined as the boundary between a \w and a \W
character (or vice versa), or between \w and the beginning/end of the
string
...
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the set
[a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever
characters are defined as alphanumeric for the current locale. If UNICODE
is set, this will match the characters [0-9_] plus whatever is classified
as alphanumeric in the Unicode character properties database.
"""
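
For anyone trying this on Python 3, here is a minimal sketch of the same effect. It assumes Python 3.3 or later, where str patterns are Unicode-aware by default and the re.ASCII flag (which does not exist in Python 2) gives the ASCII-only \w described in the documentation quoted above. The variable names are only illustrative.

```python
import re

word = "\u7a0b\u6770"            # the two characters from the question
text = " " + word + " abc"
pattern = r"\b" + word + r"\b"

# ASCII-only \w: neither character counts as a word character, so there
# is no \w/\W boundary next to them and \b cannot match there.
ascii_match = re.search(pattern, text, re.ASCII)

# Unicode-aware \w: both characters are alphanumeric in the Unicode
# database, so the boundaries exist and the search succeeds.
unicode_match = re.search(pattern, text, re.UNICODE)

# On Python 3 a str pattern with no flag behaves like re.UNICODE.
default_match = re.search(pattern, text)

print(ascii_match)                      # None
print(unicode_match.group() == word)    # True
print(default_match.group() == word)    # True
```

On Python 2 the roles are reversed: the ASCII-only behaviour is the default, and re.UNICODE has to be passed explicitly as shown in the reply.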