> Huh? I thought it was settled. Read Terry Ready's latest message. Read
> the bug report it points to (http://bugs.python.org/issue1693050),
> especially the contribution from MvL. To paraphrase a remark by the
> timbot, Martin reads Unicode tech reports so that we don't have to.
> However if you are a doubter or have insomnia, read
> http://unicode.org/reports/tr18/

To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least,
annex C was added somewhere between revisions 6 and 9, i.e. in early 2004.
Python's current definition of \w is a straightforward extension of the
historical \w definition (of Perl, I believe), which, unfortunately,
fails to recognize some of the Unicode subtleties.

In any case, the desired definition is readily available in Python
today - one just has to define a character class that contains all
characters that one thinks \w should contain, e.g. with the code below.
While the regular expression source becomes very large, the compiled
form will be fairly compact, and efficient in lookup.

Regards,
Martin

# UTR#18 says \w is
# \p{alpha}\p{gc=Mark}\p{digit}\p{gc=Connector_Punctuation}
#
# In turn, \p{alpha} is \p{Alphabetic}, which, in turn
# is Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic
# Other_Alphabetic can be ignored: it is a fixed list of
# characters from Mn and Mc, which are included, anyway
#
# \p{digit} is \p{gc=Decimal_Number}, i.e. Nd
# \p{gc=Mark} is all Mark category, i.e. Mc, Me, Mn
# \p{gc=Connector_Punctuation} is Pc

def make_w():
    import unicodedata, sys
    w_chars = []
    # + 1 so that the last code point, sys.maxunicode itself, is checked too
    for i in range(sys.maxunicode + 1):
        c = unichr(i)
        if unicodedata.category(c) in \
           ('Lu','Ll','Lt','Lm','Lo','Nl','Nd',
            'Mc','Me','Mn','Pc'):
            w_chars.append(c)
    return u'['+u''.join(w_chars)+u']'

import re
re.compile(make_w())
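
For anyone who wants a shorter pattern source, here is a rough sketch of the same idea that collapses runs of adjacent code points into a-b ranges. It is not part of the original code: it assumes Python 3 (chr instead of unichr, plain str instead of u'' literals), and the names W_CATEGORIES and make_w_ranges are made up for illustration. The category list is the one from the comment block in make_w() above.

```python
import re
import sys
import unicodedata

# General categories that make up \w according to the comment block above.
W_CATEGORIES = frozenset(
    ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl', 'Nd', 'Mc', 'Me', 'Mn', 'Pc'])


def make_w_ranges():
    """Return the source of a character class, using a-b ranges.

    Hypothetical helper: same character set as make_w(), but runs of
    adjacent code points are written as a single range.
    """
    parts = []
    start = None  # first code point of the run currently being collected
    # Iterating one past sys.maxunicode gives a sentinel step that
    # closes a run ending at the very last code point.
    for i in range(sys.maxunicode + 2):
        is_word = (i <= sys.maxunicode and
                   unicodedata.category(chr(i)) in W_CATEGORIES)
        if is_word and start is None:
            start = i
        elif not is_word and start is not None:
            end = i - 1
            if start == end:
                parts.append(re.escape(chr(start)))
            else:
                parts.append(re.escape(chr(start)) + '-' +
                             re.escape(chr(end)))
            start = None
    return '[' + ''.join(parts) + ']'


w_class = re.compile(make_w_ranges())
```

The compiled object is used like any other pattern, e.g. w_class.match(u'\u0301') for a combining acute accent (category Mn). I have not measured how much shorter the source string gets, so treat the range collapsing as a convenience rather than a proven speedup.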