[issue1693050] \w not helpful for non-Roman scripts

STINNER Victor Wed, 29 May 2013 13:34:46 -0700

STINNER Victor added the comment:

Let's look at Modules/_sre.c:


#define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
#define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')

>>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
[True, False, True, False, True, False]
>>> import unicodedata
>>> [unicodedata.category(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']

So matching stops at U+093F: its category is "spacing combining" (Mc), which 
belongs to the Mark category, whereas the re module only accepts alphanumeric 
characters (plus the underscore) for \w.
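
To make the stopping point concrete, here is a minimal sketch that mirrors the 
SRE_UNI_IS_WORD macro at the Python level. It assumes str.isalnum() behaves 
like Py_UNICODE_ISALNUM; the helper name sre_uni_is_word is made up for this 
example and is not part of the re module.

```python
import re
from itertools import takewhile


def sre_uni_is_word(ch):
    # Hypothetical Python-level mirror of SRE_UNI_IS_WORD:
    # an alphanumeric character or the underscore.
    return ch.isalnum() or ch == '_'


# The Hindi word from the interactive session above.
word = '\u0939\u093f\u0928\u094d\u0926\u0940'

# Leading run of characters that the macro would accept.
leading = ''.join(takewhile(sre_uni_is_word, word))
print(ascii(leading))

# What the re module itself matches on this interpreter, for comparison.
print(ascii(re.match(r'\w+', word).group()))
```

With this predicate the leading run contains only U+0939, because U+093F is 
the first character that is not alphanumeric.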

msg76557:

"""
Unicode TR#18 defines \w as a shorthand for

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
"""

So if we want to respect this standard, the re module needs to be modified to 
accept other Unicode categories.
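
As a rough illustration of the TR#18 definition quoted above, the following 
sketch builds a word-character test from unicodedata. It is only an 
approximation: str.isalpha() stands in for \p{alpha} (the real Alphabetic 
property is broader), and \p{digit} is taken to mean category Nd. The function 
name tr18_is_word is hypothetical, not something the re module provides.

```python
import unicodedata


def tr18_is_word(ch):
    # Approximation of the TR#18 shorthand for \w quoted above.
    cat = unicodedata.category(ch)
    return (
        ch.isalpha()            # stand-in for \p{alpha} (assumption)
        or cat.startswith('M')  # \p{gc=Mark}: Mn, Mc, Me
        or cat == 'Nd'          # \p{digit}, taken as decimal digits
        or cat == 'Pc'          # \p{gc=Connector_Punctuation}, includes '_'
    )


# The Hindi word from the interactive session above.
word = '\u0939\u093f\u0928\u094d\u0926\u0940'

# Every character of the word passes, including the Mc and Mn marks.
print(all(tr18_is_word(ch) for ch in word))
```

Under this definition the combining marks U+093F, U+094D and U+0940 count as 
word characters, so the whole word would be matched as a single unit.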

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1693050>
_______________________________________