New submission from David Lord:

This came up while writing a regex to match characters that are valid in Python identifiers for Jinja. https://github.com/pallets/jinja/pull/731

`\w` matches all valid identifier characters except for 4 special cases:

import unicodedata
import re
import sys

cre = re.compile(r'\w')

# Scan every code point; report those valid in identifiers but not matched by \w.
for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Python < 3.6 matches the two Mongolian characters; I'm not sure why 3.6 stopped matching them.

For our case, we just added them to a character set, `[\w\u1885\u1886\u2118\u212e]`.

This can cause unexpected behavior when using `\b`, since that's defined as the transition from `\w` to `\W` and those 4 characters aren't in `\w`. `re.match(r'\b[\w\u212e]', '℮')` fails to match.

----------
components: Regular Expressions, Unicode
messages: 297603
nosy: davidism, ezio.melotti, haypo, mrabarnett
priority: normal
severity: normal
status: open
title: re \w does not match some valid Unicode characters
type: behavior
versions: Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30838>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
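
To illustrate the `\b` problem described in the report, here is a minimal sketch. The character set is the one quoted in the report; the lookaround-based word-start boundary and the names `IDENT_CHAR`, `WORD_START`, `plain` and `extended` are assumptions made for this example, not something the report or the Jinja change necessarily uses.

```python
import re

# Character set from the report: \w plus the four exceptions.
IDENT_CHAR = r'[\w\u1885\u1886\u2118\u212e]'

# Hypothetical replacement for \b at the start of a word: the previous
# character (if any) is not in the set and the next character is.
WORD_START = r'(?<!' + IDENT_CHAR + r')(?=' + IDENT_CHAR + r')'

# \b sees no \w-to-\W transition before U+212E, so this does not match.
print(re.match(r'\b[\w\u212e]', '\u212e'))  # None

# With the explicit boundary, the same character is matched.
print(re.match(WORD_START + IDENT_CHAR, '\u212e') is not None)  # True

text = 'x = \u212ey'
plain = re.findall(r'\b\w+', text)
extended = re.findall(WORD_START + IDENT_CHAR + '+', text)
print(plain)     # ['x', 'y']
print(extended)  # ['x', '℮y']
```

With plain `\b\w+`, the identifier `℮y` is split and only `y` is found, because `℮` counts as a non-word character. The explicit lookarounds treat the extended set as the word class, at the cost of a longer pattern.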