New submission from David Lord:

This came up while writing a regex to match characters that are valid in Python 
identifiers for Jinja (https://github.com/pallets/jinja/pull/731). `\w` matches 
all valid identifier characters except for 4 special cases:

import unicodedata
import re
import sys

cre = re.compile(r'\w')

for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Python < 3.6 matches the two Mongolian characters. I'm not sure why 3.6 stopped 
matching them.
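
One way to look into this is to print each character's general category next to 
`str.isalnum()` and `str.isidentifier()`. This is only a diagnostic sketch: it 
assumes that `\w` in `re` tracks alphanumeric characters plus the underscore, and 
the printed values depend on the Unicode database bundled with the running 
Python, so no output is shown here.

```python
import unicodedata

# The four code points listed in the output above.
CODE_POINTS = [0x1885, 0x1886, 0x2118, 0x212E]

rows = []
for cp in CODE_POINTS:
    s = chr(cp)
    # Collect name, general category, and the two predicates being compared.
    rows.append((hex(cp), unicodedata.name(s), unicodedata.category(s),
                 s.isalnum(), s.isidentifier()))

for row in rows:
    print(*row)
```

Running it under two interpreter versions and comparing the category column 
should show whether the change comes from the Unicode data rather than from `re`.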

For our case, we just added them to a character set, 
`[\w\u1885\u1886\u2118\u212e]`.
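
A quick check of that character set, as a sketch. The names `EXTRA` and 
`ident_char` are made up for this example and are not taken from the Jinja 
change.

```python
import re

# The four code points reported above as identifier characters that \w misses.
EXTRA = "\u1885\u1886\u2118\u212e"

# \w extended with the four extra code points.
ident_char = re.compile(r"[\w\u1885\u1886\u2118\u212e]")

# Each of the four characters should now match, and ordinary word
# characters should still match as before.
extra_ok = all(ident_char.match(ch) for ch in EXTRA)
ascii_ok = all(ident_char.match(ch) for ch in "aZ_9")
print(extra_ok, ascii_ok)
```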

It can cause unexpected behavior when using `\b`, since that's defined as the 
transition from `\w` to `\W` and those 4 characters aren't in `\w`. 
`re.match(r'\b[\w\u212e]', '℮')` fails to match.
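
The sketch below reproduces this and shows one possible way around it: replacing 
a leading `\b` with explicit lookarounds built from the extended character set. 
The lookaround pattern and the names `WORD` and `START` are assumptions made for 
this example, not something `re` provides.

```python
import re

# The extended character set from above.
WORD = r"[\w\u1885\u1886\u2118\u212e]"

# \b only looks at \w, so it may see no boundary before U+212E.
plain = re.match(r"\b" + WORD, "\u212e")

# Hypothetical stand-in for a leading \b: not preceded by a WORD
# character, and followed by one.
START = r"(?<!" + WORD + r")(?=" + WORD + r")"
patched = re.match(START + WORD, "\u212e")

print(plain)    # None on the versions affected by this report
print(patched)
```

A trailing boundary would need the mirrored form, a lookbehind for `WORD` 
followed by a negative lookahead.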

----------
components: Regular Expressions, Unicode
messages: 297603
nosy: davidism, ezio.melotti, haypo, mrabarnett
priority: normal
severity: normal
status: open
title: re \w does not match some valid Unicode characters
type: behavior
versions: Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30838>
_______________________________________