[issue32987] tokenize.py parses unicode identifiers incorrectly

Serhiy Storchaka Wed, 14 Mar 2018 01:03:06 -0700

Serhiy Storchaka <[email protected]> added the comment:

This issue and issue12486 have nothing in common except that both are related 
to the tokenize module.


There are two bugs: an overly narrow definition of \w in the re module (see 
issue12731 and issue1693050) and an overly narrow definition of Name in the 
tokenize module.


>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata

>>> for c in regex.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'·'  U+0387  GREEK ANO TELEIA
'፩'  U+1369  ETHIOPIC DIGIT ONE
'፪'  U+136A  ETHIOPIC DIGIT TWO
'፫'  U+136B  ETHIOPIC DIGIT THREE
'፬'  U+136C  ETHIOPIC DIGIT FOUR
'፭'  U+136D  ETHIOPIC DIGIT FIVE
'፮'  U+136E  ETHIOPIC DIGIT SIX
'፯'  U+136F  ETHIOPIC DIGIT SEVEN
'፰'  U+1370  ETHIOPIC DIGIT EIGHT
'፱'  U+1371  ETHIOPIC DIGIT NINE
'᧚'  U+19DA  NEW TAI LUE THAM DIGIT ONE
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ'  U+1885  MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ'  U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'̀'  U+0300  COMBINING GRAVE ACCENT
'́'  U+0301  COMBINING ACUTE ACCENT
'̂'  U+0302  COMBINING CIRCUMFLEX ACCENT
'̃'  U+0303  COMBINING TILDE
...
[total 2177 characters]

The second bug can be solved by adding 14 more characters to the pattern for 
Name:

    Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'

or

    Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
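
A minimal sketch of how the second pattern could be checked against 
str.isidentifier(). The helper is_name and the sample strings are hypothetical 
and are not part of tokenize.py; the pattern is copied from above.

```python
import re

# Second pattern from above: one start character, then continuation characters.
NAME = re.compile(
    r'[\w\u2118\u212e]'
    r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
)

def is_name(s):
    """Return True if the whole string matches the proposed Name pattern."""
    return NAME.fullmatch(s) is not None

# Hypothetical samples: two start characters, two continuation characters,
# and one string that begins with a continuation-only character.
for s in ['\u2118', '\u212ex', 'a\xb7', 'a\u1369', '\xb7a']:
    # Columns: the string, the pattern's verdict, str.isidentifier()'s verdict.
    print(ascii(s), is_name(s), s.isidentifier())
```

The single-class form (the first pattern) would also accept a string such as 
'\xb7a', which str.isidentifier() rejects; the two-part form keeps start and 
continuation characters separate. Note that \w matches digits, so the pattern 
on its own also accepts strings that start with a digit.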

But the issue with \w should be resolved first, unless we want to add all 2177 
characters to the pattern.
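
To gauge what listing the characters explicitly would involve, the following 
sketch collects the continuation characters that str.isidentifier() accepts 
but re's \w rejects, and collapses them into ranges. The names to_ranges, esc 
and NAME_CONT are hypothetical, and the counts depend on the Unicode database 
of the running Python version, so they may differ from the 2177 reported above.

```python
import re
import sys

def to_ranges(codepoints):
    """Collapse a sorted iterable of code points into inclusive [lo, hi] runs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    return ranges

def esc(cp):
    """Spell a code point as a \\UXXXXXXXX escape that re understands."""
    return '\\U%08x' % cp

word = re.compile(r'\w')

# Continuation characters accepted by str.isidentifier() but not by re's \w.
missing = [cp for cp in range(sys.maxunicode + 1)
           if ('a' + chr(cp)).isidentifier() and not word.match(chr(cp))]
ranges = to_ranges(missing)

# Build a character class that extends \w with the missing ranges.
extra = ''.join(esc(lo) if lo == hi else esc(lo) + '-' + esc(hi)
                for lo, hi in ranges)
NAME_CONT = re.compile('[\\w' + extra + ']+')

print(len(missing), 'characters in', len(ranges), 'ranges')
```

This only illustrates the size of the explicit list; it covers continuation 
characters only and is not a proposed patch.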

The other solution is to implement Unicode property support in re (issue12734).

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue32987>
_______________________________________