Terry J. Reedy <[email protected]> added the comment:
I think the issues are slightly different. #12486 is about the awkwardness of
the API. This is about a false error after jumping through the hoops, which I
think Steve B did correctly.
Following the link, the Other_ID_Continue chars are
00B7 ; Other_ID_Continue # Po MIDDLE DOT
0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
The 2 Po chars fail, the 2 No chars work. After looking at the tokenize
module, I believe the problem is that the regular expression for Name is
r'\w+', and the Po chars are not treated as \w word characters.
>>> r = re.compile(r'\w+', re.U)
>>> re.match(r, 'ab\u0387cd')
<re.Match object; span=(0, 2), match='ab'>
I don't know whether the bug is a too-narrow definition of \w in the re
module ("most characters that can be part of a word in any language, as well
as numbers and the underscore") or of Name in the tokenize module.
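If Name turns out to be the culprit, one possible direction would be to widen the character class beyond \w. This is only a sketch, not a proposed patch: it hard-codes the four Other_ID_Continue code points from the list above purely for illustration, and a real fix would need to cover every identifier-valid character that \w misses.

```python
import re

# The four Other_ID_Continue code points quoted above, hard-coded
# here for illustration only.
OTHER_ID_CONTINUE = '\u00b7\u0387\u1369-\u1371\u19da'

# A widened Name pattern: \w plus the extra code points.
name = re.compile(r'[\w%s]+' % OTHER_ID_CONTINUE)

# The stock pattern stops at the GREEK ANO TELEIA; the widened one
# consumes the whole identifier.
print(re.match(r'\w+', 'ab\u0387cd'))  # matches only 'ab'
print(name.match('ab\u0387cd'))        # matches 'ab\u0387cd'
```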
Before patching anything, I would like to know if the 2 Po Other chars are the
only 2 not matched by \w. Unless someone has done so already, at least a
sample of chars from each category included in the definition of 'identifier'
should be tested.
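As a starting point for that survey, here is a brute-force scan (my sketch, not something already done in this issue) that compares str.isidentifier() against \w across the whole code-point range. Any character reported is accepted in an identifier continuation position but rejected by the Name regex:

```python
import re
import sys

word = re.compile(r'\w')
mismatches = []
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    # 'a' + ch is a valid identifier exactly when ch is valid in a
    # continuation position, per the language reference.
    if ('a' + ch).isidentifier() and not word.match(ch):
        mismatches.append(ch)

print(len(mismatches))
print([hex(ord(c)) for c in mismatches[:20]])
```

I would expect the two Po chars to show up in the output; whether anything else does (e.g. non-alphabetic Mn combining marks, which are ID_Continue but not alphanumeric) is exactly what the scan should tell us.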
----------
nosy: +serhiy.storchaka
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue32987>
_______________________________________