Andrew Barnert added the comment:

Ultimately, this happens because the tokenizer works byte by byte rather than 
character by character wherever possible. Since any byte >= 128 must be part 
of some non-ASCII character, and the only legal use for non-ASCII characters 
outside of quotes and comments is as part of an identifier, the tokenizer 
assumes (see the macros at the top of tokenizer.c, and the top of the again 
block in tok_get) that any byte >= 128 belongs to an identifier, and only 
checks the whole token with PyUnicode_IsIdentifier at the end.
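
For example, a no-break space in otherwise ordinary source ends up reported 
as an identifier problem. A minimal reproduction (the exact SyntaxError 
wording has varied across 3.x releases):

    try:
        compile("x =\u00a01", "<nbsp>", "exec")  # \u00a0 is NO-BREAK SPACE
    except SyntaxError as err:
        print(err.msg)  # e.g. "invalid character in identifier"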

This actually gives a better error for more visible glyphs, especially ones 
that look letter-like but aren't in XID_Continue, but it reads oddly for a 
few, like the no-break space.

If this needs to be fixed, I think the simplest fix is a special case: if the 
first invalid-identifier character is in category Z (the separators), report 
invalid whitespace instead of an invalid character in an identifier. (This 
would probably require adding a PyUnicode_CheckIdentifier that, instead of 
just returning 0 for failure as PyUnicode_IsIdentifier does, returns -n for a 
non-identifier character with code point n.)
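
A rough Python-level sketch of that special case (the real fix belongs in 
tokenizer.c; the identifier_error helper below is purely illustrative, 
standing in for the proposed PyUnicode_CheckIdentifier rather than any 
existing API):

    import unicodedata

    def identifier_error(name):
        # Find the first code point that keeps `name` from being a valid
        # identifier (the value the proposed PyUnicode_CheckIdentifier
        # would encode as -n), then choose an error message for it.
        for i, ch in enumerate(name):
            ok = ch.isidentifier() if i == 0 else ("a" + ch).isidentifier()
            if not ok:
                if unicodedata.category(ch).startswith("Z"):  # separator
                    return "invalid whitespace character U+%04X" % ord(ch)
                return "invalid character U+%04X in identifier" % ord(ch)
        return None  # valid identifier

    print(identifier_error("x\u00a0y"))  # no-break space is in category Zs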

----------
nosy: +abarnert

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26152>
_______________________________________