[Python-ideas] Re: Improve handling of Unicode quotes and hyphens

Andrew Barnert via Python-ideas Fri, 15 May 2020 18:18:27 -0700

On May 14, 2020, at 20:01, Stephen J. Turnbull 
<turnbull.stephen...@u.tsukuba.ac.jp> wrote:
> 
> Executive summary:
> 
> AFAICT, my guess at what's going on in the C tokenizer was exactly
> right.  It greedily consumes as many non-operator, non-whitespace
> characters as possible, then validates.


Well, it looks like it’s not quite “non-operator, non-whitespace characters”, 
but rather “ASCII identifier or non-ASCII characters”:

>              (c >= 'a' && c <= 'z')\
>               || (c >= 'A' && c <= 'Z')\
>               || c == '_'\
>               || (c >= 128))

(That’s the initial char rule; the continuing char rule is similar but of 
course allows digits.)

So it won’t treat a $ or a ^G as potentially part of an identifier, and the 
caret will show up in the right place for one of those. But it will treat an 
emoji as potentially part of an identifier, so (if that emoji is immediately 
followed by legal identifier characters, ASCII or otherwise) the caret will 
show up too far to the right.
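
A rough Python model of that scanning rule, for illustration only. The function 
names are made up for this sketch, and it works on code points, whereas the C 
code tests UTF-8 bytes. It only shows how far the greedy scan runs before any 
validation happens.

```python
def is_potential_identifier_start(ch: str) -> bool:
    # Mirrors the quoted C condition: ASCII letter, underscore, or non-ASCII.
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_" or ord(ch) >= 128


def is_potential_identifier_char(ch: str) -> bool:
    # Continuing-character rule: the same test, plus ASCII digits.
    return is_potential_identifier_start(ch) or ("0" <= ch <= "9")


def greedy_identifier_end(text: str, start: int) -> int:
    """Return the index just past the run the scan would consume from start.

    Returns start itself if the character there cannot begin a run.
    """
    if start >= len(text) or not is_potential_identifier_start(text[start]):
        return start
    end = start + 1
    while end < len(text) and is_potential_identifier_char(text[end]):
        end += 1
    return end


# "$" is never consumed, so an error can be reported right at it.
print(greedy_identifier_end("$spam", 0))

# An emoji followed by identifier characters is swallowed as one run, and the
# run only fails validation afterwards.
source = "\U0001F40Dspam = 1"
run_end = greedy_identifier_end(source, 0)
print(run_end, source[:run_end].isidentifier())
```

In this model the run for the emoji case ends after "spam", which is why a 
caret placed at the end of the run lands to the right of the offending 
character.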

I’m still glad the Python tokenizer doesn’t do this (because, as I said, I’ve 
relied on the documented behavior in import hooks for playing around with 
Python, and they use the Python tokenizer), but that doesn’t matter for the C 
tokenizer, because its output is not public; it’s only seen by the parser. And 
I think you can prove that the error caret placement is the only thing that 
could be affected by this shortcut.[1] And if it makes the tokenizer faster, or 
just simpler to maintain, that could easily be worth it.

(At least until one of those periodic “Python should add this Unicode operator” 
proposals actually gets some traction, but I don’t see that as likely any time 
soon.)

--

[1] Python only allows non-ASCII characters in identifiers, strings, and 
comments. Therefore, any string of characters that should be tokenized as a 
sequence of 1 ERRORTOKEN followed by 0 or more NAME and ERRORTOKEN tokens by 
the documented rule (and the Python code) will still give you a sequence of 1 
ERRORTOKEN followed by 0 or more NAME and ERRORTOKEN tokens by the C code, just 
not necessarily the same such sequence. And any such sequence will be parsed as 
a SyntaxError pointing at the end of the initial ERRORTOKEN. So, the caret 
might be somewhere else within that block of identifier and non-ASCII 
characters, but it will be somewhere within that block.
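
A toy check of that argument, under loud assumptions: the regex below is a 
simplified stand-in for the documented rule (not the real tokenize module), and 
greedy_block_end is a hypothetical model of the C shortcut. It only compares 
where the first error ends against the extent of the greedy block.

```python
import re

# Simplified stand-in for the documented rule (an assumption): a run of word
# characters is a NAME, any other single non-space character is an ERRORTOKEN.
TOKEN_RE = re.compile(r"(?P<NAME>\w+)|(?P<ERRORTOKEN>\S)")


def documented_tokens(text):
    """Return (kind, string, start, end) tuples under the simplified rule."""
    return [(m.lastgroup, m.group(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


def greedy_block_end(text):
    """Return the end of the leading run of ASCII identifier or non-ASCII chars."""
    end = 0
    while end < len(text) and (
        (text[end].isascii() and (text[end].isalnum() or text[end] == "_"))
        or ord(text[end]) >= 128
    ):
        end += 1
    return end


block = "\U0001F40Dspam\U0001F40Deggs"
tokens = documented_tokens(block)
kinds = [kind for kind, *_ in tokens]
# End of the initial ERRORTOKEN, where the documented rule would put the caret.
first_error_end = next(end for kind, _, _, end in tokens if kind == "ERRORTOKEN")
block_end = greedy_block_end(block)

print(kinds)
print(first_error_end, block_end)
```

Both positions fall inside the same block of identifier and non-ASCII 
characters; they just are not the same position.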
