On May 13, 2020, at 05:31, Richard Damon <rich...@damon-family.org> wrote:
> 
> On 5/13/20 2:22 AM, Stephen J. Turnbull wrote:
>> MRAB writes:
>>> 
>>> This isn't a parsing problem as such.  I am not an expert on the
>>> parser, but what's going on is something like this: the parser
>>> (tokenizer) sees the character "=" and expects an operator.  Next, it
>>> sees something that is not "=" and not whitespace, so it expects a
>>> literal or an identifier.  " “" is not parsable as the start of a
>>> literal, so the parser consumes up to the next boundary character
>>> (whitespace or operator).  Now it checks for the different types of
>>> barewords: keywords and identifiers, and neither one works.
>>> 
>>> Here's the critical point: identifier fails because the tokenizer
>>> tries to match a sequence of Unicode word constituents, and " “"
>>> isn't one.  So it fails the sequence of non-whitespace characters, and
>>> points to the end of the last thing it saw.
>> But that is the problem, identifier fails too late, it should have seen
>> at the start that the first character wasn't valid in an identifier, and
>> failed THERE, pointing at the bad character. There shouldn't be a
>> post-hoc test for bad characters in the identifier, it should be a
>> pre-test in the tokenizer.
>> 
>> So I see no reason why we need to transition to the new parser to fix
>> this.  (And the new parser (as of the last comment I saw from Guido)
>> probably doesn't help: he kept the tokenizer.)  We just need to make a
>> second pass over the invalid identifier and identify the invalid
>> characters it contains and their positions.
> There is no need to rescan/reparse, the tokenizer shouldn't treat
> illegal characters as possibly part of a token.

Isn’t this what already happens?

    >>> import tokenize, io
    >>> def tok(s): return list(tokenize.tokenize(io.BytesIO(s.encode()).readline))
    >>> tok('spam(“Abc”)')

When I run this in 3.7, the fourth token is an ERRORTOKEN with string “, then 
there’s a NAME with Abc, then another ERRORTOKEN with ”.
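For anyone who wants to see it inline, here’s a quick way to dump the token 
names and strings (exact output details vary a bit by version, so take this as 
a sketch):

    >>> for t in tok('spam(“Abc”)'):
    ...     print(tokenize.tok_name[t.type], repr(t.string))

On 3.7 that shows, roughly: ENCODING, NAME 'spam', OP '(', ERRORTOKEN '“', 
NAME 'Abc', ERRORTOKEN '”', OP ')', plus the trailing NEWLINE/ENDMARKER.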

And reading the Lexical Analysis chapter of the docs, this seems correct. The 
smart quote is not a possible xid_start, or the start of any other token 
terminal, so it should immediately fail as an error. (The fact that the 
tokenizer eats it, generates an ERRORTOKEN, and then lexes the Abc as a NAME, 
rather than throwing an exception or otherwise punting, is a pretty nice 
error-recovery attempt, and seems perfectly reasonable.)
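You can check the xid_start claim directly from Python, without digging through 
the spec; str.isidentifier is meant to follow the same xid_start/xid_continue 
rules, so this is just a quick illustration:

    >>> import unicodedata
    >>> unicodedata.category('“')    # U+201C is initial punctuation, not a letter
    'Pi'
    >>> '“'.isidentifier()           # so it can't start an identifier...
    False
    >>> ('a' + '“').isidentifier()   # ...or continue one
    False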

Is that not true for the internal C tokenizer? Or is it true, but the parser or 
the error generating code isn’t taking advantage of it?

(By the way, I’m pretty sure this behavior isn’t specific to 3.7, but has been 
that way back into the mists of whenever you could first write old-style import 
hooks, even down to the way error recovery works. I’ve taken advantage of this 
behavior in experimenting with new syntax. If your new syntax is unambiguous 
not just at the parser level, but even at the lexical level, you can just 
scan the token stream for your matching ERRORTOKEN.)
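In case that trick isn’t clear, here’s a rough sketch of what I mean by 
scanning for your ERRORTOKEN (the name find_error_tokens is made up for 
illustration, and it assumes the 3.7-era behavior described above, where 
unknown characters come out as ERRORTOKEN rather than raising):

    import io, tokenize

    def find_error_tokens(source, wanted=None):
        # Tokenize the source and yield ERRORTOKENs, optionally only the
        # ones whose string matches the new lexical marker you're testing.
        readline = io.BytesIO(source.encode('utf-8')).readline
        for t in tokenize.tokenize(readline):
            if t.type == tokenize.ERRORTOKEN and (wanted is None or t.string == wanted):
                yield t

    for t in find_error_tokens('spam(“Abc”)'):
        print(t.string, t.start)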
