[Python-ideas] Re: Improve handling of Unicode quotes and hyphens

Joao S. O. Bueno Wed, 13 May 2020 06:16:23 -0700

Wow - a lot going on in this thread,
even though what to do seems really obvious:
of course, showing which character triggered the error,
along with a proper plain-English phrase, is enough.
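
To make that concrete, here is a rough sketch of the kind of check I mean.
It is not how CPython's tokenizer works; the function name
invalid_identifier_chars is made up for illustration, and it only leans on
str.isidentifier() and unicodedata.name() from the standard library.

```python
import unicodedata


def invalid_identifier_chars(token):
    """Return (index, char, Unicode name) for every character in `token`
    that cannot appear at that position in a Python identifier.

    Hypothetical helper, for illustration only.
    """
    bad = []
    for i, ch in enumerate(token):
        # A character is fine if it can start an identifier, or, when it
        # is not the first character, if it can continue one.
        ok = ch.isidentifier() or (i > 0 and ("a" + ch).isidentifier())
        if not ok:
            bad.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return bad


for index, ch, name in invalid_identifier_chars("\u201chello\u201d"):
    print(f"invalid character {ch!r} ({name}) at offset {index}")
```

With that information the message can point at the exact column and name
the character in plain English.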


The Fortran anecdote at the beginning of the thread is a
false analogy, since in Python the program _will_ _not_ run with
unexpected results, and, whether on a PC or submitting code to a
testing pipeline, it takes seconds to retry.

Converting spurious characters back to their ASCII equivalents
could be handled by a popular code-formatting tool like Black.
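
As a very rough sketch of what such a fix-up could look like (this is not
an existing Black feature, and the ASCII_EQUIVALENTS table and asciify name
are my own assumptions, covering only a handful of characters):

```python
# Hypothetical mapping from a few commonly pasted characters to ASCII.
ASCII_EQUIVALENTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2010": "-",  # HYPHEN
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
}
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def asciify(source):
    """Replace the characters listed above with their ASCII look-alikes."""
    return source.translate(_TABLE)


print(asciify("x = \u201ca\u201d \u2212 1"))
```

Note that this blindly rewrites the whole text, including string literals
and comments where those characters may be intentional, so a real tool
would have to restrict itself to the spots where the tokenizer fails.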

   js
 -><-

On Wed, 13 May 2020 at 09:31, Richard Damon <rich...@damon-family.org> wrote:
>
> On 5/13/20 2:22 AM, Stephen J. Turnbull wrote:
> > MRAB writes:
> >  > On 2020-05-11 09:21, Chris Angelico wrote:
> >  > > On Mon, May 11, 2020 at 6:09 PM Steve Barnes <gadgetst...@live.co.uk> wrote:
> >
> >  > >> Actually, in the case of the “wrong quotes” it puts the pointer
> >  > >> under the character before the space character or at the end of
> >  > >> the line (if you have a fixed spacing font – worse if you don’t)
> >  > >> – it still doesn’t tell you which character is invalid.
> >
> >  > > This is actually a good point.
> >
> > But it's a different point:
> >
> >  > > Having an invalid character in an identifier shows the caret at
> >  > > the end of the identifier, regardless of where in the identifier
> >  > > the error is. That's something that could be improved on,
> >  > > regardless of the quote issue. There's a new parser on its way
> >  > > (PEP 617), so it'd be something to consider on that basis.
> >
> > This isn't a parsing problem as such.  I am not an expert on the
> > parser, but what's going on is something like this: the parser
> > (tokenizer) sees the character "=" and expects an operator.  Next, it
> > sees something that is not "=" and not whitespace, so it expects a
> > literal or an identifier.  " “" is not parsable as the start of a
> > literal, so the parser consumes up to the next boundary character
> > (whitespace or operator).  Now it checks for the different types of
> > barewords: keywords and identifiers, and neither one works.
> >
> > Here's the critical point: identifier fails because the tokenizer
> > tries to match a sequence of Unicode word constituents, and " “"
> > isn't one.  So it fails the sequence of non-whitespace characters, and
> > points to the end of the last thing it saw.
> But that is the problem: identifier fails too late. It should have seen
> at the start that the first character wasn't valid in an identifier, and
> failed THERE, pointing at the bad character. There shouldn't be a
> post-hoc test for bad characters in the identifier; it should be a
> pre-test in the tokenizer.
> >
> > So I see no reason why we need to transition to the new parser to fix
> > this.  (And the new parser (as of the last comment I saw from Guido)
> > probably doesn't help: he kept the tokenizer.)  We just need to make a
> > second pass over the invalid identifier and identify the invalid
> > characters it contains and their positions.
> There is no need to rescan/reparse, the tokenizer shouldn't treat
> illegal characters as possibly part of a token.
> >
> >  > I wouldn't object if the syntax error reported that, say, the wrong type
> >  > of quote was being used and included something like: Do you mean "?
> >  >
> >  > Wrong kind of quote (not "). Wrong kind of hyphen or minus (-). Etc.
> >
> > As a permanent resident of Japan, I DEMAND that YOU PERSONALLY
> > implement the SAME TEST for all the Japanese "full-width" operator
> > characters. :-)  (This is actually a very common user error, and it's
> > very hard to tell the difference by sight in many fonts, same as
> > directed quotes vs. ASCII quotes in English, but for the whole ASCII
> > repertoire.)  This could get really ridiculous.
> >
> > I think the suggestion that whatever test it is that identified the
> > "invalid character in identifier" defect be fixed to report both the
> > position of the first such character and the list of all such
> > characters is appropriate.
> >
> > The "wrong kind of quote" stuff belongs elsewhere, and in particular
> > in a linter.  Here's a quasi-algorithmic suggestion for that: use the
> > Unicode confusables list (and I think there are many properties such
> > as "related" and "paired" characters that can be indicative).  Haven't
> > looked at it in a while; it may not catch all the issues here.  But it
> > would be a good start, and quite comprehensive.  It might suggest
> > other things linters could be doing, too.
> >
> > Steve
> > _______________________________________________
> > Python-ideas mailing list -- python-ideas@python.org
> > To unsubscribe send an email to python-ideas-le...@python.org
> > https://mail.python.org/mailman3/lists/python-ideas.python.org/
> > Message archived at 
> > https://mail.python.org/archives/list/python-ideas@python.org/message/76BSFQ2AVRQM2BHFJZ4AE3KMAI7KX2N5/
> > Code of Conduct: http://python.org/psf/codeofconduct/
>
>
> --
> Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/QNY2JZQFUSRIZ2UO2F565BZ23B4TCZZK/
Code of Conduct: http://python.org/psf/codeofconduct/
