Wow, a lot going on in this thread, even though what to do seems really obvious: showing which character triggered the error, along with a proper plain-English phrase, is enough.
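
To make that concrete, here is a rough sketch of what such a message could look like. The helper name `describe_invalid_char` and the wording of the message are made up for illustration; this is not what CPython prints, and it assumes the column of the offending character is already known.

```python
import unicodedata


def describe_invalid_char(line: str, col: int) -> str:
    """Build a caret-style message for the character at line[col].

    Hypothetical helper for illustration; not CPython's actual error text.
    """
    ch = line[col]
    # Fall back to a generic label for code points without a Unicode name.
    name = unicodedata.name(ch, "unnamed character")
    # The caret only lines up in a fixed-width font, as noted in the thread.
    caret = " " * col + "^"
    return (
        f"{line}\n"
        f"{caret}\n"
        f"SyntaxError: invalid character '{ch}' (U+{ord(ch):04X}, {name})"
    )


print(describe_invalid_char("s = “hello”", 4))
```

Naming the code point in words helps exactly where the caret does not: when two characters look the same in the reader's font.
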
The Fortran anecdote at the beginning of the thread is a false analogy: in Python the program _will_ _not_ run with unexpected results, and whether you are at a PC or submitting code to a testing pipeline, it takes seconds to retry. Fixing spurious characters back to their ASCII equivalents could be done in a popular code-formatting tool like Black.

   js
 -><-

On Wed, 13 May 2020 at 09:31, Richard Damon <rich...@damon-family.org> wrote:
>
> On 5/13/20 2:22 AM, Stephen J. Turnbull wrote:
> > MRAB writes:
> > > On 2020-05-11 09:21, Chris Angelico wrote:
> > > > On Mon, May 11, 2020 at 6:09 PM Steve Barnes <gadgetst...@live.co.uk> wrote:
> >
> > > >> Actually, in the case of the “wrong quotes” it puts the pointer
> > > >> under the character before the space character or at the end of
> > > >> the line (if you have a fixed spacing font – worse if you don’t)
> > > >> – it still doesn’t tell you which character is invalid.
> >
> > > > This is actually a good point.
> >
> > But it's a different point:
> >
> > > > Having an invalid character in an identifier shows the caret at
> > > > the end of the identifier, regardless of where in the identifier
> > > > the error is. That's something that could be improved on,
> > > > regardless of the quote issue. There's a new parser on its way
> > > > (PEP 617), so it'd be something to consider on that basis.
> >
> > This isn't a parsing problem as such. I am not an expert on the
> > parser, but what's going on is something like this: the parser
> > (tokenizer) sees the character "=" and expects an operator. Next, it
> > sees something that is not "=" and not whitespace, so it expects a
> > literal or an identifier. " “" is not parsable as the start of a
> > literal, so the parser consumes up to the next boundary character
> > (whitespace or operator). Now it checks for the different types of
> > barewords: keywords and identifiers, and neither one works.
> >
> > Here's the critical point: identifier fails because the tokenizer
> > tries to match a sequence of Unicode word constituents, and " “"
> > isn't one. So it fails the sequence of non-whitespace characters, and
> > points to the end of the last thing it saw.
> But that is the problem, identifier fails too late, it should have seen
> at the start that the first character wasn't valid in an identifier, and
> failed THERE, pointing at the bad character. There shouldn't be a
> post-hoc test for bad characters in the identifier, it should be a
> pre-test in the tokenizer.
> >
> > So I see no reason why we need to transition to the new parser to fix
> > this. (And the new parser (as of the last comment I saw from Guido)
> > probably doesn't help: he kept the tokenizer.) We just need to make a
> > second pass over the invalid identifier and identify the invalid
> > characters it contains and their positions.
> There is no need to rescan/reparse, the tokenizer shouldn't treat
> illegal characters as possibly part of a token.
> >
> > > I wouldn't object if the syntax error reported that, say, the wrong type
> > > of quote was being used and included something like: Do you mean "?
> > >
> > > Wrong kind of quote (not "). Wrong kind of hyphen or minus (-). Etc.
> >
> > As a permanent resident of Japan, I DEMAND that YOU PERSONALLY
> > implement the SAME TEST for all the Japanese "full-width" operator
> > characters. :-) (This is actually a very common user error, and it's
> > very hard to tell the difference by sight in many fonts, same as
> > directed quotes vs. ASCII quotes in English, but for the whole ASCII
> > repertoire.) This could get really ridiculous.
> >
> > I think the suggestion that whatever test it is that identified the
> > "invalid character in identifier" defect be fixed to report both the
> > position of the first such character and the list of all such
> > characters is appropriate.
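
For what it's worth, reporting "the position of the first such character and the list of all such characters" takes only a few lines. The sketch below is an assumption-laden illustration in pure Python, not the CPython tokenizer: the function name `invalid_identifier_chars` is invented, and it leans on `str.isidentifier()` to decide whether a character may start or continue an identifier.

```python
from __future__ import annotations


def invalid_identifier_chars(word: str) -> list[tuple[int, str]]:
    """Return (index, character) for every character that cannot appear
    at its position in a Python identifier.

    Illustration only: a per-character check built on str.isidentifier().
    """
    bad = []
    for i, ch in enumerate(word):
        if i == 0:
            # A lone character is an identifier only if it may start one.
            ok = ch.isidentifier()
        else:
            # Prefixing a known-good start character tests "may continue".
            ok = ("a" + ch).isidentifier()
        if not ok:
            bad.append((i, ch))
    return bad


found = invalid_identifier_chars("“hello”")
print(found[0] if found else None)  # first offending position and character
print(found)                        # all of them
```

Whether this runs as a second pass or as a pre-test inside the tokenizer is the point being argued above; the check itself is the same either way.
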
> >
> > The "wrong kind of quote" stuff belongs elsewhere, and in particular
> > in a linter. Here's a quasi-algorithmic suggestion for that: use the
> > Unicode confusables list (and I think there are many properties such
> > as "related" and "paired" characters that can be indicative). Haven't
> > looked at it in a while; it may not catch all the issues here. But it
> > would be a good start, and quite comprehensive. It might suggest
> > other things linters could be doing, too.
> >
> > Steve
> > _______________________________________________
> > Python-ideas mailing list -- python-ideas@python.org
> > To unsubscribe send an email to python-ideas-le...@python.org
> > https://mail.python.org/mailman3/lists/python-ideas.python.org/
> > Message archived at
> > https://mail.python.org/archives/list/python-ideas@python.org/message/76BSFQ2AVRQM2BHFJZ4AE3KMAI7KX2N5/
> > Code of Conduct: http://python.org/psf/codeofconduct/
>
> --
> Richard Damon
> Message archived at
> https://mail.python.org/archives/list/python-ideas@python.org/message/HONT7LKB745XHUGR76VQKHPXBCI5DJKA/

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/QNY2JZQFUSRIZ2UO2F565BZ23B4TCZZK/
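
On the linter/formatter side, a rough sketch of the "did you mean the ASCII character?" lookup might look like the following. Everything here is an assumption for illustration: the `LOOKALIKES` table is a tiny hand-picked sample, not the Unicode confusables data, and `suggest_ascii` is an invented name. Full-width forms are handled through NFKC normalization, which folds them to their ASCII counterparts; typographic quotes and dashes have no such mapping, hence the table.

```python
from __future__ import annotations

import unicodedata

# Hand-picked examples only; the Unicode confusables data is far larger.
LOOKALIKES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}


def suggest_ascii(ch: str) -> str | None:
    """Return an ASCII character that ch was probably meant to be, or None."""
    if ch.isascii():
        return None  # already ASCII, nothing to suggest
    if ch in LOOKALIKES:
        return LOOKALIKES[ch]
    # Full-width forms such as U+FF1D normalize to plain ASCII under NFKC.
    folded = unicodedata.normalize("NFKC", ch)
    if len(folded) == 1 and folded.isascii():
        return folded
    return None


for ch in "“＝é":
    print(repr(ch), "->", repr(suggest_ascii(ch)))
```

A formatter could apply such a mapping only outside string literals and comments, where a curly quote may well be intentional.
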