[issue40593] Improve error reporting for invalid character in source code

2020-05-12 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40593] Improve error reporting for invalid character in source code

2020-05-12 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:


New changeset 74ea6b5a7501fb393cd567fb21998d0bfeeb267c by Serhiy Storchaka in 
branch 'master':
bpo-40593: Improve syntax errors for invalid characters in source code. 
(GH-20033)
https://github.com/python/cpython/commit/74ea6b5a7501fb393cd567fb21998d0bfeeb267c


--




[issue40593] Improve error reporting for invalid character in source code

2020-05-11 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
keywords: +patch
pull_requests: +19342
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/20033




[issue40593] Improve error reporting for invalid character in source code

2020-05-11 Thread Serhiy Storchaka

New submission from Serhiy Storchaka :

Currently you get a SyntaxError with the message "invalid character in
identifier" in two cases:

1. The source code contains a non-ASCII character that cannot be part of an
identifier. This usually happens when you copy code from a web page or PDF
file that was "improved" by some enhancer which replaces spaces with
non-breaking spaces, the ASCII minus with a dash or Unicode minus, and ASCII
quotes with fancy Unicode quotes. Such characters do not look like part of an
identifier at all. The error message also does not say which character is
invalid, and the culprit is hard to find because these characters look very
similar to the correct ones (especially in some monospace fonts).

See 
https://mail.python.org/archives/list/python-id...@python.org/thread/ILMNJ46EAL4ENYK7LLDLGIMYQKZAMMWU/
 for discussion.

2. The other case is very special: the source code contains a declaration of
the UTF-8 encoding followed by byte sequences that are not valid UTF-8. This
rarely happens in the real world.
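
For case 1, the offending characters can also be located by hand. The
following is only a rough sketch, not part of the proposed PR; the function
name find_suspicious_characters is made up for illustration. It assumes the
source is already decoded to str, and it flags every non-ASCII character that
could not continue an identifier, so it will also report such characters
inside string literals and comments, where they are perfectly legal.

```python
import unicodedata


def find_suspicious_characters(source):
    """Yield (line, column, char, code point, name) for every non-ASCII
    character in `source` that could not be part of an identifier."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 128:
                continue  # plain ASCII is left to the real tokenizer
            # "a" + ch is a valid identifier only if ch may continue one.
            if ("a" + ch).isidentifier():
                continue
            name = unicodedata.name(ch, "<unnamed>")
            yield lineno, col, ch, f"U+{ord(ch):04X}", name


if __name__ == "__main__":
    # \u2014 is the em dash from the example in this report.
    for hit in find_suspicious_characters("print(123\u201445)"):
        print(hit)
```

Line and column numbers here are 1-based and counted in code points, which
may not match what an editor displays for lines containing wide characters.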

The proposed PR improves errors for these cases.

>>> print(123—45)
  File "<stdin>", line 1
    print(123—45)
             ^
SyntaxError: invalid character '—' (U+2014)

* The error message no longer contains the misleading "in identifier".

* The error message contains the invalid character, both literally and as its
hexadecimal code point.

* The caret points at the invalid character. Previously it pointed at the last
non-ASCII or non-alphabetical character following the invalid character (5 in
the example above).

* For the special case of a non-decodable UTF-8 sequence the syntax error
message is more informative: "(unicode error) 'utf-8' codec can't decode byte
0xff ...", although this case still needs further improvement.

--
components: Interpreter Core
messages: 368622
nosy: serhiy.storchaka
priority: normal
severity: normal
status: open
title: Improve error reporting for invalid character in source code
type: enhancement
versions: Python 3.9
