On 01. 11. 21 13:17, Petr Viktorin wrote:
Hello,
Today, an attack called "Trojan source" was revealed, where a malicious
contributor can use Unicode features (bidirectional text and homoglyphs)
to write code that, when shown in an editor, will look different from how
a computer language parser will process it.
See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report
and decided that it should be handled in code editors, diff viewers,
repository frontends and similar software, rather than in the language.
I agree: in my opinion, the attack is similar to abusing any other
"gotcha" where Python doesn't parse text as a non-expert human would.
For example: `if a or b == 'yes'`, mutable default arguments, or a
misleading typo.
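
To make the first two gotchas concrete, here is a minimal sketch. The names
(``a``, ``b``, ``append_item``, ``bucket``) are made up for illustration and
are not taken from the PEP:

```python
# Gotcha 1: `a or b == 'yes'` parses as `a or (b == 'yes')`,
# not as `(a == 'yes') or (b == 'yes')`.
a, b = 'no', 'no'
naive = bool(a or b == 'yes')        # True, because the string 'no' is truthy
intended = a == 'yes' or b == 'yes'  # False, which is what a reader expects

# Gotcha 2: a mutable default argument is created once, when the function
# is defined, and then shared by every call that omits the argument.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)  # same list object as `first`; now holds [1, 2]
```

In both cases Python does exactly what the language reference says; the
surprise comes only from reading the code the way a non-expert would.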
Nevertheless, I did do a bit of research about similar gotchas in
Python, and I'd like to publish a summary as an informational PEP,
pasted below.
Thanks for the comments, everyone! I've updated the document and sent it
to https://github.com/python/peps/pull/2129
A rendered version is at
https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst
Toshio Kuratomi wrote:
`Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.
Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and
encodings -- that's much too technical for this document. Using the
specific technical terms unfortunately doesn't help understanding, so I
use the vague ones like "character" and "letter".)
Jim J. Jewett wrote:
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete
Python statement."
Normally, an identifier must begin with a letter, and numbers can only be used in the
second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some
characters with numeric values are considered letters (in this case, category Lo, Other
Letters) is a different problem than just looking visually confusable with "+",
and it should probably be listed on its own.
I'm not a native speaker, but as I understand it, "十" is closer to a
single-letter word than a single-digit number. It translates better as
"ten" than "10". (And it appears in "十四", "fourteen", just like "four"
appears in "fourteen".)
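
For anyone who wants to check how Python itself classifies this character,
a small sketch using the standard ``unicodedata`` module follows. The
variable names are only illustrative:

```python
import unicodedata

ten = "\u5341"  # the character 十

category = unicodedata.category(ten)      # 'Lo' (Letter, other)
numeric_value = unicodedata.numeric(ten)  # 10.0, even though it is a letter
is_identifier = ten.isidentifier()        # True: usable as a variable name

# The statement quoted above really is an assignment to a variable named 十.
namespace = {}
exec("十= 10", namespace)
assigned = namespace[ten]  # 10
```

So the character is a letter as far as identifiers are concerned, while
still carrying a numeric value in the Unicode database.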
Patrick Schultz wrote:
- The Unicode consortium has a list of confusables, in case useful
Yup, and it's linked from the documents that describe how to use it. I
link to those rather than just the list.
But thank you!
Terry Reedy wrote:
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]
There are at least three levels of handling r2l chars: none, local (contiguous
sequences are properly reversed), and extended (see below). The handling
depends on the display software and may depend on the quoting. Tk and hence
tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local
handling of unquoted code but extended handling of quoted text. Windows
Notepad currently does extended handling even without quotes.
I'd like to leave these details out of the document. The examples should
render convincingly in browsers. The text should now describe the
behavior even if you open it in an editor that does things differently,
and acknowledge that such editors exist. (The behavior of specific
editors/toolkits might well change in the future.)
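
Whatever a given editor displays, the characters themselves can be inspected
directly. The following is a rough sketch, not something taken from the PEP:
``BIDI_CONTROLS`` and ``find_bidi_controls`` are hypothetical names, and the
set only covers the explicit embedding, override and isolate characters.

```python
import unicodedata

# Explicit bidirectional formatting characters:
# embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(source):
    """Return (line number, column, character name) for each control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits

# A string literal with a right-to-left override hidden inside it.
sample = 'x = "a\u202eb"\ny = 1\n'
hits = find_bidi_controls(sample)

# Ordinary letters also carry a direction: Hebrew alef is 'R', Latin a is 'L'.
alef_class = unicodedata.bidirectional("\u05d0")
latin_class = unicodedata.bidirectional("a")
```

A check like this only reports where such characters occur; deciding whether
a particular use is legitimate still needs a human.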
For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::
I don't see the connection between the text above and the example that follows.
# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding
declaration. The ``unicode_escape`` encoding instructs Python to treat
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
a comma (punctuator), etc.
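
The effect can be seen without writing a file by decoding a line of bytes by
hand. This is only a sketch: it calls ``bytes.decode`` directly rather than
going through an encoding declaration, and the line of "source" is invented
for illustration.

```python
# As stored on disk: to a tool that ignores the encoding declaration,
# everything from 'visible up to the quote before hidden is one string.
raw_line = rb"items = ['visible\u0027\u002c 'hidden']"

# What Python sees after decoding: \u0027 became a quote that ends the
# string, and \u002c became a comma separating two list items.
decoded_line = raw_line.decode("unicode_escape")

namespace = {}
exec(decoded_line, namespace)
items = namespace["items"]  # ['visible', 'hidden']
```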
Steven D'Aprano wrote:
Before the age of computers, most mechanical typewriters lacked the keys
for the digits ``0`` and ``1``
I'm not sure that "most" is justified here. One of the most popular
typewriters in history, the Underwood #5 (from 1900 to 1920), lacked
the 1 key but had a 0 distinct from O.
https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg
The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford
Typewriter. As did possibly the best selling typewriter in history, the
IBM Selectric (introduced in 1961).
http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery
Perhaps you should say "many older mechanical typewriters"?
Ah, interesting! I only ever saw and read about ones that have a bunch
of accented letters, leaving no space for dedicated 0/1 keys :)
My typewriter looks like this: https://imgur.com/a/J34gqVZ
Bidirectional Text
------------------
The section on bidirectional text is interesting, because reading it in
my email client mutt, all the examples are left to right.
You might like to note that not all applications support bidirectional
text.
It might be handled by your terminal rather than mutt.
I made the text work even if the examples don't render the way I'd like.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/OB6C54HCBESUTANUVOTTIUI7N2IYDPQV/