On 01. 11. 21 13:17, Petr Viktorin wrote:
Hello,
Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (bidirectional text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it.
See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.

This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.

I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.
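
To make the first two gotchas concrete, here is a minimal sketch. The function names are made up for illustration and are not taken from the PEP draft.

```python
def wants_yes_buggy(a, b):
    # Parsed as `a or (b == 'yes')`, not `(a == 'yes') or (b == 'yes')`,
    # so any non-empty string in `a` makes the result truthy.
    return a or b == 'yes'


def wants_yes(a, b):
    # What a non-expert reader probably meant.
    return a == 'yes' or b == 'yes'


def append_item(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


buggy_result = wants_yes_buggy('no', 'no')   # the truthy string 'no'
fixed_result = wants_yes('no', 'no')         # False

first = append_item(1)
second = append_item(2)   # same list object as `first`
```

In both cases the code reads plausibly to a human but Python does something else, with no Unicode involved at all.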

Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.


Thanks for the comments, everyone! I've updated the document and sent it to https://github.com/python/peps/pull/2129
A rendered version is at https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst



Toshio Kuratomi wrote:
  `Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.

Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and encodings -- that's much too technical for this document. Using the specific technical terms unfortunately doesn't help understanding, so I use the vague ones like "character" and "letter".)


Jim J. Jewett wrote:
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete 
Python statement."

Normally, an identifier must begin with a letter, and numbers can only be used in the 
second and subsequent positions.  (XID_CONTINUE instead of XID_START)  The fact that some 
characters with numeric values are considered letters (in this case, category Lo, Other 
Letters) is a different problem than just looking visually confusable with "+", 
and it should probably be listed on its own.

I'm not a native speaker, but as I understand it, "十" is closer to a single-letter word than a single-digit number. It translates better as "ten" than "10". (And it appears in "十四", "fourteen", just like "four" appears in "fourteen".)
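
For anyone who wants to check both points (letter category and numeric value) from Python itself, here is a small sketch using only the standard `unicodedata` module. The variable names are mine; the character is written as an escape so the snippet survives tools that mangle non-ASCII text.

```python
import unicodedata

ten = "\u5341"  # the character 十

# General category: "Lo" (Letter, other), which is why it can start an identifier.
category = unicodedata.category(ten)

# It still carries a numeric value in the Unicode database.
numeric_value = unicodedata.numeric(ten)

# It is numeric in the broad sense, but neither a digit nor a decimal.
flags = (ten.isnumeric(), ten.isdigit(), ten.isdecimal())

# The statement from the document: an assignment to a one-letter name.
namespace = {}
exec("\u5341= 10", namespace)
assigned = namespace[ten]
```

So the parser sees an ordinary name, whatever the glyph looks like to the reader.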


Patrick Schultz wrote:
- The Unicode consortium has a list of confusables, in case useful

Yup, and it's linked from the documents that describe how to use it. I link to those rather than just the list.
But thank you!


Terry Reedy wrote:
Bidirectional Text
------------------

Some scripts, such as Hebrew or Arabic, are written right-to-left.

[Suggested addition, subject to further revision.]

There are at least three levels of handling r2l chars: none, local (contiguous 
sequences are properly reversed), and extended (see below).  The handling 
depends on the display software and may depend on the quoting.  Tk and hence 
tkinter (and IDLE) text widgets do local handling.  Windows Notepad++ does local 
handling of unquoted code but extended handling of quoted text.  Windows 
Notepad currently does extended handling even without quotes.

I'd like to leave these details out of the document. The examples should render convincingly in browsers. The text should now describe the behavior even if you open it in an editor that does things differently, and acknowledge that such editors exist. (The behavior of specific editors/toolkits might well change in the future.)
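
Since rendering varies between tools, one way to inspect a file independently of the display is to look for the explicit directional formatting characters directly. This is only a sketch: the function name and the exact set of characters checked are my assumptions, not something the document prescribes.

```python
import unicodedata

# Assumption: explicit embeddings, overrides and isolates (U+202A..U+202E,
# U+2066..U+2069). Other direction-affecting characters are not covered.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(source):
    """Return (line number, column, character name) for each control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits


sample = 'x = "a\u202eb"\ny = 1\n'
sample_hits = find_bidi_controls(sample)
```

The result does not depend on how (or whether) the editor, terminal or mail client reorders the text.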

For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::

I don't see the connection between the text above and the example that follows.

    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]

Let me know if it's clear in the newest version, with this note:

Here, ``encoding: unicode_escape`` in the initial comment is an encoding
declaration. The ``unicode_escape`` encoding instructs Python to treat
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
a comma (punctuator), etc.
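
The decoding step itself can be tried outside of a source file. A minimal sketch, using the codec directly on bytes rather than through an encoding declaration (variable names are illustrative):

```python
# Raw bytes, as they would sit in the file: a backslash, "u", four hex digits.
quote = rb"\u0027".decode("unicode_escape")
comma = rb"\u002c".decode("unicode_escape")

# A fragment that looks like plain text inside a string to most tools,
# but turns into a quote, a comma and another quote after decoding.
hidden = rb"\u0027\u002c\u0027".decode("unicode_escape")
```

Tools that do not apply the declared encoding see only backslash sequences, while Python tokenizes the decoded characters.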


Steven D'Aprano wrote:

Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``

I'm not sure that "most" is justified here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O.

https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg

The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter and possibly the best-selling typewriter in history, the IBM Selectric (introduced in 1961).

http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery

Perhaps you should say "many older mechanical typewriters"?



Ah, interesting! I only ever saw and read about ones that have a bunch of accented letters, leaving no space for dedicated 0/1 keys :)
My typewriter looks like this: https://imgur.com/a/J34gqVZ

Bidirectional Text
------------------

The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right.

You might like to note that not all applications support bidirectional text.

It might be handled by your terminal rather than mutt.
I made the text work even if the examples don't render the way I'd like.


_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OB6C54HCBESUTANUVOTTIUI7N2IYDPQV/
Code of Conduct: http://python.org/psf/codeofconduct/
