[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

Petr Viktorin Tue, 02 Nov 2021 07:06:37 -0700

On 01. 11. 21 13:17, Petr Viktorin wrote:

Hello,
Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (left-to-right text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it.
See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.
I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
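
A minimal sketch of the first two gotchas named above, for readers who have not met them. The variable and function names (`a`, `b`, `append_item`, `bucket`) are made up for illustration and do not come from the report or the PEP.

```python
# Gotcha 1: `a or b == 'yes'` parses as `a or (b == 'yes')`,
# not as `(a == 'yes') or (b == 'yes')`.
a, b = "no", "no"
naive = bool(a or b == "yes")          # truthy, because the non-empty string `a` is truthy
intended = a == "yes" or b == "yes"    # what a non-expert reader probably meant


# Gotcha 2: a mutable default argument is created once, when the
# function is defined, and then shared between calls.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket


first = append_item(1)
second = append_item(2)   # reuses the same list object as the first call

print(naive, intended)    # True False
print(first is second)    # True
print(second)             # [1, 2]
```
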

Thanks for the comments, everyone! I've updated the document and sent it to https://github.com/python/peps/pull/2129
A rendered version is at https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst




Toshio Kuratomi wrote:

  `Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.


Thanks! That's a nice summary; I condensed it a bit more and used it.

(I'm not joining the conversation on glyphs, characters, codepoints and encodings -- that's much too technical for this document. Using the specific technical terms unfortunately doesn't help understanding, so I use the vague ones like "character" and "letter".)



Jim J. Jewett wrote:

"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete 
Python statement."


Normally, an identifier must begin with a letter, and numbers can only be used in the 
second and subsequent positions.  (XID_CONTINUE instead of XID_START)  The fact that some 
characters with numeric values are considered letters (in this case, category Lo, Other 
Letters) is a different problem than just looking visually confusable with "+", 
and it should probably be listed on its own.

I'm not a native speaker, but as I understand it, "十" is closer to a single-letter word than a single-digit number. It translates better as "ten" than "10". (And it appears in "十四", "fourteen", just like "four" appears in "fourteen".)
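
For anyone who wants to check how Python itself classifies the character, here is a small sketch using only the standard `unicodedata` module. The variable names are arbitrary; the quoted statement is run through `exec` only so that it can sit in an ASCII-friendly example.

```python
import unicodedata

ten = "十"  # U+5341

category = unicodedata.category(ten)        # general category, "Lo" (Other Letter)
numeric_value = unicodedata.numeric(ten)    # Unicode numeric value of the character
is_identifier = ten.isidentifier()          # usable as a complete identifier
is_decimal = ten.isdecimal()                # not a decimal digit, so int("十") fails

# The statement quoted from the document really is an assignment.
namespace = {}
exec("十= 10", namespace)

print(category, numeric_value, is_identifier, is_decimal)
print(namespace["十"])
```

So the character has a numeric value, but it is a letter as far as the identifier rules are concerned, which matches both points made above.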



Patrick Schultz wrote:

- The Unicode consortium has a list of confusables, in case useful

Yup, and it's linked from the documents that describe how to use it. I link to those rather than just the list.

But thank you!
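
As a rough illustration of what "confusable" means in practice (this does not use the Unicode consortium's confusables data, only the standard `unicodedata` module), two identifiers can look the same in many fonts and still be different names to Python. The choice of Latin "a" versus Cyrillic "а" is just an example.

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, renders like Latin "a" in many fonts

names = (unicodedata.name(latin_a), unicodedata.name(cyrillic_a))

# Both are valid identifiers, and they name two different variables.
namespace = {}
exec(f"{latin_a} = 1\n{cyrillic_a} = 2", namespace)

print(names)
print(latin_a == cyrillic_a)                        # False
print(namespace[latin_a], namespace[cyrillic_a])    # 1 2
```
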


Terry Reedy wrote:

Bidirectional Text
------------------

Some scripts, such as Hebrew or Arabic, are written right-to-left.


[Suggested addition, subject to further revision.]

There are at least three levels of handling r2l chars: none, local (contiguous 
sequences are properly reversed), and extended (see below).  The handling 
depends on the display software and may depend on the quoting.  Tk and hence 
tkinter (and IDLE) text widgets do local handling.  Windows Notepad++ does local 
handling of unquoted code but extended handling of quoted text.  Windows 
Notepad currently does extended handling even without quotes.

I'd like to leave these details out of the document. The examples should render convincingly in browsers. The text should now describe the behavior even if you open it in an editor that does things differently, and acknowledge that such editors exist. (The behavior of specific editors/toolkits might well change in the future.)
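
Whatever a given editor displays, the stored order of the characters is what Python parses. The sketch below, which is an illustration and not something taken from the PEP, prints the bidirectional class of each character in storage order and looks for explicit bidirectional control characters. `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical names introduced here.

```python
import unicodedata

# 'a', HEBREW LETTER ALEF, HEBREW LETTER BET, 'b', in storage (logical) order.
text = "a\u05d0\u05d1b"
classes = [unicodedata.bidirectional(ch) for ch in text]

# Bidirectional classes of the explicit embedding, override and isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(source):
    """Return (index, character name) for each explicit bidi control in source."""
    return [
        (index, unicodedata.name(ch))
        for index, ch in enumerate(source)
        if unicodedata.bidirectional(ch) in BIDI_CONTROLS
    ]


print(classes)
print(find_bidi_controls("x = 'a\u202eb'"))
```

A check like this works the same in a terminal, in mutt, or in an editor with no bidirectional support, because it never depends on rendering.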

For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::


I don't see the connection between the text above and the example that follows.

    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.

[etc]


Let me know if it's clear in the newest version, with this note:

Here, ``encoding: unicode_escape`` in the initial comment is an encoding
declaration. The ``unicode_escape`` encoding instructs Python to treat
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
a comma (punctuator), etc.
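
To make the note concrete, here is a small sketch of what the codec does to one line of source. It only calls `bytes.decode("unicode_escape")` on a byte string and does not go through the full source-file loading machinery; `raw_line` and its contents are invented for the illustration.

```python
# One line of source as raw bytes. Read naively, the list holds a single
# string that happens to contain some \u escapes.
raw_line = rb'values = ["a\u0022\u002c \u0022b"]'

# What the unicode_escape codec turns those bytes into:
# \u0022 becomes a double quote and \u002c becomes a comma.
decoded_line = raw_line.decode("unicode_escape")

# Executing the decoded text yields two strings.
as_unicode_escape = {}
exec(decoded_line, as_unicode_escape)

# Executing the same bytes as ordinary source yields one string,
# because the escapes are resolved inside the string literal instead.
as_plain_source = {}
exec(raw_line.decode("ascii"), as_plain_source)

print(decoded_line)
print(as_unicode_escape["values"])
print(as_plain_source["values"])
```

The difference between the two results is the gap that tools unaware of the encoding declaration would miss.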



Steven D'Aprano wrote:

Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``
I'm not sure that "most" is justified here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O.
https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg
The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best selling typewriter in history, the IBM Selectric (introduced in 1961).
http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery

Perhaps you should say "many older mechanical typewriters"?

Ah, interesting! I only ever saw and read about ones that have a bunch of accented letters, leaving no space for dedicated 0/1 keys :)

My typewriter looks like this: https://imgur.com/a/J34gqVZ

Bidirectional Text
------------------
The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right.
You might like to note that not all applications support bidirectional text.


It might be handled by your terminal rather than mutt.
I made the text work even if the examples don't render the way I'd like.


_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OB6C54HCBESUTANUVOTTIUI7N2IYDPQV/
Code of Conduct: http://python.org/psf/codeofconduct/
