[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

Petr Viktorin Wed, 03 Nov 2021 05:35:06 -0700

We seem to agree that this is work for linters. That's reasonable; I'd generalize it to "tools and policies". But even so, discussing what we'd expect linters to do is on topic here. Perhaps we can even find ways for the language to support linters -- type checking is also for external tools, but has language support.

For example: should the parser emit a lightweight audit event if it finds a non-ASCII identifier? (See below for why ASCII is special.)

Or for encoding declarations?
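To make the identifier case concrete, here is a rough sketch of what such reporting could look like if done by an external tool today. It only uses `tokenize`, `str.isascii()` and `sys.audit()`, which exist (the last since Python 3.8). The event name `example.non_ascii_identifier` and the function `report_non_ascii_identifiers` are made up for illustration; the parser emits no such event, and this does not cover encoding declarations.

```python
import io
import sys
import tokenize

# Hypothetical event name: CPython defines no such audit event.
EVENT = "example.non_ascii_identifier"


def report_non_ascii_identifiers(source):
    """Raise an audit event for every NAME token that is not pure ASCII.

    Returns a list of (identifier, line number) pairs for convenience.
    """
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            # Anyone who registered a hook with sys.addaudithook() sees this.
            sys.audit(EVENT, tok.string, tok.start[0])
            found.append((tok.string, tok.start[0]))
    return found
```

A linter or a site policy could then listen with `sys.addaudithook()` and decide for itself whether to warn, log, or ignore.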


On 03. 11. 21 6:26, Stephen J. Turnbull wrote:

Serhiy Storchaka writes:

  > All control characters except CR, LF, TAB and FF are banned outside
  > comments and string literals. I think it is worth to ban them in
  > comments and string literals too.

+1
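(A linter-side sketch of the control-character check, taking Serhiy's statement at face value: if the tokenizer already rejects these characters outside comments and string literals, scanning the whole file is enough to catch the remaining cases. Treating everything in Unicode category Cc as a "control character", which includes the C1 range, is an assumption here, and `find_control_characters` is a made-up name.)

```python
import unicodedata

# The four control characters the quoted message lists as permitted.
ALLOWED = {"\r", "\n", "\t", "\f"}


def find_control_characters(source):
    """Return (line, column, code point) for each disallowed control char.

    Lines are split on "\n" only, so other characters that str.splitlines()
    would treat as line breaks do not shift the reported line numbers.
    Lines are 1-based, columns 0-based.
    """
    hits = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line):
            if unicodedata.category(ch) == "Cc" and ch not in ALLOWED:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits
```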

  > > For homoglyphs/confusables, should there be a SyntaxWarning when an
  > > identifier looks like ASCII but isn't?
  >
  > It would virtually ban Cyrillic.

+1 (for the comment and for the implied -1 on SyntaxWarning, let's
keep the Cyrillic repertoire in Python!)


I don't think this would actually ban Cyrillic/Greek.

(My suggestion is not vanilla confusables detection; it might require careful reading: "should there be a [linter] warning when an identifier looks like ASCII but isn't?")

I am not a native speaker, but I did try a bit to find an actual ASCII-like word in a language that uses Cyrillic. I didn't succeed; I think they might be very rare. Even if there was such a word -- or a one-letter abbreviation used as a variable name -- it would be confusing to use. Removing the possibility of confusion could *help* Cyrillic users. (I can't speak for them; this is just a brainstorming idea.)
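To illustrate the difference from vanilla confusables detection, here is a minimal sketch of the "looks like ASCII but isn't" check. The `LOOKALIKES` table is a tiny hand-picked sample, not real data; an actual tool would presumably load the confusables data published with Unicode Technical Standard #39. The name `ascii_lookalike` is invented for this example.

```python
# Tiny hand-picked sample for illustration only.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def ascii_lookalike(identifier):
    """Return the ASCII string this identifier imitates, or None.

    Only identifiers in which *every* non-ASCII character has an ASCII
    twin are reported, so ordinary Cyrillic or Greek words pass.
    """
    if identifier.isascii():
        return None  # genuinely ASCII: nothing to report
    out = []
    for ch in identifier:
        if ch.isascii():
            out.append(ch)
        elif ch in LOOKALIKES:
            out.append(LOOKALIKES[ch])
        else:
            return None  # has a character with no ASCII twin
    return "".join(out)
```

With this rule, an identifier spelled `s\u0441ope` (Cyrillic ES in place of "c") is reported as imitating `scope`, while a word containing even one letter without an ASCII twin is left alone.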


Steven adds:

Let's not enshrine as a language "feature" that non Western European languages are dangerous second-class citizens.

That would be going too far, yes, but the fact is that non-English languages *are* second-class citizens. Code that uses Python keywords and stdlib must use English, and possibly another language. It is the mixing of languages that can be dangerous/confusing, not the languages themselves.


  > It is a work for linters,

+1

Aside from the reasons Serhiy presents, I'd rather not tie
this kind of rather ambiguous improvement in Unicode handling to the
release cycle.

It might be worth having a pep9999 module/script in Python (perhaps
more likely, PyPI but maintained by whoever does the work to make
these improvements + Petr or somebody Petr trusts to do it), that
lints scripts specifically for confusables and other issues.


If I have any say in it, the name definitely won't include a PEP number ;)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LB4O3YVDNVVNLYPMNH236QXGGUYG4BVI/
Code of Conduct: http://python.org/psf/codeofconduct/
