[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-29 Thread Christopher Barker
On Mon, Nov 29, 2021 at 1:21 AM Steve Holden wrote: > It's interesting that the egalitarian wish to allow use of native > "alphabetics" has turned out to be such a viper's nest. > Indeed. However, is there no way to restrict identifiers at least to the alphabets of natural languages? Maybe it

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-29 Thread Steve Holden
On Mon, Nov 15, 2021 at 8:42 AM Kyle Stanley wrote: > On Sat, Nov 13, 2021 at 5:04 PM wrote: > >> >> >> def 횑퓮햑풍표(): >> > [... Python code it's easy to believe isn't grammatical ...] > return ₛ >> > > 0_o color me impressed, I did not think that would be legal syntax. Would > be

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-18 Thread Eryk Sun
On 11/13/21, Terry Reedy wrote: > On 11/13/2021 4:35 PM, pt...@austin.rr.com wrote: >> >> _퓟Ⅼ햠홲험ℋ풪Lᴰ푬핽﹏핷피헡 = 12 >> >> def _픰ʰ퓸ʳ핥홚푛(픰, p푟픢fi햝핝횎푛, sᵤ푓헳헂푥헹ₑ횗): >> >> ˢ헸i헽 = 퐥e혯(햘) - pr횎햋퐢x헅ᵉ퓷 - 풔홪ffi혅헹홚ₙ >> >> if ski혱 > _퐏헟햠혊홴H핺L핯홀혙﹏L픈풩: >> >> 혴 = '%s[%d chars]%s' % (홨[:혱퐫핖푓핚xℓ풆핟], ₛ횔풊p,

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-16 Thread Stephen J. Turnbull
Executive summary: I guess the bottom line is that I'm sympathetic to both the NFC and NFKC positions. I think that wetware is such that people will go to the trouble of picking out a letter-like symbol from a palette rarely, and in my environment that's not going to happen at all because I use

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-16 Thread Jim J. Jewett
Steven D'Aprano wrote: > I think > that many editors in common use don't support bidirectional text, or at > least the ones I use don't seem to support it fully or correctly. ... > But, if there is a concrete threat beyond "it looks weird", that is > another issue. Based on the original post

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-16 Thread Jim J. Jewett
Stephen J. Turnbull wrote: > Christopher Barker writes: > > For example, in writing math we often use different scripts to mean > > different things (e.g. TeX's Blackboard Bold). So if I were to use > > some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't > > want them to get

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-16 Thread Jim J. Jewett
Compatibility variants can look different, but they can also look identical. Allowing any non-ASCII characters was worrisome because of the security implications of confusables. Squashing compatibility characters seemed the more conservative choice at the time. Stestagg's example: е =

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Chris Angelico
On Tue, Nov 16, 2021 at 12:13 PM Steven D'Aprano wrote: > > On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote: > > > The problems here are not Python's, they are code reviewers', and that > > means they're really attacks against the code review tools. > > I think that's a bit strong.

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Steven D'Aprano
On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote: > The problems here are not Python's, they are code reviewers', and that > means they're really attacks against the code review tools. I think that's a bit strong. Boucher and Anderson's paper describes multiple kinds of

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Steven D'Aprano
On Mon, Nov 15, 2021 at 03:20:26PM +0400, Abdur-Rahmaan Janhangeer wrote: > Well, it's not so obvious. From Ross Anderson and Nicholas Boucher > src: https://trojansource.codes/trojan-source.pdf Thanks for the link. But it discusses a whole range of Unicode attacks, and the specific attack you

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Steven D'Aprano
On Mon, Nov 15, 2021 at 12:28:01PM -0500, Terry Reedy wrote: > On 11/15/2021 5:45 AM, Steven D'Aprano wrote: > > >In another thread, Serhiy already suggested we ban invisible control > >characters (other than whitespace) in comments and strings. > > He said in string *literals*. One would put

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Abdur-Rahmaan Janhangeer
> GitHub specifically flags it as a possible exploit in a couple of cases, but also syntax highlights the return keyword appropriately. My guess is that GitHub did patch it afterwards as the paper does list GitHub as vulnerable > Uhhm. "weird unicode stuffs"? Please clarify. Wriggly texts

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Terry Reedy
On 11/15/2021 5:45 AM, Steven D'Aprano wrote: In another thread, Serhiy already suggested we ban invisible control characters (other than whitespace) in comments and strings. He said in string *literals*. One would put them in strings by using visible escape sequences. >>> '\033' is

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Stephen J. Turnbull
Abdur-Rahmaan Janhangeer writes: > As a programmer, i don't want a language which bans unicode stuffs. But that's what Unicode says should be done (see below). > If there's something that should be fixed, it's the unicode standard, Unicode is not going to get "fixed". Most features are

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Marc-Andre Lemburg
On 15.11.2021 12:36, Steven D'Aprano wrote: > On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote: > >> I am, however, surprised and disappointed by the NFKC normalization. >> >> For example, in writing math we often use different scripts to mean >> different things (e.g. TeX's

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Chris Angelico
On Mon, Nov 15, 2021 at 10:22 PM Abdur-Rahmaan Janhangeer wrote: > > Greetings, > > > > Now what happens? where do you go from there to a vulnerability or > backdoor? I think it might be a bit obvious that there is something > funny going on if I see: > > if (user.admin == "root" and

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Steven D'Aprano
On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote: > I am, however, surprised and disappointed by the NFKC normalization. > > For example, in writing math we often use different scripts to mean > different things (e.g. TeX's Blackboard Bold). So if I were to use > some of the

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Abdur-Rahmaan Janhangeer
Greetings, > Now what happens? where do you go from there to a vulnerability or backdoor? I think it might be a bit obvious that there is something funny going on if I see: if (user.admin == "root" and check_password_securely() or user.admin == "root" # Second string

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Steven D'Aprano
On Mon, Nov 15, 2021 at 12:33:54PM +0400, Abdur-Rahmaan Janhangeer wrote: > Yet another issue is adding vulnerabilities in plain sight. > Human code reviewers will see this: > > if user.admin == "something": > > Static analysers will see > > if user.admin == "something": Okay, you have a

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Petr Viktorin
On 15. 11. 21 9:25, Stephen J. Turnbull wrote: Christopher Barker writes: > Would a proposal to switch the normalization to NFC only have any hope of > being accepted? Hope, yes. Counting you, it's been proposed twice. :-) I don't know whether it would get through. We know this won't

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Kyle Stanley
On Sat, Nov 13, 2021 at 5:04 PM wrote: > > > def 횑퓮햑풍표(): > > try: > > 픥e헅핝횘︴ = "Hello" > > 함픬r퓵ᵈ﹎ = "World" > > ᵖ햗퐢혯퓽(f"{헵e퓵픩º_}, {햜ₒ풓lⅆ︴}!") > > except 퓣핪ᵖe햤헿ᵣ햔횛 as ⅇ헑c: > > 풑rℹₙₜ("failed: {}".핗헼ʳᵐªt(ᵉ퐱퓬)) > > > > if _︴ⁿ퓪푚핖__ == "__main__": > >

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Abdur-Rahmaan Janhangeer
Well, Yet another issue is adding vulnerabilities in plain sight. Human code reviewers will see this: if user.admin == "something": Static analysers will see if user.admin == "something": but will not flag it as it's up to the user to verify the logic of things and as such soft authors can

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-15 Thread Stephen J. Turnbull
Christopher Barker writes: > Would a proposal to switch the normalization to NFC only have any hope of > being accepted? Hope, yes. Counting you, it's been proposed twice. :-) I don't know whether it would get through. We know this won't affect the stdlib, since that's restricted to ASCII.

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Christopher Barker
On Sun, Nov 14, 2021 at 4:53 PM Steven D'Aprano wrote: > Out of all the approximately thousand bazillion ways to write obfuscated > Python code, which may or may not be malicious, why are Unicode > confusables worth this level of angst and concern? > I for one am not full of angst nor

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Steven D'Aprano
Out of all the approximately thousand bazillion ways to write obfuscated Python code, which may or may not be malicious, why are Unicode confusables worth this level of angst and concern? I looked up "Unicode homoglyph" on CVE, and found a grand total of seven hits:

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Richard Damon
On 11/14/21 2:36 PM, David Mertz, Ph.D. wrote: On Sun, Nov 14, 2021, 2:14 PM Christopher Barker It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}", which are different

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread David Mertz, Ph.D.
On Sun, Nov 14, 2021, 2:14 PM Christopher Barker > It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER >> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}", >> which are different ways of writing the same thing. >> > > Why does someone that wants to use, e.g.

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Richard Damon
On 11/14/21 2:07 PM, Christopher Barker wrote: Why does someone that wants to use, e.g. "é" in an identifier have to be able to represent it two different ways in a code file? The issue here is that fundamentally, some editors will produce composed characters and some decomposed characters
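
The composed/decomposed point above can be checked with the standard `unicodedata` module. This is only a small sketch, not anything posted in the thread: the variable names are made up for illustration, and the `exec` call is just a convenient way to observe how the interpreter stores an identifier.

```python
import unicodedata

# Two ways of writing the same visible letter.
composed = "\N{LATIN SMALL LETTER E WITH ACUTE}"
decomposed = "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}"

# As raw strings they differ (one code point versus two).
same_raw = composed == decomposed

# After canonical composition (NFC) they are identical.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Python normalizes identifiers while parsing, so a name typed in the
# decomposed form ends up stored under the composed form.
namespace = {}
exec(f"{decomposed} = 1", namespace)
stored_as_composed = composed in namespace

print(same_raw, same_nfc, stored_as_composed)
```

Without some normalization step, two editors that emit different forms of "é" would silently create two different names.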

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Daniel Pope
On Sun, 14 Nov 2021, 19:07 Christopher Barker, wrote: > On Sun, Nov 14, 2021 at 10:27 AM MRAB wrote: > >> Unfortunately, it goes too far, because it's unlikely that we want "ᵖ" >> ("\N{MODIFIER LETTER SMALL P}') to be equivalent to "P" ("\N{LATIN >> CAPITAL LETTER P}". >> > > Is it possible to

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Christopher Barker
On Sun, Nov 14, 2021 at 10:27 AM MRAB wrote: > > So why does Python apply NFKC normalization to variable names?? > It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER > E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}", > which are different ways of

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Alex Martelli via Python-Dev
Indeed, normative annex https://www.unicode.org/reports/tr31/tr31-35.html section 5 says: "if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate" (vs NFKC for a language with case-insensitive identifiers) so to follow the standard we should have used

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread MRAB
On 2021-11-14 17:17, Christopher Barker wrote: On Sat, Nov 13, 2021 at 2:03 PM > wrote: def 횑퓮햑풍표(): __     try: 픥e헅핝횘︴ = "Hello" 함픬r퓵ᵈ﹎ = "World"     ᵖ햗퐢혯퓽(f"{헵e퓵픩º_}, {햜ₒ풓lⅆ︴}!")     except 퓣핪ᵖe햤헿ᵣ햔횛 as

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Jim J. Jewett
ptmcg@austin.rr.com wrote: > ... add a cautionary section on homoglyphs, specifically citing > “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) > as an example problem pair. There is a unicode tech report about confusables, but it is never clear where to stop. Are I (upper

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-14 Thread Christopher Barker
On Sat, Nov 13, 2021 at 2:03 PM wrote: > def 횑퓮햑풍표(): > > try: > > 픥e헅핝횘︴ = "Hello" > > 함픬r퓵ᵈ﹎ = "World" > > ᵖ햗퐢혯퓽(f"{헵e퓵픩º_}, {햜ₒ풓lⅆ︴}!") > > except 퓣핪ᵖe햤헿ᵣ햔횛 as ⅇ헑c: > > 풑rℹₙₜ("failed: {}".핗헼ʳᵐªt(ᵉ퐱퓬)) > Wow. Just Wow. So why does Python apply NFKC
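
For readers who want to see the NFKC behaviour being discussed, here is a minimal sketch using only the standard library. The choice of characters (a mathematical bold "x" and the modifier letter "ᵖ" mentioned elsewhere in the thread) and the variable names are illustrative assumptions.

```python
import unicodedata

bold_x = unicodedata.lookup("MATHEMATICAL BOLD SMALL X")
modifier_p = unicodedata.lookup("MODIFIER LETTER SMALL P")

# NFKC applies compatibility mappings, so "font variants" collapse
# onto the plain letter; NFC leaves them alone.
nfkc_x = unicodedata.normalize("NFKC", bold_x)
nfc_x = unicodedata.normalize("NFC", bold_x)
nfkc_p = unicodedata.normalize("NFKC", modifier_p)

# Because the parser normalizes identifiers with NFKC, assigning to the
# bold name actually binds the plain ASCII name.
namespace = {}
exec(f"{bold_x} = 42", namespace)

print(repr(nfkc_x), nfc_x == bold_x, repr(nfkc_p), namespace["x"])
```

This is why the mixed-font program quoted above still runs: every styled identifier is folded to its ordinary spelling before name lookup. Under NFC alone, the bold and plain names would stay distinct.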

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-13 Thread Terry Reedy
On 11/13/2021 4:35 PM, pt...@austin.rr.com wrote: I’ve not been following the thread, but Steve Holden forwarded me the To explore the extreme case, I wrote a pyparsing transformer to convert identifiers in a body of Python source to mixed font, equivalent to the original source after NFKC

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-13 Thread Stestagg
This is my favourite version of the issue: е = lambda е, e: е if е > e else e print(е(2, 1), е(1, 2)) # python 3 outputs: 2 2 https://twitter.com/stestagg/status/685239650064162820?s=21 Steve On Sat, 13 Nov 2021 at 22:05, wrote: > I’ve not been following the thread, but Steve Holden
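
The look-alike trick in this message can be made visible by asking `unicodedata` for character names. The sketch below rebuilds the one-liner from explicit escapes; which positions hold the Cyrillic letter is an assumption, chosen because it reproduces the reported output of `2 2`.

```python
import unicodedata

latin_e = "e"
cyrillic_e = "\N{CYRILLIC SMALL LETTER IE}"

# The two letters render alike but have different names and code points.
names = [unicodedata.name(ch) for ch in (latin_e, cyrillic_e)]

# Normalization does not help here: the Cyrillic letter is not a
# compatibility variant of the Latin one, so NFKC leaves it unchanged.
survives_nfkc = unicodedata.normalize("NFKC", cyrillic_e) == cyrillic_e

# Assumed reconstruction of the one-liner: the function and its first
# parameter use the Cyrillic letter, the second parameter the Latin one.
source = (
    f"{cyrillic_e} = lambda {cyrillic_e}, {latin_e}: "
    f"{cyrillic_e} if {cyrillic_e} > {latin_e} else {latin_e}"
)
namespace = {}
exec(source, namespace)
pick = namespace[cyrillic_e]
result = (pick(2, 1), pick(1, 2))

print(names, survives_nfkc, result)
```

With that reconstruction the lambda is simply `max` of two distinct parameters, which is why both calls return 2.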

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-03 Thread Jim J. Jewett
Stephen J. Turnbull wrote: > Jim J. Jewett writes: > > At the time, we considered it, and we also considered a narrower > > restriction on using multiple scripts in the same identifier, or at > > least the same identifier portion (so it was OK if separated by > > _). > > This would ban "παν語",

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-03 Thread Chris Jerdonek
On Tue, Nov 2, 2021 at 7:21 AM Petr Viktorin wrote: > That brings us to possible changes in Python in this area, which is an > interesting topic. Is there a use case or need for allowing the comment-starting character “#” to occur when text is still in the right-to-left direction? Disallowing

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-03 Thread Serhiy Storchaka
03.11.21 14:31, Petr Viktorin пише: > For example: should the parser emit a lightweight audit event if it > finds a non-ASCII identifier? (See below for why ASCII is special.) > Or for encoding declarations? There are audit events for import and compile. You can also register import hooks if you

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-03 Thread Petr Viktorin
We seem to agree that this is work for linters. That's reasonable; I'd generalize it to "tools and policies". But even so, discussing what we'd expect linters to do is on topic here. Perhaps we can even find ways for the language to support linters -- type checking is also for external tools,

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-03 Thread Steven D'Aprano
On Tue, Nov 02, 2021 at 05:55:55PM +0200, Serhiy Storchaka wrote: > All control characters except CR, LF, TAB and FF are banned outside > comments and string literals. I think it is worth to ban them in > comments and string literals too. In string literals you can use > backslash-escape

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-03 Thread Serhiy Storchaka
02.11.21 18:49, Jim J. Jewett пише: > If escape sequences were also allowed in comments (or at least in strings > within comments), this would make sense. I don't like banning them > otherwise, since odd characters are often a good reason to need a comment, > but it is definitely a "mention,

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Stephen J. Turnbull
Jim J. Jewett writes: > At the time, we considered it, and we also considered a narrower > restriction on using multiple scripts in the same identifier, or at > least the same identifier portion (so it was OK if separated by > _). This would ban "παν語", aka "pango". That's arguably a good
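
A rough sketch of the "one script per identifier portion" idea follows. The standard library does not expose the Unicode Script property, so this uses the first word of each character's name as a crude stand-in. That heuristic, and the helper names `rough_scripts` and `mixes_scripts`, are assumptions for illustration, not a proposal from the thread.

```python
import unicodedata

def rough_scripts(text):
    """Approximate the set of scripts in text via Unicode character names."""
    scripts = set()
    for ch in text:
        if ch in "_0123456789":
            continue  # underscores and ASCII digits are script-neutral here
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def mixes_scripts(identifier):
    """True if any underscore-separated portion uses more than one script."""
    return any(len(rough_scripts(part)) > 1 for part in identifier.split("_"))

print(rough_scripts("παν語"))   # two entries: GREEK and CJK
print(mixes_scripts("παν語"))   # flagged under the narrower rule
print(mixes_scripts("παν_語"))  # allowed: each portion is single-script
print(mixes_scripts("pango"))
```

A real checker would need proper script data (for example from the Unicode `Scripts.txt` file), since character names are not a reliable script indicator.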

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Stephen J. Turnbull
Serhiy Storchaka writes: > All control characters except CR, LF, TAB and FF are banned outside > comments and string literals. I think it is worth to ban them in > comments and string literals too. +1 > > For homoglyphs/confusables, should there be a SyntaxWarning when an > > identifier

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Jim J. Jewett
Serhiy Storchaka wrote: > 02.11.21 16:16, Petr Viktorin пише: > > As for \0, can we ban all ASCII & C1 control characters except > > whitespace? I see no place for them in source code. > All control characters except CR, LF, TAB and FF are banned outside > comments and string literals. I think it

[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Serhiy Storchaka
02.11.21 16:16, Petr Viktorin пише: > As for \0, can we ban all ASCII & C1 control characters except > whitespace? I see no place for them in source code. All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth to ban them in comments
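
As a linter-style illustration of the control-character concern, the sketch below scans source text for characters in the Unicode categories Cc (control) and Cf (format, which includes the bidirectional overrides), skipping CR, LF, TAB and FF. The function name `find_control_chars` and the sample text are hypothetical, and the check is deliberately blunt: Cf also covers characters such as ZERO WIDTH JOINER that are legitimate in some scripts.

```python
import unicodedata

ALLOWED = {"\r", "\n", "\t", "\f"}

def find_control_chars(source):
    """Return (line, column, code point) for each raw control/format char."""
    hits = []
    # Split on "\n" only; str.splitlines() would also break on form feeds
    # and other separators, which would skew the line numbers.
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ALLOWED:
                continue
            if unicodedata.category(ch) in ("Cc", "Cf"):
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

# Sample text containing a raw RIGHT-TO-LEFT OVERRIDE and a raw ESC.
sample = 'x = 1  # ok\ny = "a\u202eb"\nz = "\x1b[0m"\n'
print(find_control_chars(sample))
```

Written with visible escapes in the string literal instead (for example `"\x1b[0m"` typed as backslash sequences in the file), the same values would not be flagged, which is the workaround suggested in the thread.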