Indeed, the normative annex UAX #31 (https://www.unicode.org/reports/tr31/tr31-35.html), section 5, says: "if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate" (vs. NFKC for a language with case-insensitive identifiers). Python's identifiers are case-sensitive, so to follow the standard we should have used NFC rather than NFKC. I'm not sure whether it's too late to fix this "oops" in future Python versions.
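To make the NFC/NFKC difference concrete, here is a minimal sketch. The helper `to_math_bold` and the name `value` are made up for illustration; the only library calls are `unicodedata.normalize` and `exec`. It builds an identifier out of MATHEMATICAL BOLD SMALL letters and compares what each normalization form does to it.

```python
import unicodedata


def to_math_bold(word: str) -> str:
    """Map ASCII lowercase letters to MATHEMATICAL BOLD SMALL letters (U+1D41A onward).

    Hypothetical helper, used only to build a look-alike identifier.
    """
    return "".join(chr(0x1D41A + ord(c) - ord("a")) for c in word)


fancy = to_math_bold("value")  # renders like "value" but shares no code points with it

nfc = unicodedata.normalize("NFC", fancy)
nfkc = unicodedata.normalize("NFKC", fancy)

# NFC leaves compatibility characters alone; NFKC folds them to ASCII.
print(nfc == fancy, nfkc == "value")

# The compiler applies NFKC to identifiers, so this assignment binds the
# plain ASCII name "value" rather than the look-alike spelling.
namespace = {}
exec(f"{fancy} = 42", namespace)
print("value" in namespace, fancy in namespace)
```

Under NFC the look-alike spelling would stay distinct from the ASCII one, which is the behaviour the annex recommends for case-sensitive languages.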
Alex

On Sun, Nov 14, 2021 at 9:17 AM Christopher Barker <python...@gmail.com> wrote:

> On Sat, Nov 13, 2021 at 2:03 PM <pt...@austin.rr.com> wrote:
>
>> def ๐๐ฎ๐๐๐():
>>     try:
>>         ๐ฅe๐ ๐๐๏ธด = "Hello"
>>         ๐จ๐ฌr๐ตแต๏น = "World"
>>         แต๐๐ข๐ฏ๐ฝ(f"{๐ต๏ฝ ๐ต๐ฉยบ_}, {๐โ๐lโ ๏ธด}!")
>>     except ๐ฃ๐ชแต๏ฝ ๐ค๐ฟแตฃ๐๐ as โ ๐c:
>>         ๐rโนโโ("failed: {}".๐๐ผสณแตยช๏ฝ(แต๐ฑ๐ฌ))
>>
>
> Wow. Just Wow.
>
> So why does Python apply NFKC normalization to variable names?? I can't
> for the life of me figure out why that would be helpful at all.
>
> The string methods, sure, but names?
>
> And, in fact, the normalization is not used for string comparisons or
> hashes as far as I can tell.
>
> In [36]: weird
> Out[36]: 'แต๐๐ข๐ฏ๐ฝ'
>
> In [37]: normal
> Out[37]: 'print'
>
> In [38]: eval(weird + "('yup, that worked')")
> yup, that worked
>
> In [39]: weird == normal
> Out[39]: False
>
> In [40]: weird[0] in normal
> Out[40]: False
>
> This seems very odd (and dangerous) to me.
>
> Is there a good reason? and is it too late to change it?
>
> -CHB
>
>>
>> if _๏ธดโฟ๐ช๐๐__ == "__main__":
>>     ๐eโหก๐()
>>
>>
>> # snippet from unittest/util.py
>> _๐โ ฌ๐ ๐ฒ๐โ๐ชLแดฐ๐ฌ๐ฝ๏น๐ท๐ผ๐ก = 12
>>
>> def _๐ฐสฐ๐ธสณ๐ฅ๐๐(๐ฐ, p๐๐ข๏ฌ๐๐๐๐, ๏ฝแตค๐๐ณ๐๐ฅ๐นโ๐):
>>     หข๐ธ๏ฝ๐ฝ = ๐ฅ๏ฝ ๐ฏ(๐) - ๏ฝr๐๐๐ขx๐ แต๐ท - ๐๐ช๏ฌ๏ฝ๐ ๐น๐โ
>>     if s๏ฝi๐ฑ > _๐๐๐ ๐๐ดH๐บ๏ผฌ๐ฏ๐๐๏นL๐๐ฉ:
>>         ๐ด = '%s[%d chars]%s' % (๐จ[:๐ฑ๐ซ๐๐๐๏ฝโ๐๐], โ๐๐p, ๐ผ[๐๐๐(๐) - ๐จ๐๐๏ฌx๐กแต๐ฏ:])
>>     return โ
>>
>>
>> You should be able to paste these into your local UTF-8-aware editor or IDE
>> and execute them as-is.
>>
>> (If this doesn't come through, you can also see this as a GitHub gist at
>> Hello, World rendered in a variety of Unicode characters (github.com)
>> <https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466>. I have
>> a second gist containing the transformer, but it is still a private gist
>> atm.)
>>
>> Some other discoveries:
>>
>> "·" (U+00B7 MIDDLE DOT, code point 183) is a valid identifier body
>> character, making "_···" a valid Python identifier. This could actually be
>> another security attack point, in which "s·join('x')" could be easily
>> misread as "s.join('x')", but would actually be a call to a potentially
>> malicious method "s·join".
>>
>> "_" seems to be a special case for normalization. Only the ASCII "_"
>> character is valid as a leading identifier character; the Unicode
>> characters that normalize to "_" (any of the characters in "︳︴﹍﹎﹏＿") can
>> only be used as identifier body characters. "︳" especially could be
>> misread as "|" followed by a space, when it actually normalizes to "_".
>>
>>
>> Potential beneficial uses:
>>
>> I am considering taking my transformer code and experimenting with an
>> orthogonal approach to syntax highlighting, using Unicode groups instead of
>> colors. Module names using characters from one group, builtins from
>> another, program variables from another, maybe distinguish local from
>> global variables. Colorizing has always been an obvious syntax highlight
>> feature, but is an accessibility issue for those with difficulty
>> distinguishing colors. Unlike the "ransom note" code above, code
>> highlighted in this way might even be quite pleasing to the eye.
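The identifier-character discoveries quoted above can be checked without running any suspicious code. This is a rough sketch using only `str.isidentifier()` and `unicodedata.normalize`; the constant names are mine, and the test strings `"s·join"` and `"x"` are just examples.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"  # U+00B7, code point 183 (Latin-1 range, outside ASCII)
# The six underscore look-alikes listed in the quoted message.
LOOKALIKES = "\ufe33\ufe34\ufe4d\ufe4e\ufe4f\uff3f"

# Middle dot is accepted inside an identifier, but not as its first character.
print(("s" + MIDDLE_DOT + "join").isidentifier())
print((MIDDLE_DOT + "join").isidentifier())

for ch in LOOKALIKES:
    folded = unicodedata.normalize("NFKC", ch)  # what the compiler would see
    as_body = ("x" + ch).isidentifier()         # look-alike in the identifier body
    as_start = (ch + "x").isidentifier()        # look-alike as the leading character
    print(f"U+{ord(ch):04X} -> {folded!r} body={as_body} start={as_start}")
```

Note that `str.isidentifier()` checks the string as given and does not normalize it first, which is why the look-alikes are rejected in leading position even though each one folds to "_" under NFKC.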
>>
>> -- Paul McGuire
>>
>> _______________________________________________
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-le...@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-dev@python.org/message/GBLXJ2ZTIMLBD2MJQ4VDNUKFFTPPIIMO/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
> --
> Christopher Barker, PhD (Chris)
>
> Python Language Consulting
>   - Teaching
>   - Scientific Software Development
>   - Desktop GUI and Web Development
>   - wxPython, numpy, scipy, Cython

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/U3DJOQKMREWY35SHCRSD6V6FQA2T3SW7/
Code of Conduct: http://python.org/psf/codeofconduct/