On Thursday, April 7, 2016 at 10:22:18 PM UTC+5:30, Peter Pearson wrote:
> On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote:
> > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:
> >> Rustom Mody wrote:
> >
> >>> So here are some examples to illustrate what I am saying:
> >>>
> >>> Example 1 -- Ligatures:
> >>>
> >>> Python3 gets it right
> >>>>>> flag = 1
> >>>>>> flag
> >>> 1
> [snip]
> >>
> >> I do not think this is correct, though. Different Unicode code sequences,
> >> after normalization, should result in different symbols.
> >
> > I think you are confused about normalisation. By definition, normalising
> > different Unicode code sequences may result in the same symbols, since that
> > is what normalisation means.
> >
> > Consider two distinct strings which nevertheless look identical:
> >
> > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
> > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
> > py> a == b
> > False
> > py> print(a, b)
> > ü ü
> >
> >
> > The purpose of normalisation is to turn one into the other:
> >
> > py> unicodedata.normalize('NFKC', a) == b # compose 2 code points --> 1
> > True
> > py> unicodedata.normalize('NFKD', b) == a # decompose 1 code point --> 2
> > True
>
> It's all great fun until someone loses an eye.
>
> Seriously, it's cute how neatly normalisation works when you're
> watching closely and using it in the circumstances for which it was
> intended, but that hardly proves that these practices won't cause much
> trouble when they're used more casually and nobody's watching closely.
> Considering how much energy good software engineers spend eschewing
> unnecessary complexity, do we really want to embrace the prospect of
> having different things look identical? (A relevant reference point:
> mixtures of spaces and tabs in Python indentation.)

That kind of sums up my position.

To be a casual user of unicode is one thing.
To support it is another -- unicode strings in python3 -- ok so far.
To mix up these two without enough thought or consideration is a third --
unicode identifiers are likely a security hole waiting to happen...

No, I am not clever/criminal enough to know how to write a text that is
visually close to
print "Hello World"
but is internally closer to
rm -rf /

For me this:
>>> Α = 1
>>> A = 2
>>> Α + 1 == A
True
>>>

is cure enough that I am not amused.
[The only reason I brought up case distinction is that this is in the same
direction and way worse than that]

If python had been more serious about embracing the brave new world of unicode
it should have looked in this direction:
http://blog.languager.org/2014/04/unicoded-python.html

Also, here I suggest a classification of unicode that, while not official or
even formalizable, is (I believe) helpful:
http://blog.languager.org/2015/03/whimsical-unicode.html

Specifically, as far as I am concerned, if python were to throw back, say, a
ligature in an identifier as a syntax error -- exactly what python2 does -- I
think it would be perfectly fine and a more sane choice.
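
To make the two cases above concrete, here is a rough sketch of how they differ under the hood. It relies on the fact that Python 3 normalises identifiers to NFKC while parsing (PEP 3131). The constant names and the describe() helper are made up for this illustration, and the look-alike characters are written as \N{...} escapes so that nothing in the listing depends on how your mail client renders them.

```python
import unicodedata

# Escapes keep the look-alike characters unambiguous in this listing.
LIGATURE_FLAG = "\N{LATIN SMALL LIGATURE FL}ag"   # renders like "flag"
GREEK_ALPHA = "\N{GREEK CAPITAL LETTER ALPHA}"    # renders like "A"
LATIN_A = "A"


def nfkc(name):
    """Return the NFKC form, which Python 3 applies to identifiers."""
    return unicodedata.normalize("NFKC", name)


def describe(name):
    """Hypothetical helper: list the Unicode name of each character."""
    return [unicodedata.name(ch) for ch in name]


# The ligature folds into plain ASCII, so both spellings name one variable.
ligature_folds = nfkc(LIGATURE_FLAG) == "flag"

# NFKC leaves Greek Alpha alone, so it stays a separate identifier.
alpha_folds = nfkc(GREEK_ALPHA) == LATIN_A

# Executing source text shows the same behaviour at the language level.
namespace = {}
exec(LIGATURE_FLAG + " = 1", namespace)
exec(GREEK_ALPHA + " = 1\n" + LATIN_A + " = 2", namespace)

print(ligature_folds, namespace["flag"])
print(alpha_folds, namespace[GREEK_ALPHA] + 1 == namespace[LATIN_A])
print(describe(GREEK_ALPHA), describe(LATIN_A))
```

So normalisation takes care of the ligature but does nothing about the Alpha/A pair. Catching that kind of confusable would need a separate check, for instance something along the lines of describe() run over every non-ASCII identifier in a source file.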