On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:

> Rustom Mody wrote:
>> So here are some examples to illustrate what I am saying:
>>
>> Example 1 -- Ligatures:
>>
>> Python3 gets it right
>> >>> ﬂag = 1
>> >>> flag
>> 1

Python identifiers are intentionally normalised to reduce security issues,
or at least confusion and annoyance, due to visually-identical identifiers
being treated as different.

Unicode has technical standards dealing with identifiers:

http://www.unicode.org/reports/tr31/

and visual spoofing and confusables:

http://www.unicode.org/reports/tr39/

I don't believe that CPython goes to the full extreme of checking for mixed
script confusables, but it does partially mitigate the problem by
normalising identifiers.

Unfortunately PEP 3131 leaves a number of questions open. Presumably they
were answered in the implementation, but they aren't documented in the PEP.

https://www.python.org/dev/peps/pep-3131/


> Fascinating; confirmed with
>
> | $ python3
> | Python 3.4.4 (default, Jan 5 2016, 15:35:18)
> | [GCC 5.3.1 20160101] on linux
> | […]
>
> I do not think this is correct, though. Different Unicode code sequences,
> after normalization, should result in different symbols.

I think you are confused about normalisation. By definition, normalising
different Unicode code sequences may result in the same symbols, since that
is what normalisation means.

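For anyone who wants to check the identifier behaviour without trusting
their mail client to preserve the ligature, here is a minimal sketch. The
names ligature_name, plain_name and namespace are mine, chosen only for
illustration, and I am assuming a reasonably recent CPython 3. The ligature
is spelled with an explicit \N{...} escape and the assignment is run through
exec(), so nothing depends on how the source file itself is encoded:

```python
import unicodedata

# Spell the ligature with an explicit escape so that an editor or mail
# client cannot silently replace it with plain "fl".
ligature_name = "\N{LATIN SMALL LIGATURE FL}ag"  # 3 code points
plain_name = "flag"                              # 4 code points

# As strings, the two spellings are different.
strings_equal = (ligature_name == plain_name)

# As identifiers, they refer to the same variable: assigning through the
# ligature spelling binds the plain-ASCII name in the namespace.
namespace = {}
exec(ligature_name + " = 1", namespace)

# exec() also adds "__builtins__" to the dict, so filter out dunder keys.
bound_names = sorted(k for k in namespace if not k.startswith("__"))

# NFKC normalisation maps the ligature spelling to the plain one.
nfkc_matches = unicodedata.normalize("NFKC", ligature_name) == plain_name

print(strings_equal)  # False
print(bound_names)    # ['flag']
print(nfkc_matches)   # True
```

The only name that ends up bound is the plain "flag", which is consistent
with identifiers being converted to NFKC before they are stored.
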
Consider two distinct strings which nevertheless look identical:

py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
py> a == b
False
py> print(a, b)
ü ü

The purpose of normalisation is to turn one into the other:

py> import unicodedata
py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
True
py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
True

In the case of the ﬂ ligature, normalisation splits the ligature into
individual 'f' and 'l' code points regardless of whether you compose or
decompose:

py> unicodedata.normalize('NFKC', "\N{LATIN SMALL LIGATURE FL}ag") == "flag"
True
py> unicodedata.normalize('NFKD', "\N{LATIN SMALL LIGATURE FL}ag") == "flag"
True

That's using the compatibility forms (NFKC and NFKD). Using the canonical
forms (NFC and NFD) leaves the ligature unchanged.

Note that UTS #39 (security mechanisms) suggests that identifiers should be
normalised using NFKC.

[...]
> I think Haskell gets it right here, while Py3k does not. The “ﬂ” is not
> to be decomposed to “fl”.

The Unicode consortium seems to disagree with you. Table 1 of UTS #39 (see
link above) includes "Characters that cannot occur in strings normalized to
NFKC" in the Restricted category, that is, characters which should not be
used in identifiers. The ﬂ ligature cannot occur in such normalised
strings, and so it is classified as Restricted and should not be used in
identifiers.

I'm not entirely sure just how closely Python's identifiers follow the
standard, but I think that the intention is to follow something close to
"UAX31-R4. Equivalent Normalized Identifiers":

http://www.unicode.org/reports/tr31/#R4


[Rustom]
>> Python gets it wrong
>> >>> a=1
>> >>> A
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'A' is not defined
>
> This is not wrong; it is just different.

I agree with Thomas here. Case-insensitivity is a choice, and I don't think
it is a good choice for programming identifiers.
Being able to make case distinctions between (let's say):

SPAM  # a constant, or at least constant-by-convention
Spam  # a class or type
spam  # an instance

is useful.


[Rustom]
>> With ASCII the problems are minor: Case-distinct identifiers are distinct
>> -- they dont IDENTIFY.
>
> I do not think this is a problem.
>
>> This contradicts standard English usage and practice
>
> No, it does not.

I agree with Thomas here too. Although it is rare for case to make a
distinction in English, it does happen. As the old joke goes:

    Capitalisation is the difference between helping my Uncle Jack off
    a horse, and helping my uncle jack off a horse.

So even in English, capitalisation can make a semantic difference.


-- 
Steven