On 6/5/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

> >> > Unicode does say pretty clearly that (at least) canonical
> >> > equivalents must be treated the same.
On reflection, what it actually says is that you may not assume they
are different.  They can be different in the same way that two
identical strings are different under "is", but anything stronger has
to be strictly internal.  If any code outside the python core even
touches the string, then the choice of representations becomes
arbitrary, and can switch for spurious reasons.  Immutability should
prevent mid-run switching for a single "is" string, but not for
different strings that should compare "==".

Dictionary keys need to keep working, which means hash and equality
have to do the right thing.  Ordering may technically be a
quality-of-implementation issue, but ... normalizing strings on
creation solves an awful lot of problems, including providing a "best
practice" for C extensions.  Not normalizing will save a small amount
of time, at the cost of a never-ending hunt for rare and obscure bugs.

> >> Chapter and verse, please?

> > I am pretty sure this list is not exhaustive, but it may be
> > helpful:
> >
> > The Identifiers Annex http://www.unicode.org/reports/tr31/

> Ah, that's in the context of identifiers, not in the context of text
> in general.

Yes, but that should also apply to dict and shelve keys.  If you want
an array of code points, then you want a tuple of ints, not text.

> > """
> > Normalization Forms KC and KD must not be blindly
> > applied to arbitrary text.
> > """

Note that it lists only the Kompatibility forms.  By implication,
forms NFC and NFD *can* be blindly applied to arbitrary text.  (And
conformance rule C9 means you have to assume that someone else might
do so, if, say, the text is python source code that may have been
externally edited.)

...

> > """
> > They can be applied more freely to domains with restricted
> > character sets, such as in Section 13, Programming
> > Language Identifiers.
> > """
> >
> > (section 13 then forwards back to UAX31)

> How is that a requirement that comparison should apply
> normalization?

It isn't a requirement that we apply normalization.  But

(1)  There is a requirement that semantics not change based on
external canonical [de]normalization of source code, including
literal strings.  (I agree that explicit python-level escapes -- made
after the file has already been converted from bytes to characters --
are legitimate, just as changing 1.0 from a string to a number is
legitimate.)

(2)  It is a *suggestion* that we consider the stronger Kompatibility
normalizations for source code.  There are cases where strings which
are equal under Kompatibility should be treated differently, but I
think that, in practice, the difference is more likely to come from
typos or from difficulty entering the proper characters.

Normalizing to the compatibility form would be helpful for some
people (Japanese and Korean input was mentioned).  I think the need
to distinguish the Kompatibility characters (and not even in data; in
source literals) will be rare enough that it is worth making the
distinction explicit.  (If you need to use a compatibility character,
then use an escape rather than the character itself, so that people
will know you really mean the alternate, instead of the "normal"
character that merely looks like it.)

-jJ
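
P.S.  A minimal sketch of the dict-key point above, assuming nothing
beyond the standard unicodedata module (the sample strings are
illustrative, not from this thread): two canonically equivalent
spellings of the same text compare unequal and occupy separate dict
slots unless something normalizes them -- to NFC, say -- on creation.

    import unicodedata

    # U+00E9 (precomposed e-acute) vs. "e" + U+0301 (combining acute):
    # canonically equivalent, but different code point sequences.
    nfc_form = "\u00e9"
    nfd_form = "e\u0301"

    print(nfc_form == nfd_form)             # False: compared code point by code point
    print(len({nfc_form: 1, nfd_form: 2}))  # 2: they land in separate dict slots

    # Normalizing on creation makes the equivalence visible to == and hash().
    a = unicodedata.normalize("NFC", nfc_form)
    b = unicodedata.normalize("NFC", nfd_form)
    print(a == b, hash(a) == hash(b))       # True True
    print(len({a: 1, b: 2}))                # 1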
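
P.P.S.  A similar sketch for the Kompatibility forms (again, the
specific characters are just illustrative choices): NFKC folds away
distinctions that are often input-method noise -- full-width Latin
letters from Japanese or Korean input, for instance -- but it also
erases distinctions that can matter in ordinary text, which is why
the standard warns against applying KC/KD blindly.  NFC leaves all of
these characters alone.

    import unicodedata

    # Full-width 'A' (U+FF21), as produced by some CJK input methods,
    # folds to plain 'A' under NFKC -- handy for identifiers.
    print(unicodedata.normalize("NFKC", "\uff21"))             # A

    # But NFKC also loses information that plain text may care about:
    # the 'fi' ligature (U+FB01) and superscript two (U+00B2).
    print(unicodedata.normalize("NFKC", "\ufb01"))             # fi
    print(unicodedata.normalize("NFKC", "m\u00b2"))            # m2

    # The canonical form NFC leaves all of these characters unchanged.
    print(unicodedata.normalize("NFC", "\uff21") == "\uff21")  # True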
