>> Strings contain Unicode code units, which for most purposes can be
>> treated as Unicode characters. However, even as "simple" an
>> operation as "s1[0] == s2[0]" cannot be relied upon to give
>> Unicode-conforming results.
>>
>> The second sentence remains true under PEP 393.
>
> Really? If strings contain code units, that expression compares code
> units. What is non-conforming about comparing two code points? They
> are just integers.
>
> Seriously, what does Unicode-conforming mean here?

I think he's referring to combining characters and normal forms.
Section 2.12 of the Unicode Standard starts with

  "In cases involving two or more sequences considered to be
  equivalent, the Unicode Standard does not prescribe one particular
  sequence as being the correct one; instead, each sequence is merely
  equivalent to the others"

That could be read to imply that the == operator should determine
whether two strings are equivalent. However, the Unicode standard
clearly leaves API design to the programming environment, and has
the notion of conformance only for processes. So saying that Python
is or is not Unicode-conforming is, strictly speaking, meaningless.

The closest conformance requirement in that respect is C6

  "A process shall not assume that the interpretations of two
  canonical-equivalent character sequences are distinct."

However, that explicitly does *not* support the conformance statement
that Stephen made. They elaborate

  "Ideally, an implementation would always interpret two
  canonical-equivalent character sequences identically. There are
  practical circumstances under which implementations may reasonably
  distinguish them."

So practicality beats purity even in Unicode conformance: the ==
operator of Python can reasonably treat equivalent strings as unequal
(and there is a good reason for that, indeed). Processes should not
expect that other applications make the same distinction, so they
need to cope if it matters to them. There are different ways to do
that:
- normalize all strings on input, and then use ==
- use a different comparison operation that always normalizes its
  input first

> This I agree with (though if you were referring to me with
> "leadership" I consider myself woefully underinformed about Unicode
> subtleties). I also suspect that Unicode "conformance" (however
> defined) is more part of a political battle than an actual necessity.

Fortunately, it's much better than that. Unicode has had very clear
conformance requirements for a long time, and they aren't hard to
meet. Wrt.
C6, Python could certainly improve, e.g. by caching whether a string
had been determined to be in normal form, so that applications can
more reasonably apply normalization to all strings they ever want to
compare.

Regards,
Martin
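
P.S. For illustration, the two ways of coping mentioned above (normalize on input and then use ==, or compare through an operation that normalizes first) might look roughly like the sketch below. It relies only on the standard unicodedata module; the helper names nfc and canonical_eq are made up for this example, and choosing NFC rather than another normal form is an assumption, not something the standard mandates.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)


def canonical_eq(a: str, b: str) -> bool:
    """Compare two strings after normalizing both (hypothetical helper name)."""
    return nfc(a) == nfc(b)


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

# Plain == compares code points, so canonical-equivalent strings can differ.
plain_equal = composed == decomposed

# Approach 1: normalize every string once on input, then use ==.
normalized_equal = nfc(composed) == nfc(decomposed)

# Approach 2: use a comparison operation that normalizes its arguments.
helper_equal = canonical_eq(composed, decomposed)

print(plain_equal, normalized_equal, helper_equal)
```

The first approach pays the normalization cost once per string, which is why remembering that a string is already in normal form would make it cheaper; the second re-normalizes on every comparison.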