Terry Reedy writes:

> Please suggest a re-wording then, as it is a bug for doc and behavior
> to disagree.
    Strings contain Unicode code units, which for most purposes can
    be treated as Unicode characters.  However, even as "simple" an
    operation as "s1[0] == s2[0]" cannot be relied upon to give
    Unicode-conforming results.

The second sentence remains true under PEP 393.

> > > For the purpose of my sentence, the same thing in that code
> > > points correspond to characters,
> >
> > Not in Unicode, they do not.  By definition, a small number of
> > code points (eg, U+FFFF) *never* did and *never* will correspond
> > to characters.
>
> On computers, characters are represented by code points.  What about
> the other way around?  http://www.unicode.org/glossary/#C says
>    code point:
>      1) i in range(0x110000)                      <broad definition>
>      2) "A value, or position, for a character"   <narrow definition>
> (To muddy the waters more, 'character' has multiple definitions also.)
> You are using 1), I am using 2) ;-(.

No, you're not.  You are claiming an isomorphism, which Unicode goes to
great trouble to avoid.

> I think you have it backwards.  I see the current situation as the
> purity of the C code beating the practicality for the user of
> getting right answers.

Sophistry.  "Always getting the right answer" is purity.

> > The thing is, that 90% of applications are not really going to
> > care about full conformance to the Unicode standard.
>
> I remember when Intel argued that 99% of applications were not going
> to be affected when the math coprocessor in its then new chips
> occasionally gave 'non-standard' answers with certain divisors.

In the case of Intel, the people who demanded standard answers did so
for efficiency reasons -- they needed the FPU to DTRT because
implementing FP in software was always going to be too slow.  CPython,
IMO, can afford to trade off because the implementation will
necessarily be in software, and can be added later as a Python or C
module.

> I believe my scheme could be extended to solve [conformance for
> composing characters] also.  It would require more pre-processing
> and more knowledge than I currently have of normalization.  I have
> the impression that the grapheme problem goes further than just
> normalization.

Yes and yes.  But now you're talking about database lookups for every
character (to determine whether it's a composing character).
Efficiency of a generic implementation isn't going to happen.

Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's
pronouncement, "indexing is going to be O(1)".

And Nick's point about non-uniform arrays is telling.  I have 20 years
of experience with an implementation of text as a non-uniform array
which presents an array API, and *everything* needs to be special-cased
for efficiency, and *any* small change can have show-stopping
performance implications.

Python can probably do better than Emacs has done due to much better
leadership in this area, but I still think it's better to make full
conformance optional.
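For concreteness, here is a minimal sketch of the conformance points
above, using only the stdlib unicodedata module; the particular strings
are illustrative, not taken from any proposal in this thread:

    import unicodedata

    # Two spellings of the same text: precomposed e-acute versus "e"
    # followed by a combining acute accent.  They are canonically
    # equivalent, but indexing and comparison see only code points.
    s1 = "\u00e9cole"        # LATIN SMALL LETTER E WITH ACUTE + "cole"
    s2 = "e\u0301cole"       # "e" + COMBINING ACUTE ACCENT + "cole"

    print(s1 == s2)          # False -- raw code-point comparison
    print(s1[0] == s2[0])    # False -- the "same" first character
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))    # True -- equivalent text

    # U+FFFF is a legal code point but a designated noncharacter: it
    # can be stored and indexed, yet never corresponds to a character.
    print(unicodedata.category("\uffff"))      # 'Cn'

    # Normalization alone does not close the grapheme gap: "q" plus a
    # combining acute has no precomposed form, so NFC still yields two
    # code points for what a reader perceives as one character.
    print(len(unicodedata.normalize("NFC", "q\u0301")))   # 2

Full conformance would mean doing that kind of normalization (and more)
behind every comparison and index, which is exactly the cost being
discussed above.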