Terry Reedy writes:

> Please suggest a re-wording then, as it is a bug for doc and behavior
> to disagree.
    Strings contain Unicode code units, which for most purposes can
    be treated as Unicode characters.  However, even as "simple" an
    operation as "s1[0] == s2[0]" cannot be relied upon to give
    Unicode-conforming results.

The second sentence remains true under PEP 393.

> > > For the purpose of my sentence, the same thing in that code
> > > points correspond to characters,
> >
> > Not in Unicode, they do not.  By definition, a small number of
> > code points (eg, U+FFFF) *never* did and *never* will correspond
> > to characters.
>
> On computers, characters are represented by code points.  What about
> the other way around?  http://www.unicode.org/glossary/#C says
>    code point:
>      1) i in range(0x110000)                      <broad definition>
>      2) "A value, or position, for a character"   <narrow definition>
> (To muddy the waters more, 'character' has multiple definitions also.)
> You are using 1), I am using 2) ;-(.

No, you're not.  You are claiming an isomorphism, which Unicode goes to
great trouble to avoid.

> I think you have it backwards.  I see the current situation as the
> purity of the C code beating the practicality for the user of
> getting right answers.

Sophistry.  "Always getting the right answer" is purity.

> > The thing is, that 90% of applications are not really going to
> > care about full conformance to the Unicode standard.
>
> I remember when Intel argued that 99% of applications were not going
> to be affected when the math coprocessor in its then new chips
> occasionally gave 'non-standard' answers with certain divisors.

In the case of Intel, the people who demanded standard answers did so
for efficiency reasons -- they needed the FPU to DTRT because
implementing FP in software was always going to be too slow.  CPython,
IMO, can afford to trade off because the implementation will
necessarily be in software, and can be added later as a Python or C
module.

> I believe my scheme could be extended to solve [conformance for
> composing characters] also.  It would require more pre-processing
> and more knowledge than I currently have of normalization.  I have
> the impression that the grapheme problem goes further than just
> normalization.

Yes and yes.  But now you're talking about database lookups for every
character (to determine whether it's a composing character).
Efficiency of a generic implementation isn't going to happen.

Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's
pronouncement, "indexing is going to be O(1)".

And Nick's point about non-uniform arrays is telling.  I have 20 years
of experience with an implementation of text as a non-uniform array
which presents an array API, and *everything* needs to be special-cased
for efficiency, and *any* small change can have show-stopping
performance implications.

Python can probably do better than Emacs has done due to much better
leadership in this area, but I still think it's better to make full
conformance optional.
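For concreteness, here is a minimal sketch of the conformance points
above, using only the stdlib unicodedata module; the particular strings
are illustrative, not taken from any proposal in this thread:

    import unicodedata

    # Two spellings of the same text: precomposed e-acute versus "e"
    # followed by a combining acute accent.  They are canonically
    # equivalent, but indexing and comparison see only code points.
    s1 = "\u00e9cole"        # LATIN SMALL LETTER E WITH ACUTE + "cole"
    s2 = "e\u0301cole"       # "e" + COMBINING ACUTE ACCENT + "cole"

    print(s1 == s2)          # False -- raw code-point comparison
    print(s1[0] == s2[0])    # False -- the "same" first character
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))    # True -- equivalent text

    # U+FFFF is a legal code point but a designated noncharacter: it
    # can be stored and indexed, yet never corresponds to a character.
    print(unicodedata.category("\uffff"))      # 'Cn'

    # Normalization alone does not close the grapheme gap: "q" plus a
    # combining acute has no precomposed form, so NFC still yields two
    # code points for what a reader perceives as one character.
    print(len(unicodedata.normalize("NFC", "q\u0301")))   # 2

Full conformance would mean doing that kind of normalization (and more)
behind every comparison and index, which is exactly the cost being
discussed above.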