On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
> Terry Reedy writes:

>  >  Excuse me for believing the fine 3.2 manual that says
>  >  "Strings contain Unicode characters."

> The manual is wrong, then, subject to a pronouncement to the contrary.

Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.

>  >  For the purpose of my sentence, they are the same thing, in that
>  >  code points correspond to characters,

> Not in Unicode, they do not.  By definition, a small number of code
> points (e.g., U+FFFF) *never* did and *never* will correspond to
> characters.

On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C says
code point:
1) i in range(0x110000) <broad definition>
2) "A value, or position, for a character" <narrow definition>
(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.
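
To make that concrete, here is a minimal sketch using the stdlib unicodedata module. U+FFFF is a code point in sense 1), yet it is defined never to be a character:

    import unicodedata

    # In the broad sense, every i in range(0x110000) is a code point,
    # and Python will put any of them into a str:
    ch = chr(0xFFFF)
    print(len(ch))                    # 1
    # But U+FFFF is a designated noncharacter: it never has been, and
    # never will be, assigned to a character, so its category is 'Cn'.
    print(unicodedata.category(ch))   # 'Cn' (unassigned)
    print(unicodedata.category('A'))  # 'Lu' (an assigned character)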

>  >  Any narrow build string with even 1 non-BMP char violates the
>  >  standard.

> Yup.  That's by design.
> [...]
> Sure.  Nevertheless, practicality beat purity long ago, and that
> decision has never been rescinded AFAIK.

I think you have it backwards. I see the current situation as the purity of the C code beating the practicality, for users, of getting right answers.
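
To make the 'wrong answers' concrete, here is a small sketch; the comments describe the behavior I would expect on a 3.2 narrow versus wide build, not captured output:

    import sys

    s = '\U00010480'            # DESERET CAPITAL LETTER LONG I, a non-BMP character
    print(hex(sys.maxunicode))  # 0xffff on a narrow build, 0x10ffff on a wide build
    print(len(s))               # 2 on a narrow build (a surrogate pair), 1 on a wide build
    print(s[0])                 # on a narrow build, a lone surrogate -- half a character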

> The thing is that 90% of applications are not really going to care
> about full conformance to the Unicode standard.

I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then-new chips occasionally gave 'non-standard' answers with certain divisors.

>  >  Currently, the meaning of Python code differs on narrow versus wide
>  >  build, and in a way that few users would expect or want.

> Let them become developers, then, and show us how to do it better.

I posted a proposal with a link to a prototype implementation in Python. It pretty well solves the problem of narrow builds acting differently from wide builds with respect to the basic operations of len(), iteration, indexing, and slicing.
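
The basic idea is just to treat a surrogate pair as one unit. This is not the prototype itself, only a toy sketch of the principle (code_points is a hypothetical helper that joins UTF-16 surrogate pairs back into single units):

    def code_points(s):
        # Yield one item per code point, joining surrogate pairs so that
        # a non-BMP character counts as one unit even on a narrow build
        # (on a wide build the pairs simply never occur).
        it = iter(s)
        for ch in it:
            if '\ud800' <= ch <= '\udbff':        # high surrogate
                low = next(it, '')
                if '\udc00' <= low <= '\udfff':   # matching low surrogate
                    yield ch + low                # one non-BMP code point
                    continue
                yield ch
                if low:
                    yield low
                continue
            yield ch

    # len(list(code_points(s))) then agrees on narrow and wide builds.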

> No, I do like the PEP.  However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model.  In particular, it does nothing to resolve the fact
> that len() will give different answers for character count depending
> on normalization, and that slicing and indexing will allow you to cut
> characters in half (even in NFC, since not all composed characters
> have fully composed forms).

I believe my scheme could be extended to solve that also. It would require more pre-processing, and more knowledge of normalization than I currently have. I have the impression that the grapheme problem goes further than just normalization.
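
The normalization half, at least, is easy to demonstrate with a small sketch (the last line shows a combining sequence that has no precomposed form, so even NFC leaves two code points):

    import unicodedata

    # One user-perceived character, two different code-point counts:
    nfc = unicodedata.normalize('NFC', 'e\u0301')  # e + COMBINING ACUTE ACCENT
    nfd = unicodedata.normalize('NFD', '\u00e9')   # LATIN SMALL LETTER E WITH ACUTE
    print(len(nfc), len(nfd))                      # 1 2

    # Slicing by code point can cut a character in half:
    s = 'Cafe\u0301'                               # "Cafe" + combining acute
    print(s[:4])                                   # 'Cafe' -- the accent is sliced off

    # Some sequences have no fully composed form at all:
    print(len(unicodedata.normalize('NFC', 'q\u0323')))   # 2 (q + COMBINING DOT BELOW)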

--
Terry Jan Reedy
