On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
> Terry Reedy writes:

>  >  Excuse me for believing the fine 3.2 manual that says
>  >  "Strings contain Unicode characters."

> The manual is wrong, then, subject to a pronouncement to the contrary.

Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.

>  >  For the purpose of my sentence, they are the same thing, in that
>  >  code points correspond to characters,

> Not in Unicode, they do not.  By definition, a small number of code
> points (e.g., U+FFFF) *never* did and *never* will correspond to
> characters.

On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C says
code point:
1) i in range(0x110000) <broad definition>
2) "A value, or position, for a character" <narrow definition>
(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.
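
To make that concrete, here is a minimal sketch using the stdlib unicodedata module. U+FFFF is a code point in sense 1), yet it is defined never to be a character:

    import unicodedata

    # In the broad sense, every i in range(0x110000) is a code point,
    # and Python will put any of them into a str:
    ch = chr(0xFFFF)
    print(len(ch))                    # 1
    # But U+FFFF is a designated noncharacter: it never has been, and
    # never will be, assigned to a character, so its category is 'Cn'.
    print(unicodedata.category(ch))   # 'Cn' (unassigned)
    print(unicodedata.category('A'))  # 'Lu' (an assigned character)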

>  >  Any narrow build string with even 1 non-BMP char violates the
>  >  standard.

> Yup.  That's by design.
> [...]
> Sure.  Nevertheless, practicality beat purity long ago, and that
> decision has never been rescinded AFAIK.

I think you have it backwards. I see the current situation as the purity of the C code beating the practicality, for users, of getting right answers.
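
To make the 'wrong answers' concrete, here is a small sketch; the comments describe the behavior I would expect on a 3.2 narrow versus wide build, not captured output:

    import sys

    s = '\U00010480'            # DESERET CAPITAL LETTER LONG I, a non-BMP character
    print(hex(sys.maxunicode))  # 0xffff on a narrow build, 0x10ffff on a wide build
    print(len(s))               # 2 on a narrow build (a surrogate pair), 1 on a wide build
    print(s[0])                 # on a narrow build, a lone surrogate -- half a character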

> The thing is that 90% of applications are not really going to care
> about full conformance to the Unicode standard.

I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then-new chips occasionally gave 'non-standard' answers with certain divisors.

>  >  Currently, the meaning of Python code differs on narrow versus wide
>  >  build, and in a way that few users would expect or want.

> Let them become developers, then, and show us how to do it better.

I posted a proposal with a link to a prototype implementation in Python. It pretty well solves the problem of narrow builds acting differently from wide builds with respect to the basic operations of len(), iteration, indexing, and slicing.
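
The basic idea is just to treat a surrogate pair as one unit. This is not the prototype itself, only a toy sketch of the principle (code_points is a hypothetical helper that joins UTF-16 surrogate pairs back into single units):

    def code_points(s):
        # Yield one item per code point, joining surrogate pairs so that
        # a non-BMP character counts as one unit even on a narrow build
        # (on a wide build the pairs simply never occur).
        it = iter(s)
        for ch in it:
            if '\ud800' <= ch <= '\udbff':        # high surrogate
                low = next(it, '')
                if '\udc00' <= low <= '\udfff':   # matching low surrogate
                    yield ch + low                # one non-BMP code point
                    continue
                yield ch
                if low:
                    yield low
                continue
            yield ch

    # len(list(code_points(s))) then agrees on narrow and wide builds.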

> No, I do like the PEP.  However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model.  In particular, it does nothing to resolve the fact
> that len() will give different answers for character count depending
> on normalization, and that slicing and indexing will allow you to cut
> characters in half (even in NFC, since not all composed characters
> have fully composed forms).

I believe my scheme could be extended to solve that also. It would require more pre-processing, and more knowledge of normalization than I currently have. I have the impression that the grapheme problem goes further than just normalization.
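
The normalization half, at least, is easy to demonstrate with a small sketch (the last line shows a combining sequence that has no precomposed form, so even NFC leaves two code points):

    import unicodedata

    # One user-perceived character, two different code-point counts:
    nfc = unicodedata.normalize('NFC', 'e\u0301')  # e + COMBINING ACUTE ACCENT
    nfd = unicodedata.normalize('NFD', '\u00e9')   # LATIN SMALL LETTER E WITH ACUTE
    print(len(nfc), len(nfd))                      # 1 2

    # Slicing by code point can cut a character in half:
    s = 'Cafe\u0301'                               # "Cafe" + combining acute
    print(s[:4])                                   # 'Cafe' -- the accent is sliced off

    # Some sequences have no fully composed form at all:
    print(len(unicodedata.normalize('NFC', 'q\u0323')))   # 2 (q + COMBINING DOT BELOW)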

--
Terry Jan Reedy
