Guido van Rossum writes:
> I see nothing wrong with having the language's fundamental data types
> (i.e., the unicode object, and even the re module) to be defined in
> terms of codepoints, not characters, and I see nothing wrong with
> len() returning the number of codepoints (as long as it is advertised
> as such).
In fact, the Unicode Standard, Version 6, goes farther (to code units):
    2.7 Unicode Strings

    A Unicode string data type is simply an ordered sequence of code
    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
    code units, a Unicode 16-bit string is an ordered sequence of
    16-bit code units, and a Unicode 32-bit string is an ordered
    sequence of 32-bit code units.

    Depending on the programming environment, a Unicode string may or
    may not be required to be in the corresponding Unicode encoding
    form. For example, strings in Java, C#, or ECMAScript are Unicode
    16-bit strings, but are not necessarily well-formed UTF-16
    sequences.

(p. 32).
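
To make the distinction concrete, here is a rough sketch of how the same
string measures up in code points versus 8-, 16-, and 32-bit code units.
It assumes Python 3.3 or later, where len() on a str counts code points;
the helper names code_unit_counts and is_well_formed_utf16 are made up
for this illustration and are not part of any library.

```python
def code_unit_counts(text):
    """Count `text` in code points and in 8/16/32-bit code units.

    Assumption: Python 3.3+, where len(str) is the number of code points.
    """
    return {
        "code_points": len(text),
        # Lone surrogates cannot be encoded strictly, so pass them through.
        "utf8_units": len(text.encode("utf-8", "surrogatepass")),
        "utf16_units": len(text.encode("utf-16-le", "surrogatepass")) // 2,
        "utf32_units": len(text.encode("utf-32-le", "surrogatepass")) // 4,
    }


def is_well_formed_utf16(text):
    """Return True if `text` can be encoded as well-formed UTF-16."""
    try:
        text.encode("utf-16-le")  # the strict handler rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True


# 'a', LATIN SMALL LETTER E WITH ACUTE, MUSICAL SYMBOL G CLEF (non-BMP)
sample = "a\u00e9\U0001D11E"
counts = code_unit_counts(sample)
print(counts)

# A str holding a lone surrogate is still a valid sequence of code points,
# but it is not a well-formed UTF-16 sequence.
lone = "\ud800"
print(is_well_formed_utf16(sample), is_well_formed_utf16(lone))
```

The sample has three code points but seven 8-bit code units and four
16-bit code units, so "length" depends entirely on which unit is
advertised. The lone surrogate is the kind of string the quoted passage
has in mind for Java, C#, and ECMAScript.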