On 8/31/2011 10:20 AM, Guido van Rossum wrote:
On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
The str type itself can presently be used to process other
character encodings: if they are fixed width, < 32-bit elements, those
encodings might be considered Unicode encodings, but there is no requirement
that they are, and some operations on str may operate with knowledge of some
Unicode semantics, so there are caveats.
Actually, the str type in Python 3 and the unicode type in Python 2
are constrained everywhere to either 16-bit or 21-bit "characters".
(Except when writing C code, which can do any number of invalid things
so is the equivalent of assuming 1 == 0.) In particular, on a wide
build, there is no way to get a code point >= 2**21, and I don't want
PEP 393 to change this. So at best we can use these types to represent
arrays of 21-bit unsigned ints. But I think it is more useful to think
of them as always representing "some form of Unicode", whether that is
UTF-16 (on narrow builds) or 21-bit code points or perhaps some
vaguely similar superset -- but for those code units/code points that
are representable *and* valid (either code points or code units)
according to the (supported version of the) Unicode standard, the
meaning of those code points/units matches that of the standard.

Note that this is different from the bytes type, where the meaning of
a byte is entirely determined by what it means in the programmer's
head.
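That distinction shows up directly in method behavior; a minimal sketch of the contrast (using case mapping as one example of Unicode semantics):

```python
# str methods carry Unicode semantics; bytes methods only know ASCII.
s = '\u00e9'             # 'é', U+00E9 LATIN SMALL LETTER E WITH ACUTE
b = s.encode('utf-8')    # b'\xc3\xa9' -- just two bytes, no meaning attached

print(s.upper())         # 'É' -- case mapping per the Unicode standard
print(b.upper())         # b'\xc3\xa9' -- unchanged; non-ASCII bytes have no case
```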


Sorry, my Perl background is leaking through. I didn't double check that str constrains the values of each element to range(0x110000), but I see now by testing that it does. For some of my ideas, then, either a subtype of str would have to relax that constraint, or str would not be the appropriate base type to use (but there are other base types that could be used, so this is not a serious issue for the ideas).
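For the record, the test amounts to something like this (a sketch; chr() is just the simplest way to probe the bound):

```python
# str elements are restricted to the Unicode code point range,
# i.e. range(0x110000), not arbitrary 32-bit unsigned ints.
print(chr(0x10FFFF))          # highest valid code point -- accepted

try:
    chr(0x110000)             # one past the Unicode range
except ValueError as exc:
    print("rejected:", exc)   # chr() refuses out-of-range values
```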

I have no problem with thinking of str as representing "some form of Unicode". None of my proposals change that, although they may change other things, and may invent new forms of Unicode representations. You have stated that it is better to document what str actually does, rather than attempt to adhere slavishly to Unicode standard concepts. The Unicode Consortium may well define legal, conforming bytestreams for communicating processes, but languages and applications are free to use other representations internally. We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams, or we can invent a representation (whether called str or something else) that is useful and efficient in practice.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev