Roel Schroeven <[EMAIL PROTECTED]> writes:
> In this case s[0] is not the full Unicode scalar, but instead just the
> first part of the surrogate pair consisting of 0x1D40 (in s[0]) and
> 0x0000 (in s[1]).

Arrrrgggh.  After much head scratching I think I now understand what
you are saying.  This appears to me to be absolutely nuts.  What is
the purpose of having a unicode string type, if its sequence elements
are not guaranteed to be the unicode characters in the string?  Might
as well use byte strings for everything.

Come to think of it, I don't understand why we have this plethora of
encodings like utf-16.  utf-8 I can sort of understand on pragmatic
grounds, but aside from that I'd think UCS-4 should be used for everything,
and when a space-saving compressed representation is desired, then use
a general purpose data compression algorithm such as gzip.
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to