Roel Schroeven <[EMAIL PROTECTED]> writes: > In this case s[0] is not the full Unicode scalar, but instead just the > first part of the surrogate pair consisting of 0x1D40 (in s[0]) and > 0x0000 (in s[1]).
Arrrrgggh. After much head scratching I think I now understand what you are saying. This appears to me to be absolutely nuts. What is the purpose of having a unicode string type, if its sequence elements are not guaranteed to be the unicode characters in the string? Might as well use byte strings for everything. Come to think of it, I don't understand why we have this plethora of encodings like utf-16. utf-8 I can sort of understand on pragmatic grounds, but aside from that I'd think UCS-4 should be used for everything, and when a space-saving compressed representation is desired, then use a general purpose data compression algorithm such as gzip. -- http://mail.python.org/mailman/listinfo/python-list