Ross Ridge <[EMAIL PROTECTED]> writes:

> The Unicode standard doesn't require that you support surrogates,
> or any other kind of character, so no you wouldn't be lying.
+1 on Ross Ridge's contributions to this thread.

If Unicode is processed using the UTF-8 or UTF-32 encoding forms then there are no surrogates. They are only present in UTF-16. CESU-8, which carries surrogate pairs in a UTF-8-like form, is strongly discouraged.

A Unicode 16-bit string is allowed to be ill-formed as UTF-16. The example the standard gives is one string that ends with a high surrogate code unit and another that starts with a low surrogate code unit. Neither string is well-formed UTF-16 on its own, but the result of concatenating them is a valid UTF-16 string.

The above refers to the Unicode standard. In Python with narrow Py_UNICODE a unicode string is a sequence of 16-bit code units. It is up to the programmer whether they want to handle surrogate code units specially. Operations based on concatenation will conform to Unicode, whether or not there are surrogates in the strings.

-- 
Pete Forman
WesternGeco
[EMAIL PROTECTED]
http://petef.port5.com

Disclaimer: This post is originated by myself and does not represent the opinion of Schlumberger or WesternGeco.
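
P.S. The concatenation case described above can be sketched in Python. This is only an illustration, not anything taken from the standard or from the interpreter: it assumes Python 3, where a str is no longer a sequence of 16-bit units, so the code units are modelled as plain lists of integers. The function name is_well_formed_utf16 and the variables left and right are made up for the example, and U+1D11E (whose UTF-16 form is D834 DD1E) is just a convenient supplementary character.

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is well-formed UTF-16.

    Hypothetical helper for illustration: every high surrogate must be
    followed immediately by a low surrogate, and no low surrogate may
    appear without a preceding high surrogate.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: needs a low surrogate right after it.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no high surrogate before it.
            return False
        i += 1
    return True


# "A" followed by a lone high surrogate: ill-formed as UTF-16.
left = [0x0041, 0xD834]
# A lone low surrogate followed by "B": also ill-formed as UTF-16.
right = [0xDD1E, 0x0042]

# Plain concatenation of the code units gives a well-formed sequence.
joined = left + right

# Pack the units as little-endian 16-bit values and decode them as UTF-16.
decoded = b"".join(u.to_bytes(2, "little") for u in joined).decode("utf-16-le")

print(is_well_formed_utf16(left), is_well_formed_utf16(right))
print(is_well_formed_utf16(joined), len(decoded))
```

Neither half passes the check on its own, but the joined sequence does, and it decodes to three code points: "A", U+1D11E and "B".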