On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:

> In any case, "\ud800\udc01" isn't a valid unicode string.

I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a
critical point, and the FAQ conflates *encoded strings* (i.e. bytes
using one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is
a legal string (subject to somebody pointing me to a definitive source
that proves it is not). However, it *may or may not* be encodable to
bytes using UTF-8, -16 or -32. Just as there are byte sequences that
cannot be generated by the UTFs, possibly there are code point sequences
that cannot be converted to bytes using the UTFs.

> In a perfect
> world it would automatically get converted to '\u00010001' without
> intervention.

I certainly hope not, because Unicode string != UTF-16.

This is equivalent to saying:

When encoding the sequence of code points '\ud800\udc01' to UTF-8 bytes,
you should get the same result as if you treated the sequence of code
points as if it were bytes, decoded it using UTF-16, and then encoded
using UTF-8.

That would be a horrible, horrible design, since it privileges UTF-16 in
a completely inappropriate way.

I *really* hope I am wrong, but I fear that is my interpretation of the
FAQ.


[1] Sequences of Unicode code points.

-- 
Steven
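
For anyone who wants to poke at the distinction between a string of code points and its encoded bytes, here is a minimal sketch. It assumes CPython 3.4 or later (earlier versions and other implementations may behave differently), and the variable names are only illustrative. It shows how one interpreter behaves; it does not settle what the standard says.

```python
# Assumption: CPython 3.4+. Names below are illustrative, not from the thread.

s = '\ud800\udc01'  # a str holding two surrogate code points
code_points = [hex(ord(c)) for c in s]

# The strict UTF-8 codec refuses to encode surrogate code points.
try:
    s.encode('utf-8')
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

# The conversion being objected to: treat the two code points as UTF-16
# code units, decode them as a surrogate pair, then encode as UTF-8.
utf16_units = s.encode('utf-16-le', 'surrogatepass')
paired = utf16_units.decode('utf-16-le')
via_utf16 = paired.encode('utf-8')

# By contrast, 'surrogatepass' on the UTF-8 codec encodes each surrogate
# separately, as its own three-byte sequence.
passed_through = s.encode('utf-8', 'surrogatepass')

print(code_points, strict_utf8_ok, ascii(paired), via_utf16, passed_through)
```

The str holds two code points, the strict encoder rejects it, and the two byte results differ. That is the gap between "a sequence of code points" and "something a UTF can encode".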