I think this is a bug in Python's UTF-8 handling, but I'm not sure. If I've read the Unicode FAQs correctly, you cannot encode *lone* surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5

Sure enough, using Python 3.3:

py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs of surrogates. So if I find a code point that encodes to a surrogate pair in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'

and then use those same values as the code points, I ought to be able to encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code point. But I can't:

py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Have I misunderstood? I think that Python is being too strict about rejecting surrogate code points. It should only reject lone surrogates, or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs, or is this a bug in Python's handling of UTF-8?


-- 
Steven
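
P.S. To make the "same code point" claim above concrete, here is a minimal sketch of the arithmetic I am assuming. It uses the usual UTF-16 decoding formula to turn the pair 0xD800, 0xDC01 back into a single code point, then encodes that code point to UTF-8. The helper name combine_surrogates is my own invention for illustration, not anything in the standard library, and the sketch says nothing about what the utf-8 codec itself ought to do with a str that contains surrogates.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair (two ints) to the code point it represents.

    Hypothetical helper for illustration only; not part of the stdlib.
    """
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high (lead) surrogate: %#06x" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low (trail) surrogate: %#06x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDC01)
char = chr(code_point)

print(hex(code_point))                          # 0x10001
print(char == '\N{LINEAR B SYLLABLE B038 E}')   # True
print(char.encode('utf-8'))                     # b'\xf0\x90\x80\x81'
```

So the single code point U+10001 encodes to UTF-8 without complaint; it is only the two-element string '\ud800\udc01' that the codec refuses.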