> Basically everything but string forming or string printing seems to be
> broken for surrogate pairs, from what I can tell.

We probably disagree about what "it works correctly" means. I think
everything works correctly.

> Also, I think you are confused about slicing in the middle of a surrogate
> pair, from a UTF-16 perspective this is 1 codepoint!

Yes, but it is two code units. Python's UTF-16 implementation operates
on code units, not code points.

> And as such Python
> needs to treat it as one character/codepoint in a string, dealing with
> slicing as appropriate.

It does. However, functions such as len, and all indexing, operate
in code units, not code points.

> The way you currently describe it is that UTF-16
> strings will be treated as UCS-2 when it comes to slicing and the likes.

No. In UCS-2, the surrogate range is reserved (for UTF-16). In Python,
it's not reserved, but interpreted as UTF-16.

> From a UTF-16 point of view such slicing can NEVER occur unless you are bit
> or byte slicing instead of character/codepoint slicing.

It most certainly can. UTF-16 is not a character set, but a character
encoding form (unlike UCS-2, which is a coded character set). Slicing
*can* occur at the code unit level. If UTF-16 is also understood as a
character encoding scheme (by means of the BOM), then slicing can occur
even on the byte level.

> I think it can be fairly said that an item in a string is a character or
> codepoint.

Not in Python - it's a code unit.

Regards,
Martin
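
To make the code unit / code point distinction concrete, here is a minimal
sketch. It assumes a current Python 3 interpreter, where (since Python 3.3)
str is indexed by code point, so the code-unit view described above is
emulated by encoding to UTF-16 explicitly. The helper name utf16_code_units
is hypothetical and only serves the illustration.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    Hypothetical helper: it encodes to little-endian UTF-16 (no BOM)
    and unpacks the bytes as unsigned 16-bit values.
    """
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
s = "a\U00010400b"
units = utf16_code_units(s)

code_point_count = len(s)      # counts code points on Python 3.3+
code_unit_count = len(units)   # counts UTF-16 code units

# Slicing at the code unit level can split the pair: the slice below
# ends with a lone high surrogate.
head = units[:2]
high, low = units[1], units[2]

print(code_point_count, code_unit_count)
print([hex(u) for u in units])
print([hex(u) for u in head])
```

Here the string has three code points but four code units, and a slice
taken at the code unit level ends in the middle of the surrogate pair.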