Kristján Valur Jónsson <krist...@ccpgames.com> added the comment:

I conffess that I didn't follow the utf-8/surrogate discussion.
But the utf-8 encoding can encode all valid unicode characters:

UTF-8 may only legally be used to encode valid Unicode scalar values. According 
to the Unicode standard the high and low surrogate halves used by UTF-16 
(U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, 
and the UTF-8 encoding of them is an invalid byte sequence and should be 
treated as described above. (from wikipedia)

If we encounter surrogate halves when encoding (unicode) to utf-8, it means 
that we are really trying to decode utf-16 and reencode it as utf-8.  (and that 
python is using 16 bits for its unicode chars).  the utf--8 codec should be 
smart enough to merge the surrogates into a utf-32 char, and encode that.

Anyway, as you remark, my approach is a _patch_, designed to make python (2.x) 
work in an unicode environment, with the least amount of code change, for those 
willing to commit such a patch.  In 3.x you may want to do things differently.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1552880>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to