Serhiy Storchaka added the comment:

> It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.

UTF-8 itself prohibits them too. See RFC 3629, section 3:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.
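
The rule quoted above can be checked from Python. The following is only a minimal sketch, assuming a CPython 3 interpreter (Python 2 codecs behaved differently); the names LONE, LONE_BYTES, strict_utf8_rejects_surrogate and utf16_to_utf8 are made up for illustration and are not part of any library.

```python
# Assumption: CPython 3.x. All names below are illustrative only.

# A lone high surrogate as a str, and the 3-byte sequence that a
# non-conforming encoder would produce for it.
LONE = "\ud800"
LONE_BYTES = b"\xed\xa0\x80"


def strict_utf8_rejects_surrogate():
    """Return (encode_rejected, decode_rejected) for the strict UTF-8 codec."""
    try:
        LONE.encode("utf-8")
        encode_rejected = False
    except UnicodeEncodeError:
        encode_rejected = True
    try:
        LONE_BYTES.decode("utf-8")
        decode_rejected = False
    except UnicodeDecodeError:
        decode_rejected = True
    return encode_rejected, decode_rejected


def utf16_to_utf8(data):
    """Decode little-endian UTF-16 to code points first, then encode as UTF-8.

    This follows the order the RFC describes: surrogate pairs are combined
    into a single character number before any UTF-8 bytes are produced.
    """
    return data.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    print(strict_utf8_rejects_surrogate())
    # U+1F600 is the surrogate pair D83D DE00 in UTF-16.
    print(utf16_to_utf8(b"\x3d\xd8\x00\xde"))
    # The non-standard "surrogatepass" handler opts out of the prohibition.
    print(LONE.encode("utf-8", "surrogatepass"))
```

The "surrogatepass" error handler exists for callers that deliberately need the non-conforming bytes; the default strict mode follows the RFC.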

----------
nosy: +storchaka

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11489>
_______________________________________