Ezio Melotti <ezio.melo...@gmail.com> added the comment:

FWIW Wikipedia says "Other characters must be encoded in UTF-16 (hence U+10000 
and higher would be encoded into surrogates) and then in modified Base64."

So one possible interpretation is that, when encoding a non-BMP character, it 
should first be converted into a surrogate pair, and then each of the surrogates 
should be encoded just like any other 16-bit code unit.
While decoding, it seems reasonable to do the opposite, i.e. recombine the 
surrogate pair.
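
To make that interpretation concrete, here is a rough sketch of the round trip 
for a single non-BMP character.  The helper names (to_surrogate_pair, 
from_surrogate_pair, utf7_shift_sequence) are made up for illustration and are 
not what the codec actually uses; the sketch also assumes the character is 
emitted as its own "+...-" shifted sequence.

```python
import base64

def to_surrogate_pair(ch):
    """Split a non-BMP character into (high, low) UTF-16 surrogates."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("expected a character above U+FFFF")
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one character."""
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

def utf7_shift_sequence(ch):
    """Encode one non-BMP character as a standalone UTF-7 shifted sequence.

    Each surrogate is written as a big-endian 16-bit code unit, and the
    result is base64-encoded without '=' padding (modified Base64).
    """
    high, low = to_surrogate_pair(ch)
    raw = high.to_bytes(2, "big") + low.to_bytes(2, "big")
    return b"+" + base64.b64encode(raw).rstrip(b"=") + b"-"

high, low = to_surrogate_pair("\U0001F600")
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(utf7_shift_sequence("\U0001F600"))    # b'+2D3eAA-'
print(from_surrogate_pair(high, low) == "\U0001F600")  # True
```

A decoder following the same interpretation would call something like 
from_surrogate_pair() whenever a high surrogate is immediately followed by a 
low one.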

The RFC doesn't say anything about lone surrogates, but I think that the fact 
that surrogates are used internally doesn't necessarily mean that the codec 
should be able to encode/decode them when they are not paired.  The other UTF-* 
codecs reject them, but that's because lone surrogates are explicitly forbidden 
by their respective standards.
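
For comparison, this is a quick way to check how the other UTF-* codecs treat a 
lone surrogate.  It assumes a CPython 3.4 or later interpreter (earlier 
versions behaved differently for utf-16 and utf-32), and the variable names are 
only illustrative.

```python
# A lone high surrogate, not followed by a low surrogate.
lone = "\ud800"

rejected = []
for name in ("utf-8", "utf-16", "utf-32"):
    try:
        lone.encode(name)
    except UnicodeEncodeError:
        # The codec refuses to encode an unpaired surrogate.
        rejected.append(name)

print(rejected)
```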

So I'm +1 on recombining them while decoding, and ±0 on allowing lone 
surrogates.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13333>
_______________________________________