Re: Encoding of surrogate code points to UTF-8

MRAB Tue, 08 Oct 2013 10:03:05 -0700

On 08/10/2013 16:23, Pete Forman wrote:

Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:

I think this is a bug in Python's UTF-8 handling, but I'm not sure.

[snip]

py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?


http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D75 Surrogate pair: A representation for a single abstract character
   that consists of a sequence of two 16-bit code units, where the first
   value of the pair is a high-surrogate code unit and the second value
   is a low-surrogate code unit.

* Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
   Encoding Forms.)

* Isolated surrogate code units have no interpretation on their own.
   Certain other isolated code units in other encoding forms also have no
   interpretation on their own. For example, the isolated byte [\x80] has
   no interpretation in UTF-8; it can be used only as part of a multibyte
   sequence. (See Table 3-7). It could be argued that this line by itself
   should raise an error.


That first bullet indicates that it is indeed illegal to use surrogate
pairs in UTF-8 or UTF-32.
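
To make that concrete, here is a small sketch (my own illustration, not
from the standard; the names high, low, code_point and rejected are just
for the example). The pair D800 DC01 is how UTF-16 spells the single code
point U+10001, whereas UTF-8 encodes that code point directly:

```python
# The UTF-16 surrogate pair D800 DC01 stands for one code point.
high, low = 0xD800, 0xDC01
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
ch = chr(code_point)                    # a single character, U+10001

utf16_units = ch.encode('utf-16-be')    # the surrogate pair appears here
utf8_bytes = ch.encode('utf-8')         # four bytes, no surrogates involved

# Encoding the two surrogate code points themselves is refused.
try:
    '\ud800\udc01'.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

So the pair only ever shows up in the UTF-16 bytes; in UTF-8 it is the
code point itself that gets encoded.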

The only time you should get a surrogate pair in a Unicode string is in
a narrow build, and narrow builds no longer exist as of Python 3.3.
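
If you do end up with such a pair in a string on 3.3 or later, one way to
turn it into the character it was meant to be is a round trip through
UTF-16 with the 'surrogatepass' error handler. This is only a sketch: it
assumes Python 3.4+, where that handler works with the UTF-16 codecs, and
the names s and fixed are just for the example:

```python
s = '\ud800\udc01'      # two surrogate code points, not one character

# Write the surrogates out as raw UTF-16 code units, then decode them
# again so the codec combines the pair into a single code point.
fixed = s.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

encoded = fixed.encode('utf-8')   # now encodes without complaint
```

Here len(s) is 2 but len(fixed) is 1, and fixed is the same string as
'\U00010001'.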

--
https://mail.python.org/mailman/listinfo/python-list
