Re: Encoding of surrogate code points to UTF-8

Terry Reedy Tue, 08 Oct 2013 15:19:39 -0700

On 10/8/2013 5:47 PM, Terry Reedy wrote:

On 10/8/2013 9:52 AM, Steven D'Aprano wrote:

But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs
of surrogates.


It says you should be able to 'convert' them, and that the result for
utf-8 encoding must be a single 4-bytes code for the corresponding
supplementary codepoint.

To expand on this: The FAQ question is "How do I convert a UTF-16
surrogate pair such as <D800 DC00> to UTF-8?" utf-16 and utf-8 are both
byte (or double byte) encodings of codepoints. Direct conversion would
be 'transcoding', not encoding. Python has a few bytes transcoders and
one string transcoder (rot_13), listed at the end of

http://docs.python.org/3/library/codecs.html#python-specific-encodings
But in general, one must decode bytes to string and encode back to bytes.
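
A minimal sketch of that decode-then-encode round trip, using the FAQ's
<D800 DC00> pair as input. The variable names are made up for
illustration, and the input is assumed to be big-endian UTF-16 without a
BOM:

```python
# The FAQ's surrogate pair <D800 DC00>, as UTF-16BE bytes (assumed: no BOM).
utf16_bytes = b'\xd8\x00\xdc\x00'

# Step 1: decode bytes to str. The surrogate pair becomes ONE code point.
text = utf16_bytes.decode('utf-16be')

# Step 2: encode the str back to bytes, this time as UTF-8.
utf8_bytes = text.encode('utf-8')

print(len(text), hex(ord(text)))  # 1 0x10000
print(utf8_bytes)                 # b'\xf0\x90\x80\x80'
```

The str in the middle holds the single supplementary code point U+10000,
never the two surrogates, which is why the UTF-8 result is one 4-byte
sequence.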

So if I find a code point that encodes to a surrogate pair
in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'

and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point.

I believe the utf encodings are defined as 1 to 1. If the above worked,
utf-8 would not be.
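
A rough check of this in Python 3, as a sketch only. The names `c`,
`fake`, and `rejected` are just for illustration, and the last step uses
the 'surrogatepass' error handler, which is a Python-specific escape
hatch and not standard UTF-8:

```python
c = '\N{LINEAR B SYLLABLE B038 E}'  # one code point, U+10001
fake = '\ud800\udc01'               # two surrogate code points in a str

# The real character encodes to a single 4-byte UTF-8 sequence.
real_bytes = c.encode('utf-8')
print(real_bytes)                   # b'\xf0\x90\x80\x81'

# The strict utf-8 codec refuses surrogate code points outright.
try:
    fake.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)                     # True

# Forcing it with 'surrogatepass' gives two 3-byte sequences,
# not the 4 bytes of the real character.
forced_bytes = fake.encode('utf-8', 'surrogatepass')
print(forced_bytes)                 # b'\xed\xa0\x80\xed\xb0\x81'
```

So the two strings never map to the same bytes, and the mapping stays
1 to 1.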


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
