On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: > Pete Forman <petef4+use...@gmail.com>: > >> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 >> and UTF-32. > > Also, they don't exist as Unicode code points. Python shouldn't allow > surrogate characters in strings.
Not quite. This is where it gets a bit messy and confusing. The bottom line is: surrogates *are* code points, but they aren't *characters*. Strings which contain surrogates are strictly speaking illegal, although some programming languages (including Python) allow them. The Unicode standard defines surrogates as follows: http://www.unicode.org/glossary/ - Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term. - Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point. - Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.) http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G2630 So far so good, this is clear: surrogates are not characters. Surrogate pairs are only used by UTF-16 (since that's the only UTF which uses 16-bit code units). Suppose you read two code units (four bytes) in UTF-16 (big endian): b'\xd8\x02\xdd\x00' That could be ambiguous, as it could mean: - a single code point, U+10900 PHOENICIAN LETTER ALF, encoded as the surrogate pair, U+D802 U+DD00; - two surrogate code points, U+D802 followed by U+DD00. UTF-16 definitely rejects the second alternative and categorically makes only the first valid. To ensure that is the only valid interpretation, the second is explicitly disallowed: conforming Unicode strings are not allowed to include surrogate code points, regardless of whether you are intending to encode them in UTF-32 (where there is no ambiguity) or UTF-16 (where there is). Only UTF-16 encoded bytes are allowed to include surrogate code *units*, and only in pairs. However, Python (and other languages?) take a less strict approach. In Python 3.3 and better, the code point U+10900 (Phoenician Alf) is encoded using a single four-byte code unit: 0x00010900', so there's no ambiguity. That allows Python to encode surrogates as double-byte code units with no ambiguity: the interpreter can distinguish a single 32 byte number 0x00010900 from a pair of 16 bit numbers 0x0001 0x0900 and treat the first as Alf and the second as two surrogates. By the letter of the Unicode standard, it should not do this, but nevertheless it does and it appears to do no real harm and have some benefit. The glossary also says: - High-Surrogate Code Point. A Unicode code point in the range U+D800 to U+DBFF. (See definition D71 in Section 3.8, Surrogates.) - High-Surrogate Code Unit. A 16-bit code unit in the range D800_16 to DBFF_16, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.) - Low-Surrogate Code Point. A Unicode code point in the range U+DC00 to U+DFFF. (See definition D73 in Section 3.8, Surrogates.) - Low-Surrogate Code Unit. A 16-bit code unit in the range DC00_16 to DFFF_16, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D74 in Section 3.8, Surrogates.) So we can certainly talk about surrogates being code points: the code points U+D800 through U+DFFF inclusive are surrogate code points, but not characters. They're not "non-characters" either. Unicode includes exactly 66 code points formally defined as "non-characters": - Noncharacter. A code point that is permanently reserved for internal use. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 1016), and the values U+FDD0..U+FDEF. See the FAQ on Private-Use Characters, Noncharacters and Sentinels. http://www.unicode.org/faq/private_use.html#noncharacters So even though noncharacters (with or without the hyphen) are code points reserved for internal use, and surrogates are code points reserved for the internal use of the UTF-16 encoder, and surrogates are not characters, surrogates are not noncharacters. Naming things is hard. > Thus the range of code points that are available for use as > characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code > points). > > <URL: https://en.wikipedia.org/wiki/Unicode> That is correct for a strictly conforming Unicode implementation. >> py> low = '\uDC37' > > That should raise a SyntaxError exception. If Python was strictly conforming, that is correct, but it turns out there are some useful things you can do with strings if you allow surrogates. -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list