2015-05-09 5:13 GMT+02:00 Richard Wordingham <[email protected]>:
> I can't think of a practical use for the specific concepts of Unicode
> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
> essentially the same as 16-bit strings, and Unicode 32-bit strings are
> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
> pedantry; there are more useful categories of 8-bit strings that are
> not UTF-8 strings.

And here you're wrong: a 16-bit string is just a sequence of arbitrary 16-bit code units, but a Unicode string (whatever the size of its code units) adds restrictions for validity: surrogates, when present in 16-bit strings (i.e. UTF-16), must be paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings surrogates are forbidden.

So the concept of "Unicode string" is in fact the same as valid Unicode text: it is a subset of all possible strings, restricted by validation rules:

- for 8-bit strings (UTF-8) there are further constraints: not all byte values are acceptable, some pairs of bytes are also restricted, and continuation bytes cannot occur alone;
- for 16-bit strings (UTF-16), the only constraint is on isolated/unpaired surrogates;
- for 32-bit strings (UTF-32), the only constraint is that each code unit must fall in one of the two allowed ranges of code points (U+0000..U+D7FF and U+E000..U+10FFFF).

For being "plain text" there are additional restrictions: noncharacters are also excluded, and only a small subset of controls (basically tabs and newlines) is allowed. The other controls, including U+0000, are reserved for private protocols and not intended for plain text, except in a few legacy 8-bit encoded "charsets" such as VISCII, ISO 2022 or Videotex, which actually need these controls to represent characters as sequences, possibly with contextual encoding.
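
To make those validity rules concrete, here is a minimal sketch in Python of the 16-bit and 32-bit checks described above (the function names are my own, not from any standard library):

def is_valid_utf16_units(units):
    """Check that a sequence of 16-bit code units contains no unpaired surrogate."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # lead (high) surrogate
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False       # lead surrogate not followed by a trail surrogate
            i += 2                 # well-formed surrogate pair
        elif 0xDC00 <= u <= 0xDFFF:
            return False           # isolated trail surrogate
        else:
            i += 1
    return True

def is_valid_utf32_units(units):
    """Check that every 32-bit code unit is in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return all(0x0000 <= u <= 0xD7FF or 0xE000 <= u <= 0x10FFFF for u in units)

# Examples:
assert is_valid_utf16_units([0x0041, 0xD83D, 0xDE00])   # 'A' then U+1F600 as a pair
assert not is_valid_utf16_units([0xD83D, 0x0041])       # unpaired lead surrogate
assert is_valid_utf32_units([0x1F600])                   # scalar value stored directly
assert not is_valid_utf32_units([0xD800])                # surrogate code point forbidden

The 8-bit (UTF-8) case would need a table-driven check of lead/continuation byte patterns and over-long forms, which I have left out of this sketch.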

