On Sat, 9 May 2015 02:26:59 +0200 Daniel Bünzli <[email protected]> wrote:
> On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> > Noncharacters are Unicode scalar values,
> (However noncharacters are not designed to be openly interchanged see
> "Restricted interchange" on p. 31. of 7.0.0)

That didn't stop their being openly interchanged.

> > They may both be part of a "Unicode string" which does not claim to
> > be in any given encoding form.
> Not sure what you mean by that. So I let someone else answer.

There are a number of phrases whose declared meanings cannot be deduced
from the individual words. A UTF-8, UTF-16 or UTF-32 string defines a
sequence of scalar values. However, a Unicode 8-bit, 16-bit or 32-bit
string is merely a sequence of 8-bit, 16-bit or 32-bit values, each of
which may occur in a UTF-8, UTF-16 or UTF-32 string respectively.

This definition has some odd consequences. A Unicode 32-bit string is
a UTF-32 string, because UTF-32 is not a multi-word encoding. An
arbitrary string of unsigned 32-bit values is not in general a Unicode
32-bit string. All strings of unsigned 16-bit values are Unicode 16-bit
strings. Not all (Unicode) 16-bit strings are UTF-16 strings. Not all
strings of unsigned 8-bit values are Unicode 8-bit strings, and not all
Unicode 8-bit strings are UTF-8 strings.

I can't think of a practical use for the specific concepts of Unicode
8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
essentially the same as 16-bit strings, and Unicode 32-bit strings are
UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
pedantry; there are more useful categories of 8-bit strings that are
not UTF-8 strings.

Richard.
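
To make the distinctions above concrete, here is a minimal Python sketch.
It is an illustration only, not text from the standard: the function
names are hypothetical, and each check encodes the reading given above,
namely that a "Unicode N-bit string" only requires every code unit to be
one that can occur in some well-formed UTF-N string. For 8-bit strings
that is assumed to exclude the bytes C0, C1 and F5..FF; for 32-bit
strings it is assumed to exclude surrogates and values above 10FFFF.

```python
def is_unicode_8bit_string(units):
    # Each byte must be able to occur somewhere in well-formed UTF-8:
    # 00..BF (ASCII and continuation bytes) or C2..F4 (lead bytes).
    return all(0x00 <= u <= 0xBF or 0xC2 <= u <= 0xF4 for u in units)


def is_utf8_string(units):
    # Well-formedness as a whole; Python's strict decoder rejects
    # overlong forms, encoded surrogates and stray continuation bytes.
    try:
        bytes(units).decode("utf-8")
    except ValueError:
        return False
    return True


def is_unicode_16bit_string(units):
    # Every 16-bit value can occur in UTF-16, so only the range matters.
    return all(0x0000 <= u <= 0xFFFF for u in units)


def is_utf16_string(units):
    # Well-formed UTF-16: every high surrogate is immediately followed
    # by a low surrogate, and no low surrogate stands alone.
    it = iter(units)
    for u in it:
        if not 0x0000 <= u <= 0xFFFF:
            return False
        if 0xD800 <= u <= 0xDBFF:
            low = next(it, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                return False
        elif 0xDC00 <= u <= 0xDFFF:
            return False
    return True


def is_unicode_32bit_string(units):
    # One code unit per scalar value, so this check is also the
    # UTF-32 well-formedness check.
    return all(0x0000 <= u <= 0xD7FF or 0xE000 <= u <= 0x10FFFF
               for u in units)
```

Under these assumptions a lone 0x80 is a Unicode 8-bit string but not a
UTF-8 string, a lone 0xD800 is a Unicode 16-bit string but not a UTF-16
string, and the noncharacter 0xFFFF passes all of the 16-bit and 32-bit
checks, since noncharacters are scalar values.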

