Yannis' use of the terminology "not ... a valid string in Unicode" is a little 
confusing there.

A Unicode string with the sequence, say, <U+0300, U+0061> (a combining grave 
accent, followed by "a"), is "valid" Unicode in the sense that it just consists 
of two Unicode characters in sequence. It is aberrant, certainly, but the way 
to describe that aberrancy is that the string starts with a defective combining 
character sequence (a combining mark with no base character to apply to). And 
it would be non-conformant to the standard to claim that that sequence actually 
represented (or was equivalent to) the Latin small letter a with grave ("à", U+00E0).
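
To make that concrete, here is a small Python 3 sketch of my own (not 
something from the standard or the book):

    import unicodedata

    s = "\u0300" + "a"   # <U+0300, U+0061>: starts with a defective
                         # combining character sequence
    a_grave = "\u00e0"   # U+00E0, LATIN SMALL LETTER A WITH GRAVE

    print(len(s))        # 2 -- a perfectly storable Unicode string
    # No normalization form turns the defective sequence into "à":
    print(unicodedata.normalize("NFC", s) == a_grave)   # False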

There is a second potential issue, which is whether any particular Unicode 
string is "ill-formed" or not. That issue comes up when examining actual code 
units laid out in memory in a particular encoding form. A Unicode string in 
UTF-8 encoding form could be ill-formed if the bytes don't follow the 
specification for UTF-8, for example. That is a separate issue from whether the 
string starts with a defective combining character sequence.
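
Again, just a Python 3 illustration of my own, to show that the two issues 
are independent: the defective string above still encodes to well-formed 
UTF-8, while a byte sequence that violates the UTF-8 specification is 
rejected as ill-formed when decoded.

    s = "\u0300" + "a"
    print(s.encode("utf-8"))   # b'\xcc\x80a' -- well-formed UTF-8

    bad = b"a\xcc"             # 'a' followed by a lone UTF-8 lead byte
    try:
        bad.decode("utf-8")
    except UnicodeDecodeError as err:
        print("ill-formed UTF-8:", err)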

For "defective combining character sequence", see D57 in the standard. (p. 81)

For "ill-formed", see D84 in the standard. (p. 91)

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

--Ken

> In the book, Fonts & Encodings (p. 61, first paragraph) it says:
> 
>     ... we select a substring that begins
>     with a combining character, this new
>     string will not be a valid string in
>     Unicode.
> 
> What does it mean to not be a valid string in Unicode?
> 
> /Roger
> 