Whether a string is invalid depends on what the string is supposed to be.
1. As Ken says, if a string is supposed to be in a given encoding form (UTF) but it consists of an ill-formed sequence of code units for that encoding form, it is invalid. So an isolated surrogate (e.g., 0xD800) in UTF-16, or any surrogate code point (e.g., 0x0000D800) in UTF-32, makes the string invalid. For example, a Java String may be an invalid UTF-16 string. See http://www.unicode.org/glossary/#unicode_encoding_form

2. However, a "Unicode X-bit string" does not have the same restrictions: it may contain sequences that would be ill-formed in the corresponding UTF-X encoding form. So a Java String is always a valid Unicode 16-bit string. See http://www.unicode.org/glossary/#unicode_string

3. Noncharacters are also valid in interchange, depending on the sense of "interchange". The TUS says "In effect, noncharacters can be thought of as application-internal private-use code points." If I couldn't interchange them ever, even internally to my application, or between different modules that compose my application, they'd be pointless. They are, however, strongly discouraged in *public* interchange. The glossary entry and some of the standard's text are a bit old here and need to be clarified.

4. The quotation "we select a substring that begins with a combining character, this new string will not be a valid string in Unicode" is wrong. It *is* a valid Unicode string. It isn't particularly useful in isolation, but it is valid.

For some *specific purpose*, any particular string might be invalid. For example, the string mark#d might be invalid in some systems as a password, where # is disallowed, or where passwords are required to be at least 8 characters long.

Mark
<https://plus.google.com/114199149796022210033>

— The best is the enemy of the good —


On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller <stephan.stil...@gmail.com> wrote:

>> A Unicode string in UTF-8 encoding form could be ill-formed if the bytes
>> don't follow the specification for UTF-8, for example.
> Given that answer, add "in UTF-32" to my email just now, for simplicity's
> sake. Or let's simply assume we're dealing with some sort of sequence of
> abstract integers from hex+0 to hex+10FFFF, to abstract away from "encoding
> form" issues.
>
> Stephan
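[Editor's illustration] Points 1–3 above can be sketched in Java using only the standard `java.nio.charset` APIs; the class name and the specific test strings are made up for the example. A Java String holding an unpaired surrogate is a valid Unicode 16-bit string, but it is not well-formed UTF-16, so a strict encoder rejects it; a noncharacter such as U+FFFF, by contrast, encodes and round-trips fine.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) {
        // An unpaired high surrogate is a perfectly legal Java String
        // (a valid Unicode 16-bit string, point 2) ...
        String unpaired = "\uD800";

        // ... but it is not well-formed UTF-16, so a strict encoder
        // cannot encode it (point 1).
        boolean ok = StandardCharsets.UTF_16.newEncoder().canEncode(unpaired);
        System.out.println("unpaired surrogate encodable: " + ok);  // false

        // A properly paired surrogate (here U+10400) is well-formed.
        String paired = "\uD801\uDC00";
        boolean ok2 = StandardCharsets.UTF_16.newEncoder().canEncode(paired);
        System.out.println("paired surrogates encodable: " + ok2);  // true

        // Point 3: a noncharacter such as U+FFFF is still a valid code
        // point; it encodes to UTF-8 and round-trips without loss.
        String nonchar = "\uFFFF";
        byte[] bytes = nonchar.getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("noncharacter round-trips: " + nonchar.equals(back));  // true
    }
}
```

The same distinction is why a substring operation on a Java String can never throw: any sequence of char values is a valid Unicode 16-bit string, and well-formedness only comes into play when you convert to an encoding form.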