One of the reasons why the Unicode Standard avoids the term “valid string” is 
that it immediately begs the question: valid *for what*?

The Unicode string <U+0061, U+FFFF, U+0062> is just a sequence of 3 Unicode 
characters. It is valid *for* use in internal processing, because for my own 
processing I can decide I need to use the noncharacter value U+FFFF for some 
internal sentinel (or whatever). It is not, however, valid *for* open 
interchange, because there is no conformant way by the standard (by design) for 
me to communicate to you how to interpret U+FFFF in that string. However, the 
string <U+0061, U+FFFF, U+0062> is valid *as* a NFC-normalized Unicode string, 
because the normalization algorithm must correctly process all Unicode code 
points, including noncharacters.
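
To illustrate with a quick sketch (Python and its built-in unicodedata module 
are used here purely as a stand-in; any conformant normalizer behaves the same 
way):

    import unicodedata

    s = "a\uFFFFb"                                # <U+0061, U+FFFF, U+0062>
    print(unicodedata.normalize("NFC", s) == s)   # True: NFC passes U+FFFF through untouched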

The Unicode string <U+0061, U+E000, U+0062> contains a private use character, 
U+E000. That is valid *for* open interchange, but it is not interpretable 
according to the standard itself. It requires an external agreement as to the 
interpretation of U+E000.
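
A quick way to see that a code point is private use (again a Python sketch; 
general category "Co" means Private Use):

    import unicodedata

    print(unicodedata.category("\uE000"))   # 'Co': Private Use, interpreted only by agreement
    print(unicodedata.category("\u0061"))   # 'Ll': an ordinary lowercase letter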

The Unicode string <U+0061, U+002A, U+0062> (“a*b”) is not valid *as* an 
identifier, because it contains a pattern-syntax character, the asterisk. 
However, it is certainly valid *for* use as an expression, for example.
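
As a rough sketch of that distinction (Python shown; its str.isidentifier() is 
based on the XID_Start/XID_Continue properties from UAX #31, so it is close 
to, though not identical to, the default identifier syntax):

    print("a*b".isidentifier())   # False: '*' is Pattern_Syntax, not an identifier character
    print("ab".isidentifier())    # True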

And so on up the chain of potential uses to which a Unicode string could be put.

People (and particularly programmers) should not get too hung up on the notion 
of validity of a Unicode string, IMO. It is not some absolute kind of condition 
which should be tested in code with a bunch of assert() conditions every time a 
string hits an API. That way lie bad implementations of bad code. ;-)

Essentially, most Unicode string handling APIs just pass through string 
pointers (or string objects) the same way old ASCII-based programs passed 
around ASCII strings. Checks for “validity” are only done at points where they 
make sense, and where the context is available for determining what the 
conditions for validity actually are. For example, a character set conversion 
API absolutely should check for ill-formed UTF-8, with appropriate 
error-handling, as well as check for uninterpretable conversions (mappings not 
in the table), again with appropriate error-handling.
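
A minimal sketch of those two error-handling policies (Python's built-in UTF-8 
codec used as the stand-in conversion API; the byte 0xFF can never occur in 
well-formed UTF-8):

    data = b"a\xffb"

    try:
        data.decode("utf-8")                        # strict: ill-formed input is an error
    except UnicodeDecodeError as err:
        print("ill-formed UTF-8:", err)

    print(data.decode("utf-8", errors="replace"))   # or substitute U+FFFD: 'a\ufffdb'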

But, on the other hand, an API which converts Unicode strings between UTF-8 and 
UTF-16, for example, absolutely should not – must not – concern itself with the 
presence of a defective combining character sequence. If it doesn’t convert the 
defective combining character sequence in UTF-8 into the corresponding 
defective combining character sequence in UTF-16, then the API is just broken. 
Never mind the fact that the defective combining character sequence itself 
might not then be valid *for* some other operation, say a display algorithm 
which detects that as an unacceptable edge condition and inserts a virtual base 
for the combining mark in order not to break the display.
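
A sketch of that pass-through requirement (Python again; the string begins 
with a combining acute accent and no base character, i.e. a defective 
combining character sequence):

    s = "\u0301abc"                           # combining mark with nothing to combine with
    utf16 = s.encode("utf-8").decode("utf-8").encode("utf-16-le")
    print(utf16.decode("utf-16-le") == s)     # True: the conversion preserves it exactly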

--Ken




What does it mean to not be a valid string in Unicode?

Is there a concise answer in one place? For example, if one uses the 
noncharacters just mentioned by Ken Whistler ("intended for process-internal 
uses, but [...] not permitted for interchange"), what precisely does that 
mean? Naively, all strings over the alphabet {U+0000, ..., U+10FFFF} seem 
"valid", but section 16.7 clarifies that noncharacters are "forbidden for use 
in open interchange of Unicode text data". I'm assuming there is a set of 
isValidString(...)-type ICU calls that deal with this?

Yes, I'm sure this has been asked before and the ICU documentation has an 
answer, but this page
    http://www.unicode.org/faq/utf_bom.html
contains lots of scattered factlets, and it is IMO unclear how to add them up. 
An implementation can use characters that are "invalid in interchange", but I 
wouldn't expect implementation-internal aspects of anything to be subject to 
any standard in the first place (so why write this?).

It also makes me wonder about the runtime of an algorithm that checks whether 
a Unicode string of a given length is valid. Complexity-wise the answer is of 
course "linear", but since such a check (or a variation of it, depending on 
how one treats holes and noncharacters) depends on where those special 
characters occur, how fast does it perform in practice? This also relates to 
Markus Scherer's reply to the "holes" thread just now.

Stephan
