Thanks for all the information.

Is there a most general sense in which there are constraints beyond all code points being within the range U+0000 ... U+10FFFF? If one is concerned with computer security, any violation of an absolute constraint should raise a flag; somebody could be messing with my system. Perhaps, for internal purposes, I have stored my Unicode string in an array of non-negative integers, and now I'm passing this array around. I don't know anything else about that string besides it being a Unicode string. There are no /absolute/ constraints against any of those 1114112_dec (110000_hex) code points appearing anywhere, correct? Oh wait, actually there are the surrogates (U+D800 ... U+DFFF); perhaps I need to exclude them. So what else might I have overlooked?

For example, the original C datatype named "string", as it is understood and manipulated by the C standard library, has an /absolute/ prohibition against U+0000 anywhere inside. UTF-32 has an /absolute/ prohibition against anything above 10FFFF. UTF-16 has an /absolute/ prohibition against broken surrogate pairs. (Or so is my understanding. Mark Davis mentioned "Unicode X-bit strings", but D76 (in sec. 3.9 of the standard) suggests that there is no place for surrogate values outside of an encoding form; that is, a surrogate is not a "Unicode scalar value". Perhaps "Unicode X-bit strings" should be left out of this discussion then, or I'll need to read up on this more.)
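
To make the question concrete, here is a minimal sketch of the check I have in mind for such an array of integers, assuming the only absolute constraints are the range limit and the surrogate exclusion. The function names are made up for illustration and come from no library. Noncharacters are detected separately and deliberately not rejected, since whether they should be is exactly what I am asking.

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value in the sense of D76:
    within 0..0x10FFFF and not a surrogate (0xD800..0xDFFF)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_noncharacter(cp):
    """True for the 66 noncharacters: U+FDD0..U+FDEF, plus the last
    two code points of each of the 17 planes (U+xxFFFE, U+xxFFFF)."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def first_violation(code_points):
    """Return the index of the first element that is not a scalar
    value, or None if every element passes. Noncharacters pass."""
    for i, cp in enumerate(code_points):
        if not is_scalar_value(cp):
            return i
    return None
```

With these assumptions, first_violation([0x48, 0xD800]) reports index 1, while an array containing U+0000 or U+FFFE passes.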

Mark Davis' quote ("In effect, noncharacters can be thought of as application-internal private-use code points.") would suggest that there are really no absolute constraints. I'm just checking that my understanding of the matter is correct.

Stephan
