Re: What does it mean to "not be a valid string in Unicode"?

Stephan Stiller Fri, 04 Jan 2013 18:15:47 -0800

Thanks for all the information.

Is there a most general sense in which there are constraints beyond all characters being from within the range U+0000 ... U+10FFFF? If one is concerned with computer security, oddities that are absolute should raise a flag; somebody could be messing with my system. Perhaps, for internal purposes, I have stored my Unicode string in an array of non-negative integers, and now I'm passing around this array. I don't know anything else about that string besides it being a Unicode string. There are no /absolute/ constraints against having any of those 1114112_dec (110000_hex) code points appearing anywhere, correct? Oh wait, actually there are the surrogates (D800 ... DFFF); perhaps I need to exclude them. So what else might I have overlooked? For example, the original C datatype named "string", as it is understood and manipulated by the C standard library, has an /absolute/ prohibition against U+0000 anywhere inside. UTF-32 has an /absolute/ prohibition against anything above 10FFFF. UTF-16 has an /absolute/ prohibition against broken surrogate pairs. (Or so is my understanding. Mark Davis mentioned "Unicode X-bit strings", but D76 (in sec. 3.9 of the standard) suggests that there is no place for surrogate values outside of an encoding form; that is: a surrogate is not a "Unicode scalar value". Perhaps "Unicode X-bit string" should be outside of this discussion then, or I'll need to read up on this more.)
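To make the checks in the paragraph above concrete, here is a minimal sketch in Python. It assumes the string is held as a list of non-negative integers, as described. The function names are made up for this illustration and do not come from any library. It covers only the two constraints named so far: the range limit with the surrogate exclusion (the "Unicode scalar value" of D76), and surrogate pairing in a sequence of UTF-16 code units.

```python
def is_scalar_value(cp):
    """True if cp is in U+0000..U+10FFFF and is not a surrogate (D800..DFFF)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def first_invalid_index(code_points):
    """Index of the first element that is not a scalar value, or None if all are."""
    for i, cp in enumerate(code_points):
        if not is_scalar_value(cp):
            return i
    return None


def utf16_is_well_formed(units):
    """True if every surrogate in a list of 16-bit code units is correctly paired."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            return False  # not a 16-bit code unit at all
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            return False  # low surrogate with no preceding high surrogate
        else:
            i += 1
    return True
```

For instance, first_invalid_index([0x48, 0xDFFF, 0x49]) points at the lone surrogate in position 1, while utf16_is_well_formed([0xD83D, 0xDE00]) accepts a properly paired surrogate sequence.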

Mark Davis' quote ("In effect, noncharacters can be thought of as application-internal private-use code points.") would really suggest that there are really no absolute constraints. I'm just checking that my understanding of the matter is correct.
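For reference, the noncharacters in question are the 66 code points U+FDD0 ... U+FDEF plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF). A small Python sketch for flagging them, should an application want to treat them specially at a trust boundary; the function name is hypothetical, and whether to reject them at all is a policy choice rather than something the sketch decides:

```python
def is_noncharacter(cp):
    """True if cp is one of the 66 Unicode noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: xxFFFE and xxFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```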


Stephan
