> If for example I sit on a committee that devises a new encoding form, I
> would need to be concerned with the question which *sequences of Unicode
> code points* are sound. If this is the same as "sequences of Unicode
> scalar values", I would need to exclude surrogates, if I read the standard
> correctly (this wasn't obvious to me on first inspection btw). If for
> example I sit on a committee that designs an optimized compression
> algorithm for Unicode strings (yep, I do know about SCSU), I might want to
> first convert them to some canonical internal form (say, my array of
> non-negative integers). If U+<surrogate values> can be assumed to not
> exist, there are 2048 fewer values a code point can assume; that's good
> for compression, and I'll subtract 2048 from those large scalar values in
> a first step. Etc etc. So I do think there are a number of very general
> use cases where this question arises.
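The subtraction trick in the quote can be made concrete. A minimal Python sketch (my own illustration, not anything from SCSU): map scalar values onto a gap-free range by subtracting 2048 (0x800, the size of the surrogate block) from every value above the surrogate range, so the codespace shrinks from 0..0x10FFFF to 0..0x10F7FF.

```python
SURROGATE_START, SURROGATE_END = 0xD800, 0xDFFF
SURROGATE_COUNT = SURROGATE_END - SURROGATE_START + 1  # 2048

def to_dense(scalar: int) -> int:
    """Map a Unicode scalar value onto the gap-free range 0..0x10F7FF."""
    if SURROGATE_START <= scalar <= SURROGATE_END:
        raise ValueError(f"U+{scalar:04X} is a surrogate, not a scalar value")
    return scalar if scalar < SURROGATE_START else scalar - SURROGATE_COUNT

def from_dense(dense: int) -> int:
    """Inverse mapping: re-open the surrogate gap."""
    return dense if dense < SURROGATE_START else dense + SURROGATE_COUNT
```

For example, to_dense(0xE000) is 0xD800, i.e. the first scalar value after the gap lands directly on the gap's former start, and the round trip from_dense(to_dense(v)) is the identity on scalar values.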
In fact, these questions have arisen in the past and found answers then. A present-day use case: I author a programming language and need to decide which values for <val> I accept in a statement like this:

    someEncodingFormIndependentUnicodeStringType str = <val, specified in some PL-specific way>

I've looked at the Standard, and I must admit I'm a bit perplexed. Because of C1, which explicitly states

    A process shall not interpret a high-surrogate code point or a
    low-surrogate code point as an abstract character.

I do not know why surrogate values are defined as "code points" in the first place. It seems to me that surrogates are (or should be) an encoding form-specific notion, whereas I have always thought of code points as encoding form-independent. Turns out this was wrong: I had always assumed that "code point" conceptually meant "Unicode scalar value", which is explicitly forbidden to have a surrogate value. Is this only terminological confusion?

I would like to ask: why do we need the notion of a "surrogate code point"; why isn't the notion of "surrogate code units [in some specific encoding form]" enough? Conceptually, surrogate values are byte sequences used in encoding forms (modulo endianness). Why would one define an expression ("Unicode code point") that conceptually lumps "Unicode scalar value" (an encoding form-independent notion) and "surrogate code point" (a notion that I wouldn't expect to exist outside of specific encoding forms) together? An encoding form maps only Unicode scalar values (that is, all Unicode code points excluding the "surrogate code points"), by definition. D80 and what follows ("Unicode string" and "Unicode X-bit string") exist, as I understand it, *only* so that we have terminology for discussing ill-formed code unit sequences in the various encoding forms; but all of that talk seems to me to be encoding form-dependent.
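For the language-design case above, the acceptance test for <val> reduces to one predicate. A sketch (the function names are mine; the ranges follow the Standard's definitions as I read them, D10 for "code point" and D76 for "Unicode scalar value"):

```python
def is_code_point(cp: int) -> bool:
    """Any value in the Unicode codespace, surrogates included (cf. D10)."""
    return 0 <= cp <= 0x10FFFF

def is_unicode_scalar_value(cp: int) -> bool:
    """Code points excluding the surrogate range U+D800..U+DFFF (cf. D76)."""
    return (0 <= cp <= 0xD7FF) or (0xE000 <= cp <= 0x10FFFF)
```

A language whose string type is meant to be encoding form-independent would presumably accept <val> only when is_unicode_scalar_value(val) holds; accepting everything that satisfies is_code_point is exactly the looser "Unicode string" notion discussed below.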
I think the answer to the question I had in mind is that the legal sequences of Unicode scalar values are (by definition)

    ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})*

But then there is the notion of "Unicode string", which is conceptually different, by definition. Maybe this is a terminological issue only. But is there an expression in the Standard that is defined as "sequence of Unicode scalar values", a notion that seems to me to be conceptually important? I can see that the Standard defines the various "well-formed <encoding form> code unit sequence". Have I overlooked something?

Why is it even possible to store a surrogate value in something like the icu::UnicodeString datatype? In other words, why are we concerned with storing Unicode *code points* in data structures instead of Unicode *scalar values* (which can be serialized via encoding forms)?

Stephan