Some of this is simply historical: had Unicode been designed from the start with 8- and 16-bit forms in mind, much of it could have been avoided. But that is water long under the bridge. Here is a simple example of why we have both UTFs and Unicode Strings.
Java uses Unicode 16-bit Strings. The following code copies all the code units from string to buffer:

    StringBuilder buffer = new StringBuilder();
    for (int i = 0; i < string.length(); ++i) {
        buffer.append(string.charAt(i));
    }

If Java always enforced well-formedness of strings, then:

1. The above code would break, since there is an intermediate step where buffer is ill-formed (when just the first of a surrogate pair has been copied).

2. It would involve extra checks in all of the low-level string code, with some impact on performance.

Newer implementations of strings, such as Python's, can avoid these issues because they use a Uniform Model, always dealing in code points.
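In a code-point-based style, the same copy might look roughly like the sketch below; it uses only codePointAt, appendCodePoint, and Character.charCount, and the wrapper method and its name are just for illustration:

    // Sketch: copy by code points rather than by code units.
    static String copyByCodePoints(String string) {
        StringBuilder buffer = new StringBuilder(string.length());
        for (int i = 0; i < string.length(); ) {
            int cp = string.codePointAt(i);   // reads one or two UTF-16 code units
            buffer.appendCodePoint(cp);       // an unpaired surrogate passes through as a single value
            i += Character.charCount(cp);     // advance by the number of code units consumed
        }
        return buffer.toString();
    }

Written this way, a surrogate pair is never split across iterations, so buffer never passes through an ill-formed intermediate state (unless the input itself contains an unpaired surrogate).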
For more information, see also
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

(There are many, many discussions of this in the Unicode email archives if you have more questions.)

Mark <https://plus.google.com/114199149796022210033>
"The best is the enemy of the good"

On Sat, Jan 5, 2013 at 11:14 PM, Stephan Stiller <stephan.stil...@gmail.com> wrote:

>> If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question of which *sequences of Unicode code points* are sound. If this is the same as "sequences of Unicode scalar values", I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection, btw). If for example I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If U+<surrogate values> can be assumed to not exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc etc. So I do think there are a number of very general use cases where this question arises.
>
> In fact, these questions have arisen in the past and found answers then. A present-day use case is if I author a programming language and need to decide which values for <val> I accept in a statement like this:
>
>     someEncodingFormIndependentUnicodeStringType str = <val, specified in some PL-specific way>
>
> I've looked at the Standard, and I must admit I'm a bit perplexed. Because of C1, which explicitly states
>
>     A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.
>
> I do not know why surrogate values are defined as "code points" in the first place. It seems to me that surrogates are (or should be) an encoding form–specific notion, whereas I have always thought of code points as encoding form–independent. Turns out this was wrong. I have always been thinking that "code point" conceptually meant "Unicode scalar value", which is explicitly forbidden to have a surrogate value. Is this only terminological confusion? I would like to ask: Why do we need the notion of a "surrogate code point"; why isn't the notion of "surrogate code units [in some specific encoding form]" enough? Conceptually, surrogate values are byte sequences used in encoding forms (modulo endianness). Why would one define an expression ("Unicode code point") that conceptually lumps "Unicode scalar value" (an encoding form–independent notion) and "surrogate code point" (a notion that I wouldn't expect to exist outside of specific encoding forms) together?
>
> An encoding form maps only Unicode scalar values (that is, all Unicode code points excluding the "surrogate code points"), by definition. D80 and what follows ("Unicode string" and "Unicode X-bit string") exist, as I understand it, *only* so that we have terminology for discussing ill-formed code unit sequences in the various encoding forms; but all of this talk seems to me to be encoding form–dependent.
>
> I think the answer to the question I had in mind is that the legal sequences of Unicode scalar values are (by definition)
>
>     ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})*
>
> But then there is the notion of "Unicode string", which is conceptually different, by definition. Maybe this is a terminological issue only. But is there an expression in the Standard that is defined as "sequence of Unicode scalar values", a notion that seems to me to be conceptually important? I can see that the Standard defines the various "well-formed <encoding form> code unit sequence". Have I overlooked something?
>
> Why is it even possible to store a surrogate value in something like the icu::UnicodeString datatype? In other words, why are we concerned with storing Unicode *code points* in data structures instead of Unicode *scalar values* (which can be serialized via encoding forms)?
>
> Stephan
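For concreteness, the surrogate-gap remapping mentioned in the quoted message (subtracting 2048 for values above the gap) amounts to something like the following sketch; the method names are mine, not from either message:

    // Pack the 0x110000 - 0x800 Unicode scalar values into the contiguous
    // range 0..0x10F7FF by skipping the surrogate gap U+D800..U+DFFF
    // (0x800 == 2048 values), and unpack again on the way out.
    static int packScalar(int scalarValue) {
        if (scalarValue >= 0xD800 && scalarValue <= 0xDFFF) {
            throw new IllegalArgumentException("surrogate code point: 0x" + Integer.toHexString(scalarValue));
        }
        return scalarValue < 0xD800 ? scalarValue : scalarValue - 0x800;
    }

    static int unpackScalar(int packed) {
        return packed < 0xD800 ? packed : packed + 0x800;
    }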