William Overington imagined:

> When thinking about using surrogate pairs of 16 bit unicode
> characters to express a 21 bit unicode character I like to
> think in terms of an analogy of a Medieval Great Field
> divided into strips for cultivation.

That's what freedom of thought is for: allowing people to think in the terms they prefer. :-)

William Overington asked:

> Suppose that one has a document [...] that consists of a
> sequence of unicode characters that are each more than
> 16 bits [...] all of the characters are located in the
> same strip of the great field. Suppose that there are n
> characters [...]
> Would the sequence of sixteen bit characters contain
> 2n or n+1 characters or some other number?

You always restart from the king's manor and walk down the central street each time. That is, each character has its own high and low surrogates, even if the high surrogate is the same for all characters. So you need 2n "code units" (not "characters") to encode n characters.

As you said, both approaches have their advantages and disadvantages.

The method that you suggest (which would be called a "shifted encoding", and is actually used in some Far East double-byte encodings) is clearly more economical in terms of memory usage, but it is very vulnerable. The weak point is the high surrogate code unit, which determines the interpretation of a whole sequence of low surrogate code units. You can imagine what happens if *that* very code unit gets corrupted! Your whole novel could become garbage, because the high bits of each character would be wrong.

On the other hand, Unicode's method (which is called UTF-16, by the way) may be considered redundant. But it is exactly this redundancy that makes it much more secure. In fact, if one code unit gets corrupted (either a high surrogate, a low surrogate, or a standalone code unit), it is guaranteed that exactly *one* character will be corrupted, and its neighbours stay intact.

See UTR#17 (http://www.unicode.org/unicode/reports/tr17) for more details.
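To make the 2n versus n+1 comparison concrete, here is a small Python sketch. It is only an illustration under stated assumptions: `encode_shifted` and `decode_shifted` are hypothetical helpers that model the "one high surrogate for the whole strip" idea (they are not a real or proposed encoding), the other function names are likewise made up for this example, and the four sample code points U+10400..U+10403 were chosen simply because they share one high surrogate.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, used here to mark an undecodable code unit

def encode_utf16(code_points):
    """UTF-16: every supplementary character gets its own surrogate pair."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)                    # standalone code unit
        else:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))    # high surrogate
            units.append(0xDC00 | (v & 0x3FF))  # low surrogate
    return units

def decode_utf16(units):
    """Decode UTF-16 code units; an unpaired surrogate becomes U+FFFD."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            out.append(REPLACEMENT if 0xD800 <= u <= 0xDFFF else u)
            i += 1
    return out

def encode_shifted(code_points):
    """HYPOTHETICAL shifted scheme: emit a high surrogate only when it changes.
    Assumes every code point is >= 0x10000."""
    units, current_high = [], None
    for cp in code_points:
        v = cp - 0x10000
        high = 0xD800 | (v >> 10)
        if high != current_high:
            units.append(high)
            current_high = high
        units.append(0xDC00 | (v & 0x3FF))
    return units

def decode_shifted(units):
    """HYPOTHETICAL decoder: the last high surrogate seen applies to every
    low surrogate that follows it."""
    out, high = [], None
    for u in units:
        if 0xD800 <= u <= 0xDBFF:
            high = u
        elif 0xDC00 <= u <= 0xDFFF and high is not None:
            out.append(0x10000 + ((high - 0xD800) << 10) + (u - 0xDC00))
        else:
            out.append(REPLACEMENT)
    return out

def damaged(decoded, original):
    """Count positions where the decoded text differs from the original."""
    return sum(a != b for a, b in zip(decoded, original))

chars = [0x10400, 0x10401, 0x10402, 0x10403]  # n = 4, same "strip"

utf16 = encode_utf16(chars)
shifted = encode_shifted(chars)
print(len(utf16), len(shifted))               # 8 5, i.e. 2n and n+1

# Corrupt the first code unit (a high surrogate) in both streams.
utf16[0] = 0xD802
shifted[0] = 0xD802
print(damaged(decode_utf16(utf16), chars))     # 1
print(damaged(decode_shifted(shifted), chars)) # 4
```

With the same single corrupted code unit, the UTF-16 stream loses one character while the shifted stream loses all n of them, which is the trade-off between memory and robustness described above.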
Hoping this helps.

Marco

P.S. I was surprised by your mail because, by coincidence, I have been reasoning along similar lines for a few days (although without Mediaeval feuds), balancing the pros and cons of the two methods. The tentative conclusion that I came to is that a hypothetical alternative approach would have to offer a vvvvvvvery big economy of memory to make up for the security features offered by the existing UTFs.