That's a valid computation if the extension were limited to using only 2-surrogate encodings for the supplementary planes.
If we could use 3-surrogate encodings, you'd need 3*2^n surrogates to encode 2^(3*n) new codepoints. With n=10 (as today), this requires a total of 3072 surrogates and encodes 2^30 new codepoints. This is still possible today, even though the BMP is almost full and won't allow a new range of 1024 surrogates: you can still use 2 existing surrogates to encode 2048 "hyper-surrogates" in the special plane 14 (or, for private use, in the private planes 15 and 16), which would combine with the existing low surrogates of the BMP.

This is not complicated to parse in the forward direction, but in the backward direction it means that when you see the final low surrogate, you still need to roll back to the previous position: what precedes can only be a leading high surrogate of the BMP, **or** (this would be new) another low surrogate, in which case you must step back once more to reach the leading high surrogate. This requires an extra test when starting from a random position, but at least it remains possible to know where the leading high surrogate is.

One problem with this scheme is that it is not compatible with UTF-16, because you would find a sequence like:

    <HIGH SURROGATE #1 OF THE BMP, LOW SURROGATE #2 OF THE BMP, LOW SURROGATE #3 OF THE BMP>

which UTF-16 would parse as:

    <VALID SUPPLEMENTARY CODEPOINT FROM SURROGATES(#1,#2), LOW SURROGATE #3 OF THE BMP>

The first code point is valid, but for UTF-16 working in strict mode, the trailing low surrogate is isolated: it generates an exception (encoding error). This exception could however be handled by verifying that the isolated low surrogate follows a codepoint assigned to one of the 2048 "hyper-surrogates" allocated in the special plane 14 (or privately in planes 15 or 16, in order to encode only private-use codepoints). This would no longer be valid UTF-16, but something else (say "UTF-X16").
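To make the 3-surrogate arithmetic concrete, here is a minimal Python sketch. It is purely hypothetical: the field layout and the names `encode_x16`/`decode_x16` are invented for illustration, and it models only the bit accounting of such a "UTF-X16" scheme, not actual code-unit values.

```python
# Hypothetical sketch of the 3-surrogate accounting described above;
# NOT part of any standard. Each surrogate carries n = 10 payload bits.

N = 10                        # bits carried per surrogate
SURROGATES = 3 * 2**N         # 3 ranges of 1024 code units = 3072 reserved
NEW_CODEPOINTS = 2**(3 * N)   # 2^30 = 1,073,741,824 new codepoints

def encode_x16(value):
    """Split a 30-bit value into three 10-bit fields (most significant first)."""
    assert 0 <= value < NEW_CODEPOINTS
    return ((value >> 20) & 0x3FF, (value >> 10) & 0x3FF, value & 0x3FF)

def decode_x16(hi, mid, lo):
    """Recombine the three 10-bit fields into the original value."""
    return (hi << 20) | (mid << 10) | lo

assert decode_x16(*encode_x16(123456789)) == 123456789
print(SURROGATES, NEW_CODEPOINTS)  # 3072 1073741824
```

Note that this only shows why 3072 code units buy 2^30 codepoints; the scheme's real difficulty, as discussed above, lies in backward parsing and in UTF-16 compatibility, not in the arithmetic.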
The **current bet** is that such a mechanism will **never** be needed for encoding standard codepoints, which will all fit in the existing 17 planes (even if 4 of them are almost full, a 5th will be filled significantly with sinograms, and a 6th is allocated only for special codepoints but remains almost empty); it would only be needed for encoding more private-use codepoints. But then, if this need only concerns encoding many new private codepoints, why would we need to allocate the final surrogate in the standard range? You can do the same thing by allocating the final surrogate in the private use area of the BMP. Or, equivalently, by allocating the 3 ranges of 1024 private-use surrogates directly in the BMP: the PUA of the BMP (6400 codepoints at U+E000..U+F8FF) is large enough to hold the needed 3072 private-use surrogates to support 2^30 new private-use codepoints, and it does not require any modification to the existing UTF-16.

In other words, there is no limit to the number of private-use codepoints you can encode with UTF-16. We depend, however, on the decision that the 17 planes will be enough for all standard uses (otherwise a new UTF like the "UTF-X16" above might be standardized, with limited compatibility with UTF-16).

2012/11/27 "Martin J. Dürst" <due...@it.aoyama.ac.jp>

> Well, first, it is 17 planes (or have we switched to using hexadecimal
> numbers on the Unicode list already?)
>
> Second, of course this is in connection with UTF-16. I wasn't involved
> when UTF-16 was created, but it must have become clear that 2^16 (^ denotes
> exponentiation, "to the power of") codepoints (UCS-2) wasn't going to be
> sufficient. Assuming a surrogate-like extension mechanism, with high
> surrogates and low surrogates separated for easier synchronization, one
> needs
>
>     2 * 2^n
>
> surrogate-like codepoints to create
>
>     2^(2*n)
>
> new codepoints.
>
> For doubling the number of codepoints (i.e. a total of 2 planes), one
> would use n=8, and so one needs 512 surrogate-like codepoints.
> With n=9, one gets 4 more planes for a total of 5 planes, and needs 1024
> surrogate-like codepoints. With n=10, one gets 16 more planes (for the
> current total of 17), but needs 2048 surrogate codepoints. With n=11, one
> would get 64 more planes for a total of 65 planes, but would need 4096
> surrogate-like codepoints. And so on.
>
> My guess is that when this was considered, 1,048,576 codepoints was
> thought to be more than enough, and giving up 4096 codepoints in the BMP
> was no longer possible. As an additional benefit, the 17 planes fit nicely
> into 4 bytes in UTF-8.
>
> Regards, Martin.
>
> On 2012/11/26 19:47, Shriramana Sharma wrote:
>
>> I'm sorry if this info is already in the Unicode website or book, but
>> I searched and couldn't find it in a hurry.
>>
>> When extending beyond the BMP and the maximum range of 16-bit
>> codepoints, why was it chosen to go up to 10FFFF and not any more or
>> less? Wouldn't FFFFF have been the next logical stop beyond FFFF, even
>> if FFFFFF (or FFFFFFFF) is considered too big? (I mean, I'm not sure
>> how that extra 64Ki chars [10FFFF minus FFFFF] could be important...)
>>
>> Thanks.
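The cost/benefit table in the quoted message can be checked with a few lines of Python. This is just a sketch of the formula quoted above (2*2^n surrogate-like code units buy 2^(2*n) new codepoints); the helper name `extension_cost` is invented for illustration:

```python
# Check the surrogate-extension formula from the quoted message:
# reserving 2*2^n code units yields 2^(2n) new codepoints,
# i.e. 2^(2n)/65536 extra planes.

PLANE = 2**16  # 65536 codepoints per plane

def extension_cost(n):
    surrogates = 2 * 2**n           # code units to reserve in the BMP
    new_codepoints = 2**(2 * n)     # codepoints gained
    extra_planes = new_codepoints // PLANE
    return surrogates, new_codepoints, extra_planes

for n in (8, 9, 10, 11):
    s, c, p = extension_cost(n)
    print(f"n={n:2}: {s:5} surrogates -> {c:8} new codepoints ({p} extra planes)")
```

The n=10 row reproduces actual UTF-16: 2048 surrogates (U+D800..U+DFFF) for 1,048,576 new codepoints, i.e. 16 supplementary planes on top of the BMP, for the total of 17.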