The fact is, once you dedicate the top bits in a pipe to some other purpose, you've narrowed the width of the pipe. That's what happened to those systems that implemented a 7-bit pipe for ASCII by using the top bit for other purposes.

And everybody seems to agree that when you serialize such an encoding, the 'unused' bits do indeed need to be set to 0: 0xFFF0FFFF is *not* the same as 0x0010FFFF. Only the second is the correct UTF-32 value for the largest Unicode code point.
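To make that concrete, here's a small C sketch (my own illustration, not from any particular library) that accepts a 32-bit value as well-formed UTF-32 only when it fits the code point range, which forces those 'unused' high bits to zero:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A 32-bit value is well-formed UTF-32 only if it fits the 21-bit
     * code point range (<= 0x10FFFF, which forces the 11 "unused" high
     * bits to zero) and is not a surrogate code point. */
    static bool is_well_formed_utf32(uint32_t u)
    {
        if (u > 0x10FFFF)                /* rejects 0xFFF0FFFF: high bits in use */
            return false;
        if (u >= 0xD800 && u <= 0xDFFF)  /* surrogates are not scalar values */
            return false;
        return true;
    }

    int main(void)
    {
        printf("%d\n", is_well_formed_utf32(0x0010FFFF)); /* 1: U+10FFFF */
        printf("%d\n", is_well_formed_utf32(0xFFF0FFFF)); /* 0: not UTF-32 */
        return 0;
    }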

However, even strictly internal use of the smaller number of bits, though neither illegal nor incorrect, can be *unwise*: it limits the ways such a system can later be extended to support other character sets.

Now, while ASCII was something of a minimal character set, Unicode strives to be universal. The chances of getting burned by limiting your architecture to the features of a single character set are inversely proportional to that character set's scope and coverage.

In an ideal world, Unicode would satisfy all needs, present and future, and you could build systems that only ever have to deal with Unicode. Many such systems are being built and will work quite well. However, there's always a chance that someday some other coding system(*) may need to be used in parts of your system, and you may well be happy to have kept your plumbing generically 32 bits wide.
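If it helps to picture it, "generic 32-bit plumbing" here just means passing characters around in a full 32-bit type rather than packing them into the 21 bits Unicode happens to need today. A sketch, with made-up type names:

    #include <stdint.h>

    /* Generic 32-bit plumbing: every slot carries a full 32-bit value,
       even though Unicode currently uses only 21 of those bits. */
    typedef uint32_t pipe_char;

    /* Narrowed plumbing: keeps only the bits one character set needs
       today, baking that set's limits into the architecture. */
    struct narrow_char {
        unsigned value : 21;
    };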

Call it engineer's caution, if you will.

A./



