On Feb 12, 2016, at 4:42 PM, Scott Robison <scott at casaderobison.com> wrote:
>
> I find it kind of interesting that Microsoft takes a lot
> of (deserved) flack for not adhering to standards, yet UTF-8 came about
> specifically because some didn't want to use UCS-2
...for good reason: UCS-2/UTF-16 isn't compatible with C strings. I know you know this, but it's a huge consideration. Outside of Mac OS Classic and a few even smaller enclaves, C and its calling standards were the lingua franca of the computing world when Unicode first came on the scene, and those enclaves are now all but gone. We'll be living with the legacy of C for quite a long time yet. Until C is completely stamped out, we'll have to accommodate 0-terminated strings somehow.

> Had Microsoft come up with it first, I'm sure they'd be crucified by some of
> the same people who today are critical of them for using wide characters
> instead of UTF-8!

I think if we were to send a copy of the Unicode 8.0 standard back to the early 1960s as a model for those designing ASCII, Unicode would look very different today.

I think the basic idea of UTF-8 would remain. Instead of being sold as a C-compatible encoding, we'd still have a need for it as a packed encoding. A kind of Huffman encoding for language, if you will. But I think we'd probably reorder the Unicode code points so that it packed even more densely on typical texts. Several of the ASCII punctuation characters don't deserve a place in the low 7 bits, and we could relocate the control characters, too. We could probably get all of Western Europe's characters into the lower 7 bits that way.

The next priority would be to pack the rest of the Western world's characters into the lower 11 bits: Cyrillic, Greek, Eastern European accented Latin characters, etc. That should still leave space for several other non-Asian, non-Latin scripts. Devanagari, Hebrew, Arabic: pack as many of them in as we can. We should be able to cover about half the world's population in the same space as UCS-2, while allowing most Western texts to be smaller, thoroughly outcompeting it.

UCS-2 feels like the '90s version of "640 kB is enough for everything!" to me, and UTF-16 like bank switching/segmentation.
We're going to be stuck with those half-measure decisions for decades now. Thanks, Microsoft. The POSIX platforms did the right thing here: UTF-32 when speed matters more than space, and UTF-8 when space or compatibility matters more.

> Note: I still wish [Microsoft] supported UTF-8 directly from the API.

If wishes were changes, I'd rather that all languages and platforms supported tagged UTF-8 and UTF-32 strings, with automatic conversion as necessary. Pack your strings down as UTF-8 when space matters, and unpack them as UTF-32 when speed matters. Unicode could define a sensible conversion rule set, similar to the way sign extension works when mixing integer sizes.

Since the Unicode Consortium has stated that Unicode won't grow beyond 2^21-1 code points to prevent UTF-8 from going beyond 4 bytes per character, that tag could be an all-1s upper byte. The rule could be that if you pass at least 4 bytes to a function expecting a string, the buffer length is evenly divisible by 4, and the first 32-bit word has 0xFF on either end, it's a tagged UTF-32 value. Otherwise, it's UTF-8. Simple and straightforward.

Too bad it will never happen.