On Feb 12, 2016, at 4:42 PM, Scott Robison <scott at casaderobison.com> wrote:
> 
> I find it kind of interesting that Microsoft takes a lot
> of (deserved) flack for not adhering to standards, yet UTF-8 came about
> specifically because some didn't want to use UCS-2

…for good reason.  UCS-2/UTF-16 isn't compatible with C strings.  I know you 
know this, but it's a huge consideration.  Outside of Mac OS Classic and a few 
even smaller enclaves, C and its calling standards were the lingua franca of 
the computing world when Unicode first came on the scene, and those enclaves 
are now all but gone.

We'll be living with the legacy of C for quite a long time yet.  Until C is 
completely stamped out, we'll have to accommodate 0-terminated strings somehow.
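
For anyone who hasn't hit it firsthand, here's a rough C sketch of the problem, 
using a little-endian UTF-16 buffer:

    #include <stdio.h>
    #include <string.h>

    /* The UTF-16LE encoding of "Hi" is 48 00 69 00: the embedded 0x00
     * bytes stop every byte-oriented C string routine at the first
     * character, so none of the classic str* functions can handle it. */
    int main(void)
    {
        const char utf16le_hi[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

        printf("%zu\n", strlen(utf16le_hi));   /* prints 1, not 4 */
        return 0;
    }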

> Had Microsoft come up with it first, I'm sure they'd be crucified by some of
> the same people who today are critical of them for using wide characters
> instead of UTF-8!

I think if we were to send a copy of the Unicode 8.0 standard back to the early 
1960s as a model for those designing ASCII, Unicode would look very different 
today.

I think the basic idea of UTF-8 would remain.  Instead of being sold as a 
C-compatible encoding, we'd still have a need for it as a packed encoding.  A 
kind of Huffman encoding for language, if you will.
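
A quick illustration of that packing, counting the UTF-8 bytes for a few 
representative characters:

    #include <stdio.h>
    #include <string.h>

    /* UTF-8 spends fewer bytes on the characters Western texts use most:
     * 1 byte for ASCII, 2 for accented Latin, 3 for CJK, 4 for the rest. */
    int main(void)
    {
        const char *ascii = "A";                  /* U+0041          */
        const char *latin = "\xC3\xA9";           /* U+00E9, e-acute */
        const char *cjk   = "\xE6\x97\xA5";       /* U+65E5          */
        const char *emoji = "\xF0\x9F\x98\x80";   /* U+1F600         */

        printf("%zu %zu %zu %zu\n", strlen(ascii), strlen(latin),
               strlen(cjk), strlen(emoji));       /* prints: 1 2 3 4 */
        return 0;
    }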

But, I think we'd probably reorder the Unicode code points so that it 
packed even more densely on typical texts.  Several of the ASCII punctuation 
characters don't deserve a place in the low 7 bits, and we could relocate the 
control characters, too.  We could probably get all of Western Europe's 
characters into the low 7 bits that way.

The next priority would be to pack the rest of the Western world's characters 
into the lower 11 bits.  Cyrillic, Greek, Eastern European accented Latin 
characters, etc.

That should still leave space for several other non-Asian, non-Latin character 
sets.  Devanagari, Hebrew, Arabic: pack as many of them in as we can.  We should 
be able to cover about half the world's population in the same space as UCS-2, 
while allowing most Western texts to be smaller, thoroughly outcompeting it.

UCS-2 feels like the '90s version of "640 kB is enough for everything!" to me, 
and UTF-16 like bank switching/segmentation.  We're going to be stuck with 
those half-measure decisions for decades now.  Thanks, Microsoft.

The POSIX platforms did the right thing here: UTF-32 when speed matters more 
than space, and UTF-8 when space or compatibility matters more.
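
For what it's worth, that split is already easy to use on a typical POSIX 
system, where wchar_t is 32 bits; this sketch assumes a UTF-8 locale is 
available and skips most error handling:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    /* Keep text packed as UTF-8, and unpack to wchar_t (UTF-32 on glibc
     * and most other POSIX C libraries) when fixed-width code points
     * make processing simpler or faster. */
    int main(void)
    {
        setlocale(LC_ALL, "");               /* use the environment's UTF-8 locale */

        const char *utf8 = "na\xC3\xAFve";   /* "naive" with a diaeresis: 6 bytes */
        size_t n = mbstowcs(NULL, utf8, 0);  /* count code points first */
        if (n == (size_t)-1)
            return 1;

        wchar_t *wide = malloc((n + 1) * sizeof *wide);
        if (!wide)
            return 1;
        mbstowcs(wide, utf8, n + 1);         /* unpack to fixed-width code points */

        printf("%zu UTF-8 bytes -> %zu code points\n", strlen(utf8), n);
        free(wide);
        return 0;
    }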

> Note: I still wish [Microsoft] supported UTF-8 directly from the API.

If wishes were changes, I'd rather that all languages and platforms supported 
tagged UTF-8 and UTF-32 strings, with automatic conversion as necessary.  Pack 
your strings down as UTF-8 when space matters, and unpack them as UTF-32 when 
speed matters.  Unicode could define a sensible conversion rule set, similar to 
the way sign extension works when mixing integer sizes.

Since the Unicode Consortium has stated that Unicode will never grow beyond 
U+10FFFF, which keeps every code point within 21 bits and UTF-8 within 4 bytes 
per character, that tag could be an all-1s upper byte.  The rule could be that 
if you pass at least 4 bytes to a function expecting a string, the buffer 
length is evenly divisible by 4, and the first 32-bit word has 0xFF on either 
end, it's a tagged UTF-32 value.  Otherwise, it's UTF-8.
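
A sketch of that test in C (the function is invented here purely for 
illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical check for the tagging rule described above: at least
     * 4 bytes, length a multiple of 4, and 0xFF at either end of the
     * first 32-bit word.  Since 0xFF can never occur in valid UTF-8,
     * the test cannot misfire on ordinary UTF-8 text. */
    static bool is_tagged_utf32(const uint8_t *buf, size_t len)
    {
        if (len < 4 || len % 4 != 0)
            return false;
        return buf[0] == 0xFF || buf[3] == 0xFF;   /* either byte order */
    }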

Simple and straightforward.

Too bad it will never happen.
