On Fri, Feb 12, 2016 at 8:05 PM, Warren Young <wyml at etr-usa.com> wrote:

> On Feb 12, 2016, at 4:42 PM, Scott Robison <scott at casaderobison.com>
> wrote:
> >
> > I find it kind of interesting that Microsoft takes a lot
> > of (deserved) flack for not adhering to standards, yet UTF-8 came about
> > specifically because some didn't want to use UCS-2
>
> …for good reason.  UCS-2/UTF-16 isn't compatible with C strings.  I know
> you know this, but it's a huge consideration.  Outside of Mac OS Classic
> and a few even smaller enclaves, C and its calling standards were the
> lingua franca of the computing world when Unicode first came on the scene,
> and those enclaves are now all but gone.
>
> We'll be living with the legacy of C for quite a long time yet.  Until C
> is completely stamped out, we'll have to accommodate 0-terminated strings
> somehow.
>
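(A tiny illustration of the C string point, mine rather than Warren's: in
UTF-16 every ASCII character carries a 0x00 byte in its upper half, so any
byte-oriented routine that treats 0 as the terminator gives up almost
immediately.)

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Hi" in UTF-16LE: 'H' = 48 00, 'i' = 69 00, plus a 16-bit
         * terminator.  strlen() stops at the first 0x00 byte. */
        const char utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

        printf("strlen() reports %lu byte(s)\n",
               (unsigned long)strlen(utf16le));
        return 0;
    }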

UCS (which was by definition a 2-byte encoding; UCS-2 is a retronym) was
not a "standard" until late 1991. C89/C90 provided for a wide character
type (wchar_t). People didn't want to use it, or perhaps couldn't, and I
can understand why. My point was just that Microsoft was the first to
really embrace the standard as written rather than tweak it into something
else.
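As a concrete aside (my sketch, not anything from the original exchange):
the standard defines wchar_t but deliberately leaves its width to the
implementation, and in practice it is 16 bits on Windows and 32 bits on
most POSIX systems.

    /* Minimal C sketch: inspect the implementation-defined width of
     * wchar_t.  wchar_t itself dates to C89/C90 (via <stddef.h>); the
     * wide string functions such as wcslen() arrived with the 1995
     * amendment. */
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        wchar_t greeting[] = L"hello";  /* wide string literal */

        printf("sizeof(wchar_t)  = %lu bytes\n",
               (unsigned long)sizeof(wchar_t));
        printf("wcslen(greeting) = %lu characters\n",
               (unsigned long)wcslen(greeting));
        return 0;
    }

Typically that reports 2 bytes under MSVC and 4 bytes under glibc or macOS.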

Windows bought into the idea of Unicode and/or UCS before the two were
unified and standardized into their current form. That locked Windows into
what we now call UCS-2, back when Unicode was "guaranteed" never to need
more than 2^16 code points. The later unification of the two standards
expanded the potential code point space to U+7FFFFFFF, and later still it
was restricted to U+10FFFF so that UTF-16 could address every potential
standard code point.
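To make the U+10FFFF restriction concrete, here is a small sketch (mine,
not from the thread) of the UTF-16 surrogate mechanism: each of the two
surrogates carries 10 bits, so a pair can reach 0x10000 + 2^20 - 1 =
0x10FFFF and no further, which is exactly where the code point space was
capped.

    /* Encode a single code point as UTF-16 code units. */
    #include <stdio.h>
    #include <stdint.h>

    static int encode_utf16(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {                 /* BMP: one 16-bit unit */
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                      /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }

    int main(void)
    {
        uint16_t u[2] = { 0, 0 };
        int n = encode_utf16(0x1F600, u);   /* a code point above U+FFFF */
        printf("%d unit(s): 0x%04X 0x%04X\n", n, u[0], u[1]);
        return 0;
    }

That prints "2 unit(s): 0xD83D 0xDE00", the surrogate pair for U+1F600.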


> > Had Microsoft come up with it first, I'm sure they'd be crucified by
> some of
> > the same people who today are critical of them for using wide characters
> > instead of UTF-8!
>
> I think if we were to send a copy of the Unicode 8.0 standard back to the
> early 1960s as a model for those designing ASCII, Unicode would look very
> different today.
>

I think you're probably correct. Though who knows? In the early 1960s the
industry still hadn't really agreed on 8-bit bytes. Memory was expensive,
and you did what you had to do to minimize its use: 6-bit bytes and
characters, 2-digit year encodings. A lot of people today just can't
imagine caring that much about RAM (given how much of it is used to share
pictures of kittens), but those savings translated to real money.


> UCS-2 feels like the 90's version of "640 kB is enough for everything!" to
> me, and UTF-16 like bank switching/segmentation.  We're going to be stuck
> with those half-measure decisions for decades now.  Thanks, Microsoft.
>

Thanks, Unicode / ISO 10646. They set the standard. Microsoft adopted it.


> The POSIX platforms did the right thing here: UTF-32 when speed matters
> more than space, and UTF-8 when space or compatibility matters more.
>

They had the luxury of waiting until UTF-8 and UCS-4 (now UTF-32) existed
before making those decisions. 20/20 hindsight.

Note: I like UTF-8. I try to use it everywhere and only convert as needed
to suit the API. I certainly think Microsoft has had plenty of time to more
thoroughly integrate UTF-8 into the APIs so that you don't have to convert
back and forth. I just find it funny that Microsoft is condemned by so many
for adhering to the standards / draft standards while POSIX systems were
able to embrace and extend. :)
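For what it's worth, the kind of conversion I mean looks roughly like the
sketch below (mine, with error handling trimmed): keep UTF-8 internally
and widen to UTF-16 only at the Win32 boundary, using the usual two-call
MultiByteToWideChar pattern to size the buffer.

    #include <windows.h>
    #include <stdlib.h>

    /* Convert a NUL-terminated UTF-8 string to a freshly malloc'd
     * UTF-16 string (caller frees). Returns NULL on failure. */
    wchar_t *utf8_to_wide(const char *utf8)
    {
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (n <= 0)
            return NULL;
        wchar_t *wide = malloc((size_t)n * sizeof(wchar_t));
        if (wide != NULL)
            MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
        return wide;
    }

You end up writing (or borrowing) that shim once and calling it at every
wide API boundary, which is exactly the back-and-forth I'd rather not need.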

-- 
Scott Robison
