On Fri, Feb 12, 2016 at 8:05 PM, Warren Young <wyml at etr-usa.com> wrote:
> On Feb 12, 2016, at 4:42 PM, Scott Robison <scott at casaderobison.com> wrote:
> >
> > I find it kind of interesting that Microsoft takes a lot of (deserved)
> > flak for not adhering to standards, yet UTF-8 came about specifically
> > because some didn't want to use UCS-2
>
> ...for good reason. UCS-2/UTF-16 isn't compatible with C strings. I know
> you know this, but it's a huge consideration. Outside of Mac OS Classic
> and a few even smaller enclaves, C and its calling standards were the
> lingua franca of the computing world when Unicode first came on the
> scene, and those enclaves are now all but gone.
>
> We'll be living with the legacy of C for quite a long time yet. Until C
> is completely stamped out, we'll have to accommodate 0-terminated strings
> somehow.

UCS (which was by definition a 2-byte encoding; UCS-2 is a retronym) was
not a "standard" until late 1991. C89/C90 provided for the definition of a
wide character type. People didn't want to (perhaps couldn't) use it, and
I can understand why. My point was just that Microsoft was the first to
really embrace the standard as written, not tweak it into something else.

Windows bought into the idea of Unicode and/or UCS before they were
unified and standardized into their current form. That locked Windows into
what we call the UCS-2 format, back when Unicode was "guaranteed" to never
need more than 2^16 code points. Later unification of the two standards
expanded the potential code point space to U+7FFFFFFF, and later still
restricted it to U+10FFFF to ensure that UTF-16 could address all of the
potential standard code points.

> > Had Microsoft come up with it first, I'm sure they'd be crucified by
> > some of the same people who today are critical of them for using wide
> > characters instead of UTF-8!
>
> I think if we were to send a copy of the Unicode 8.0 standard back to the
> early 1960s as a model for those designing ASCII, Unicode would look very
> different today.

I think you're probably correct.
Though who knows. The industry still hadn't really agreed on 8-bit bytes.
Memory was expensive, and you did what you had to to minimize its use:
6-bit bytes/characters, 2-digit year encodings. A lot of people today just
can't imagine caring that much about RAM (given how much of it is used to
share pictures of kittens), but it was a significant savings that
translated to real money.

> UCS-2 feels like the 90's version of "640 kB is enough for everything!"
> to me, and UTF-16 like bank switching/segmentation. We're going to be
> stuck with those half-measure decisions for decades now. Thanks,
> Microsoft.

Thanks, Unicode / ISO-10646. They set the standard. Microsoft adopted it.

> The POSIX platforms did the right thing here: UTF-32 when speed matters
> more than space, and UTF-8 when space or compatibility matters more.

They had the luxury of waiting until UTF-8 and UCS-4 (now UTF-32) existed
before making those decisions. 20/20 hindsight.

Note: I like UTF-8. I try to use it everywhere and only convert as needed
to suit the API. I certainly think Microsoft has had plenty of time to
more thoroughly integrate UTF-8 into their APIs so that you don't have to
convert back and forth. I just find it funny that Microsoft is condemned
by so many for adhering to the standards / draft standards while POSIX
systems were able to embrace and extend. :)

--
Scott Robison