just a hypothesis, but i'm guessing that at the time they put this
together, both major platforms (win32 and java) dealing with DOM used
ucs-2 (and now use utf-16) internally.
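fwiw, here's a quick java sketch (the class name is made up, but the
library calls are standard) of what that 16-bit legacy looks like in
practice: String.length() counts utf-16 code units, not characters, so
anything outside the BMP takes two of them:

    public class Utf16Units {
        public static void main(String[] args) {
            // U+10400 is stored as the surrogate pair D801 DC00,
            // i.e. two 16-bit code units
            String s = "\uD801\uDC00";
            System.out.println(s.length());                      // 2 code units
            System.out.println(s.codePointCount(0, s.length())); // 1 character
        }
    }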

even today, win32 and java mostly do not use utf-8. the only form
widely supported outside of linux and unix systems is utf-8 prefixed
with a fictitious "byte order mark" (obviously useless for a
byte-oriented encoding), which is of course incompatible with tools
used on unix and linux, and with many web browsers. Notepad uses this
form, and Java uses a bunch of incompatible utf-8 "extensions" in its
serializations (NUL encoded as an overlong two-byte sequence, and
plane 1 through plane 16 encoded as pairs of utf-8 sequences
corresponding to the individual surrogate codes). unfortunately this
is perpetuated in several network protocols, and it's also what you
end up speaking when interfacing to Oracle or MySQL.
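you can see the java behavior for yourself with
DataOutputStream.writeUTF, which emits this "modified utf-8" (the demo
class name below is made up; the bytes in the comments are what the
modified-utf-8 spec says it should produce):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            // U+0000 followed by U+10400 (a plane-1 character)
            out.writeUTF("\u0000\uD801\uDC00");
            byte[] bytes = buf.toByteArray();
            // skip the 2-byte length prefix, dump the payload
            for (int i = 2; i < bytes.length; i++)
                System.out.printf("%02X ", bytes[i] & 0xFF);
            System.out.println();
            // prints:              C0 80 ED A0 81 ED B0 80
            // real utf-8 would be: 00 F0 90 90 80
        }
    }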

even on mac os x, where utf-8 is the encoding used for unix-type
filesystem access, it's still not the default text encoding in
TextEdit, and utf-8 text files "don't work" (i.e. they open as
MacRoman or whatever Mac* encoding is paired with the OS language).
fortunately this is configurable; unfortunately changing it breaks all
sorts of other stuff (apps frequently still ship with MacRoman README
files, etc.)

so basically, if you want it to work i recommend switching to linux,
unix, plan 9, or similar :(

On 3/17/07, Christopher Fynn <[EMAIL PROTECTED]> wrote:
Colin Paul Adams wrote:

>>>>>> "Rich" == Rich Felker <[EMAIL PROTECTED]> writes:
>
>     Rich> Indeed, this was what I was thinking of. Thanks for
>     Rich> clarifying. BTW, any idea WHY they brought the UTF-16
>     Rich> nonsense to DOM/DHTML/etc.?

> I don't know for certain, but I can speculate well, I think.

> DOM was a micros**t invention (and how it shows!). NT was UCS-2
> (effectively).

AFAIK Unicode was originally only planned to be a 16-bit encoding.
The Unicode Consortium and ISO 10646 then agreed to synchronize the
two standards - though originally Unicode was only going to be a 16-bit
subset of the UCS. A little after that Unicode decided to support UCS
characters beyond plane 0.

Anyway, at the time NT was being designed (late eighties) Unicode was
supposed to be limited to < 65536 characters and UTF-8 hadn't been
thought of, so 16 bits probably seemed like a good idea.



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/