In a message dated 2001-06-18 12:56:47 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
> As a matter of fact, Oracle supported UTF-8 well before surrogates or the
> 4-byte encoding were introduced. As a database vendor, Oracle took full
> advantage of Unicode, and was also a victim of Unicode in the sense of
> compatibility. A database has no font or IME burden when it stores
> Unicode on its server. Oracle supported a very early version of Unicode
> in its Oracle 7 release as the database character set AL24UTFFSS, which
> means a 3-byte encoding for UTF-FSS. When Unicode reached version 2.1,
> we found that AL24UTFFSS had trouble with 2.1 because of the Hangul
> reallocation, and we could not simply update AL24UTFFSS to the 2.1
> definition, as that would corrupt existing users' data. So we came up
> with a new character set, UTF8, which is still a 3-byte encoding and
> supports Unicode 2.1. The choice of a 3-byte encoding was also bound to
> the AL24UTFFSS implementation, so that nothing would break when users
> migrated from AL24UTFFSS to UTF8.
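
For concreteness: every character assigned in Unicode 2.1 lies below
U+10000, so a 3-byte-maximum encoder covers all of them. Here is a
hypothetical sketch in C (my own illustration, not Oracle's actual code)
of what such an encoder implies once supplementary characters arrive as
UTF-16 surrogate pairs: each 16-bit code unit is encoded separately, so a
supplementary character comes out as two 3-byte sequences, six bytes
total, and a 4-byte sequence is never emitted.

    /* Hypothetical sketch of a 3-byte-per-code-unit encoder.
       Each 16-bit value is encoded on its own, including surrogates. */
    #include <stdio.h>

    static int encode16(unsigned int u, unsigned char out[3])
    {
        if (u < 0x80) {                 /* 1 byte: 0xxxxxxx */
            out[0] = (unsigned char)u;
            return 1;
        } else if (u < 0x800) {         /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (unsigned char)(u >> 6);
            out[1] = 0x80 | (unsigned char)(u & 0x3F);
            return 2;
        } else {                        /* 3 bytes, surrogates included */
            out[0] = 0xE0 | (unsigned char)(u >> 12);
            out[1] = 0x80 | (unsigned char)((u >> 6) & 0x3F);
            out[2] = 0x80 | (unsigned char)(u & 0x3F);
            return 3;
        }
    }

    int main(void)
    {
        /* U+10000 as a UTF-16 surrogate pair: D800 DC00 */
        unsigned int pair[2] = { 0xD800, 0xDC00 };
        unsigned char buf[3];
        for (int i = 0; i < 2; i++) {
            int n = encode16(pair[i], buf);
            for (int j = 0; j < n; j++)
                printf("%02X ", buf[j]);
        }
        printf("\n");  /* prints: ED A0 80 ED B0 80 -- six bytes */
        return 0;
    }

Note that this six-byte form is not the byte sequence standard UTF-8
produces for U+10000, which is the heart of the compatibility question.
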
The Hangul mess took place with Unicode 2.0, not 2.1. And this is a red
herring anyway when we are talking about UTF-8. As stated before, UTF-8 has
never changed even though the Unicode beneath it has changed:
* by moving the Hangul block in version 2.0
* by creating the UTF-16 mechanism to support surrogates in 1993 (not 2001)
The mechanism in UTF-8 to encode characters from U+10000 to U+10FFFF
(actually U+1FFFFF) in 4 bytes was part of the original FSS-UTF specified in
1992. Check the records. It was never "added on" at some later date,
causing existing conformant UTF-8 to break. If Oracle or any other vendor
or developer originally interpreted UTF-8 as using a maximum of 3 bytes to
encode a character, that was either a misreading of the specification or a
deliberate subsetting of the problem. In either case, such a company cannot
claim to be a "victim of Unicode" when it has implemented a clearly
specified Unicode standard incorrectly.
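
For the record, the complete byte layout fits in a few lines of C. This is
a minimal sketch of my own (the helper name utf8_encode is mine, not text
from the 1992 specification); the final branch is the 4-byte form covering
U+10000 through U+1FFFFF, present from the start:

    #include <stdio.h>

    /* Encode one scalar value as UTF-8; assumes cp is a valid input. */
    static int utf8_encode(unsigned long cp, unsigned char out[4])
    {
        if (cp < 0x80) {                /* 1 byte: 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {        /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (unsigned char)(cp >> 6);
            out[1] = 0x80 | (unsigned char)(cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {      /* 3 bytes: 1110xxxx 10xxxxxx ... */
            out[0] = 0xE0 | (unsigned char)(cp >> 12);
            out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[2] = 0x80 | (unsigned char)(cp & 0x3F);
            return 3;
        } else {                        /* 4 bytes: 11110xxx 10xxxxxx ... */
            out[0] = 0xF0 | (unsigned char)(cp >> 18);
            out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
            out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[3] = 0x80 | (unsigned char)(cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x10000UL, buf); /* first supplementary char */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        printf("\n");                        /* prints: F0 90 80 80 */
        return 0;
    }
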
-Doug Ewell
Fullerton, California