On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> Dear Tim,
> 
> >"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
> >except for its representation of supplementary characters. In CESU-8,
> >supplementary characters are represented as six-byte sequences
> >resulting from the transformation of each UTF-16 surrogate code
> >unit into an eight-bit form similar to the UTF-8 transformation, but
> >without first converting the input surrogate pairs to a scalar value."
> >
> >Yes, that sounds like it.  But see my quote from Oracle docs in my
> >reply to Lincoln's email to make sure.
> >
> >(I presume it dates from before UTF16 had surrogate pairs. When
> >they were added to UTF16 they gave a name "CESU-8" to what old UTF16
> >to UTF8 conversion code would produce when given surrogate pairs.
> >A classic standards maneuver :)
> 
> IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of 
> Unicode) because they were storing higher plane codes using the 
> surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 
> 2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single 
> char of 4+ bytes. There is no real trouble doing it that way since 
> anyone can convert between the 'wrong' UTF-8 and the correct form. But 
> they found that if you do a simple binary based sort of a string in 
> AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly 
> different order. On this basis they made a request to the UTC to have 
> AL32UTF8 added as a kludge and out of the kindness of their hearts the 
> UTC agreed thus saving Oracle from a whole heap of work. But all are 
> agreed that UTF-8 and not AL32UTF8 is the way forward.

Um, now you've confused me.

The Oracle docs say "In AL32UTF8, one supplementary character is
represented in one code point, totalling four bytes." which you
say is the "correct UTF-8 way". So the old Oracle ``UTF8'' charset
is what's now called "CESU-8" and what Oracle call ``AL32UTF8''
is the "correct UTF-8 way".

So did you mean CESU-8 when you said AL32UTF8?
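
To make the distinction concrete, here is a minimal Python sketch of the
two encodings as described in the CESU-8 text quoted above. The helper
name cesu8_encode is made up for illustration (it is not an Oracle or
Python library function), and it assumes the input contains no lone
surrogates. It also shows the binary sort-order difference mentioned
above, using U+FF21 and U+10000 as arbitrary sample characters.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    BMP characters are encoded exactly as in UTF-8. A supplementary
    character is first split into its UTF-16 surrogate pair, and each
    surrogate is then encoded as a three-byte sequence (six bytes total).
    Assumes the input has no lone surrogates.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
            continue
        # Split the scalar value into a UTF-16 surrogate pair.
        cp -= 0x10000
        high = 0xD800 | (cp >> 10)
        low = 0xDC00 | (cp & 0x3FF)
        # Encode each surrogate with the three-byte UTF-8 bit pattern.
        for unit in (high, low):
            out += bytes((
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ))
    return bytes(out)


supplementary = "\U00010000"   # first character beyond the BMP
bmp_high = "\uFF21"            # a BMP character above the surrogate range

utf8_bytes = supplementary.encode("utf-8")   # f0 90 80 80 (4 bytes)
cesu8_bytes = cesu8_encode(supplementary)    # ed a0 80 ed b0 80 (6 bytes)

# Binary sort order differs: UTF-8 sorts in code point order,
# CESU-8 sorts in UTF-16 code unit order.
sample = [bmp_high, supplementary]
utf8_order = sorted(sample, key=lambda s: s.encode("utf-8"))
cesu8_order = sorted(sample, key=cesu8_encode)
```

Under UTF-8 the BMP character sorts first; under CESU-8 the
supplementary character sorts first, because its bytes start with 0xED
rather than 0xF0.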

Tim.
