Re: Unicode forms for internal storage - BOCU-1 speed

jcowan Thu, 22 Jan 2004 12:50:14 -0800

Markus Scherer scripsit:

> UTF-8 is useful because it's simple, and supported just about everywhere - 
> but it's otherwise hardly optimal for anything.


You entirely omit its principal advantage, sine qua non:  it's maximally
ASCII-compatible, using bytes 0x00 to 0x7F to represent ASCII characters and
nothing else.
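
A quick way to see that property, as a minimal sketch: the helper name
ascii_transparent and the sample string are my own, chosen only for
illustration. It checks that each ASCII character encodes to its own single
byte, and that no byte below 0x80 turns up inside the encoding of anything
else.

```python
def ascii_transparent(text):
    """Return True if UTF-8 uses bytes 0x00-0x7F only for ASCII characters.

    Hypothetical helper for illustration; it inspects one character at a time.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # An ASCII character must encode to exactly its own byte value.
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 0x80 for b in encoded):
            # A non-ASCII character must never contain an ASCII-range byte.
            return False
    return True


sample = "plain ASCII, caf\u00e9, \u65e5\u672c\u8a9e, \U0001F600"
print(ascii_transparent(sample))
```

This is why byte-oriented tools that only look for ASCII delimiters (slash,
newline, NUL) can be pointed at UTF-8 text without being taught anything new.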

Mark Crispin's UTF-9 (not to be confused with Jerome Abela's) is also
excellent, although it only makes sense on 36-bit systems, which most of
us don't have.  A precis:

Code points (base 2)            UTF-9 code units (base 2)
0000000000000abcdefgh           0abcdefgh
00000abcdefghijklmnop           1abcdefgh 0ijklmnop
abcdefghijklmnopqrstu           1000abcde 1fghijklm 0nopqrstu

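This is almost as good as Latin-1 for its repertoire (9 bits per character
instead of 8), only minutely worse than UTF-16 for the rest of the BMP
(18 bits instead of 16), and beats all other encodings for the other
planes (27 bits, against the 32 that UTF-8 and UTF-16 need).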
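
The table can be sketched in a few lines, assuming each 9-bit code unit is
held in an ordinary integer (real 36-bit hardware would pack four to a word).
The function names utf9_encode and utf9_decode are mine, not part of any
library, and this is only an illustration of the bit layout above: the top
bit of each unit is a continuation flag and the low 8 bits carry payload.

```python
def utf9_encode(code_point):
    """Encode one code point as a list of 9-bit units (ints in 0..0x1FF)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    # Split into 8-bit groups, least significant first, dropping leading zeros.
    groups = []
    while True:
        groups.append(code_point & 0xFF)
        code_point >>= 8
        if code_point == 0:
            break
    groups.reverse()
    # Set the continuation bit (0x100) on every unit except the last.
    return [g | 0x100 for g in groups[:-1]] + [groups[-1]]


def utf9_decode(units):
    """Decode a sequence of 9-bit units back into a list of code points."""
    code_points = []
    value = 0
    for unit in units:
        value = (value << 8) | (unit & 0xFF)
        if not unit & 0x100:  # continuation bit clear: last unit of this character
            code_points.append(value)
            value = 0
    return code_points


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    units = utf9_encode(cp)
    print(hex(cp), [format(u, "09b") for u in units], len(units) * 9, "bits")
```

The loop prints one, one, two and three code units for the four samples,
which is where the 9, 18 and 27 bit figures come from.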

-- 
John Cowan                              <[EMAIL PROTECTED]>
http://www.ccil.org/~cowan              http://www.reutershealth.com
                Charles li reis, nostre emperesdre magnes,
                Set anz totz pleinz ad ested in Espagnes.
