Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> Only the encoder may be a bit complex to write (if one wants to
> generate the optimal smallest result size), but even a moderate
> programmer could find a simple and working scheme with a still
> excellent compression rate (around 1 to 1.2 bytes per character on
> average for any Latin text, and around 1.2 to 1.5 bytes per character
> for Asian texts, which would still be a good application of SCSU
> compared to UTF-32 or even UTF-8).

If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or
Yi syllables, you have no chance of doing better than 2 bytes per
character.  This is because it is not possible in SCSU to set a dynamic
window to any range between U+3400 and U+DFFF, where these characters
reside.  Such a window would be of little use anyway, because real-world
texts using these characters would draw from so many windows that
single-byte mode would be less efficient than Unicode mode, where 2
bytes per character is the norm.  Of course, this is still better than
UTF-32 or UTF-8 for these characters.
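
For anyone who wants to check that claim, here is a rough Python
sketch of the dynamic-window offsets as I read them from the offset
table in UTS #6 (an illustration only, not a reference
implementation):

    # Offsets a dynamic window can legally be set to, per UTS #6.
    offsets = set()
    offsets.update(x * 0x80 for x in range(0x01, 0x68))           # U+0080..U+3380
    offsets.update(x * 0x80 + 0xAC00 for x in range(0x68, 0xA8))  # U+E000..U+FF80
    offsets.update([0x00C0, 0x0250, 0x0370, 0x0530,
                    0x3040, 0x30A0, 0xFF60])                      # special values F9..FF
    offsets.update(0x10000 + w * 0x80 for w in range(0x2000))     # SDX: supplementary only

    # Each window covers offset..offset+0x7F.  None of them can touch
    # U+3400..U+DFFF, so CJK, Hangul and Yi fall back to Unicode mode.
    assert not any(off <= 0xDFFF and 0x3400 <= off + 0x7F for off in offsets)
    print(hex(max(o for o in offsets if o < 0xE000)))             # 0x3380

The highest window offset available below U+E000 is 0x3380, so
everything from U+3400 through U+DFFF is simply out of reach of
single-byte mode.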

For Katakana and Hiragana, you can get the same efficiency with SCSU
as for other small scripts, but very few texts are written in pure
kana except those intended for young children.
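
As a back-of-the-envelope illustration (my own arithmetic, assuming
the encoder defines one window at U+30A0 and stays in single-byte
mode):

    # Rough cost model, not an encoder: one SDn tag plus one offset
    # byte to define and select a katakana window, then one byte per kana.
    def kana_bytes_scsu(n):          return n + 2
    def kana_bytes_unicode_mode(n):  return 2 * n

    print(kana_bytes_scsu(1000), kana_bytes_unicode_mode(1000))   # 1002 vs 2000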

Sorry for missing this point in my earlier post.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
 (*) No, I'm not interested in arguing over this word.


