Re: Ternary search trees for Unicode dictionaries

Doug Ewell Sun, 23 Nov 2003 14:45:41 -0800

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> For Korean text, I have found that representation with "defective"
> syllables was performing better through SCSU. I mean here decomposing
> the TLV syllables of the NFC form into T and LV, and TL into T and L,
> i.e. with partial decomposition.
> ...
> With this constraint, Korean is no more acting like Han, and the
> precombined arrangements of LV syllables saves much on the SCSU
> window; gains are also significant for other binary compressors
> like LZW on any UTF scheme, and even with Huffman or Arithmetic coding
> of UTF-16*/UTF-32* schemes.


This seems reasonable, except that you have to transform the text from
its original representation to this special, compression-friendly
format.  Data to be compressed will not come pre-packaged in this
partially decomposed form, but will likely be either fully composed
syllables or fully decomposed jamos.  So you really have to perform two
layers of transformation, one to prepare the data for compression and
another to actually compress it, and of course you must do the same
thing in reverse to decompress the data.
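
To make the two layers concrete, here is a rough Python sketch of how I
read the proposal.  The function names are my own, and zlib is only a
stand-in for SCSU, which the Python standard library does not provide.
I am also assuming that getting NFC text back is acceptable; input that
arrived as fully decomposed jamos comes back composed.

```python
import unicodedata
import zlib

# Hangul syllable arithmetic constants from the Unicode Standard.
S_BASE = 0xAC00   # first precomposed syllable
T_BASE = 0x11A7   # one below the first trailing consonant jamo
T_COUNT = 28      # trailing slots per LV pair, including "no trailing"
S_COUNT = 11172   # total precomposed syllables (19 * 21 * 28)

def partially_decompose(text):
    """Layer 1: normalize to NFC, then split each LVT syllable into LV + T."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        s_index = ord(ch) - S_BASE
        t_index = s_index % T_COUNT
        if 0 <= s_index < S_COUNT and t_index != 0:
            out.append(chr(S_BASE + s_index - t_index))  # LV syllable
            out.append(chr(T_BASE + t_index))            # trailing jamo
        else:
            out.append(ch)  # non-Hangul, or an LV syllable already
    return "".join(out)

def pack(text):
    """Layer 1 then layer 2 (zlib here, as a stand-in compressor)."""
    return zlib.compress(partially_decompose(text).encode("utf-8"))

def unpack(data):
    """Reverse both layers; NFC recomposes LV + T back into LVT."""
    return unicodedata.normalize("NFC", zlib.decompress(data).decode("utf-8"))
```

Whether the extra layer pays off would have to be measured on real
Korean text with the compressor actually in use; the sketch only shows
that the round trip is mechanical.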

This adds complexity, but is sometimes worth the effort.  The
Burrows-Wheeler block-sorting approach, for example, achieves very good
results by adding a preprocessing step before "conventional" Huffman or
arithmetic compression.
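
For anyone who has not seen it, the block-sorting step itself can be
sketched in a few lines of Python.  This is the naive version that sorts
every rotation, fine for illustration but far too slow for real blocks.
The end-of-string sentinel (U+0000 here) is my own choice and is assumed
not to occur in the input.

```python
def bwt(s, eos="\x00"):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s = s + eos  # sentinel marks the original end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last, eos="\x00"):
    """Rebuild the original by repeatedly prepending and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row ending in the sentinel is the original string plus sentinel.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]
```

The output is no shorter than the input; the transform only groups
similar characters together so that the later entropy coder has an
easier job.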

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
