Jungshik Shin writes on Sun 23-Nov-2003 03:51 to Doug Ewell:

> On Thu, 20 Nov 2003, Doug Ewell wrote:
> 
> > Jungshik Shin <jshin at mailaps dot org> wrote:
> >
> > The file is all in syllables, not jamos, which I guess means it's in
> > NFC.
> 
>   Yes, it's in NFC, then.
> 
> > The statistics on this file are as follows:
> >
> > UTF-16    6,634,430 bytes
> > UTF-8    7,637,601 bytes
> > SCSU    6,414,319 bytes
> > BOCU-1    5,897,258 bytes
> > Legacy encoding (*)    5,477,432 bytes
> >     (*) KS C 5601, KS X 1001, or EUC-KR
> 
> > Only the large number of spaces and full
> > stops in this file prevented SCSU from degenerating entirely to 2 bytes
> > per character.
> 
>    That's why I asked. What I'm curious about is how SCSU and BOCU-1
> of NFD (and what I and Kent [2] think should have been NFD, with the
> possible code point rearrangement of the Jamo block to facilitate a
> smaller window size for SCSU) would compare with uncompressed UTF-16 of
> NFC (SCSU/BOCU isn't much better than UTF-16). A back-of-the-envelope
> calculation gives me 2.5 ~ 3 bytes per syllable (without the code point
> rearrangement to put them within a 64-character-long window [1]), so it's
> still worse than UTF-16. However, that's not as bad as ~5 bytes (or more)
> per syllable without SCSU/BOCU-1. I have to confess that I have only a
> very cursory understanding of SCSU/BOCU-1.
> 
>  As for why I'm interested in NFD, one of the reasons is that it is
> easier to implement incremental search (among many other things) in
> terms of NFD. The only __editor__ I've ever seen that supports
> incremental search for Korean per letter (as opposed to per syllable) is
> Hemacs (Hangul Emacs). Not having incremental search per letter in
> Korean is like not being able to match 'apples/apple/apply',
> 'interesting/interested', and 'learned' with 'appl', 'interest' and
> 'learn'.

For Korean text, I have found that a representation with "defective" syllables
performs better through SCSU. I mean here decomposing the LVT syllables of the
NFC form into LV and T, i.e. a partial decomposition that keeps the LV part
precomposed.

Then the text contains only L, V, or T jamos, or LV syllables (which are only
a small fraction of all the encoded Hangul syllables), and the text has this
form:
        ( ((T* L* LV V*) - (T* L+ V)) | X )*
where X stands for other, non-Hangul characters (Z-separators, punctuation,
numbers, symbols and C-others, or some ideographic letters). One can get it by
computing the standard NFC algorithm, but with supplementary composition
exclusions for the LVT syllables.
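
Concretely, here is a minimal Python sketch of that partial composition, under
the assumption that it can be obtained by starting from NFD and recomposing
only the L+V pairs with the standard Hangul composition arithmetic (the
function name is mine; non-Hangul characters are simply left in their NFD
form, which is a simplification):

    import unicodedata

    SBASE, LBASE, VBASE = 0xAC00, 0x1100, 0x1161
    LCOUNT, VCOUNT, TCOUNT = 19, 21, 28

    def partial_compose(text):
        # Start from NFD, then recompose only L+V pairs into LV syllables,
        # leaving every trailing (T) jamo decomposed; this behaves like NFC
        # with the LVT syllables added to the composition exclusions.
        out = []
        for ch in unicodedata.normalize("NFD", text):
            cp = ord(ch)
            if out and VBASE <= cp < VBASE + VCOUNT:
                prev = ord(out[-1])
                if LBASE <= prev < LBASE + LCOUNT:
                    # L + V -> LV, using the standard Hangul arithmetic
                    lv = SBASE + ((prev - LBASE) * VCOUNT + (cp - VBASE)) * TCOUNT
                    out[-1] = chr(lv)
                    continue
            out.append(ch)  # T jamos and non-Hangul characters pass through
        return "".join(out)

    # '한국어' (NFC) becomes 하+ᆫ, 구+ᆨ, 어: only LV syllables and T jamos remain.
    print([hex(ord(c)) for c in partial_compose("한국어")])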

With this constraint, Korean no longer behaves like Han, and the precomposed
arrangement of LV syllables saves a lot on the SCSU windows; the gains are
also significant for other binary compressors like LZW on any UTF scheme, and
even with Huffman or arithmetic coding of the UTF-16*/UTF-32* schemes.
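
I did not measure SCSU itself here, but the claim about generic binary
compressors is easy to sanity-check with zlib's DEFLATE (LZ77 plus Huffman)
standing in for LZW/Huffman; this small harness (the names are mine) compares
the NFC and NFD forms, and the partial form from the sketch above can be added
as a third row:

    import unicodedata, zlib

    def compressed_sizes(text):
        # Compare DEFLATE-compressed sizes of the NFC and NFD forms of a text
        # in two UTF encoding schemes.
        sizes = {}
        for form in ("NFC", "NFD"):
            s = unicodedata.normalize(form, text)
            for enc in ("utf-8", "utf-16-le"):
                sizes[(form, enc)] = len(zlib.compress(s.encode(enc)))
        return sizes

    sample = "대한민국은 민주공화국이다. " * 1000  # replace with a real corpus
    for key, size in sorted(compressed_sizes(sample).items()):
        print(key, size)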

Performance for incremental search is also better than with full
decomposition, as there is still a significant reduction in the number of
code points compared to NFD's fully decomposed form.
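
To illustrate the per-letter matching point (your 'appl'/'apple' analogy): a
minimal Python sketch, assuming both the text and the query are simply
decomposed before searching (the function name is mine, and it only uses the
standard unicodedata module):

    import unicodedata

    def jamo_find(text, query):
        # Decompose both strings to NFD so that a partially typed syllable
        # can prefix-match a complete one (per-letter incremental search).
        nfd_text = unicodedata.normalize("NFD", text)
        nfd_query = unicodedata.normalize("NFD", query)
        return nfd_text.find(nfd_query)  # index into the NFD form

    # '하' (U+D558) is found inside '한국' because NFD('한') starts with
    # NFD('하'); a plain NFC substring search would miss it.
    print(jamo_find("한국", "하"))  # 0
    print("한국".find("하"))        # -1

With the partial decomposition, the same trick needs a little more care when
the query ends in a bare leading consonant, since the LV part of the text
stays precomposed there.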

However, I did not test this on very large text files like the ones you
mention here. Maybe you could test it.

