RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

Kent Karlsson Mon, 24 Nov 2003 05:23:35 -0800

...
> >> Of course, no compression format applied to jamos could
> >> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> >> syllable.


I wonder why Hangul would need compression over and above
any other alphabetic script... It already has quite a lot of compression
in the form of precomposed syllables. I think we had better start a project
for allocating precomposed "syllables" for many other scripts,
precomposed Latin script syllables, precomposed Greek script
syllables, precomposed Tamil script syllables (most of the Brahmic-
derived scripts are especially disadvantaged, from a 'compression'
viewpoint, by the virama characters), etc. That should take up much
of the excess space in the unused planes (3-13, decimal).
Unfortunately that means 4 bytes per non-Hangul syllable (before
byte-oriented compression is done), but that could be compensated
by using an SCSU-like approach, just with bigger windows.

        No, this was not serious ;-)
        /kent k

PS
Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV.
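
To make the byte counts and the L, V, T order concrete, here is a rough
sketch using the standard Hangul syllable arithmetic from the Unicode
Standard. The function names (decompose, utf16_bytes) and the sample
syllable are just illustrations of mine, not part of any proposal in
this thread.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def decompose(syllable):
    """Split one precomposed Hangul syllable into L, V and optional T jamos."""
    index = ord(syllable) - S_BASE
    l = L_BASE + index // (V_COUNT * T_COUNT)
    v = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    t = T_BASE + index % T_COUNT
    # T_BASE itself means "no trailing consonant".
    jamos = [l, v] + ([t] if t != T_BASE else [])
    return "".join(chr(cp) for cp in jamos)


def utf16_bytes(text):
    """Size of text in UTF-16 without a byte order mark."""
    return len(text.encode("utf-16-be"))


syllable = "\uD55C"          # HANGUL SYLLABLE HAN
jamos = decompose(syllable)  # leading consonant, vowel, trailing consonant

print([f"U+{ord(c):04X}" for c in jamos])
print(utf16_bytes(syllable), "bytes precomposed,", utf16_bytes(jamos), "bytes as jamos")
# A code point in plane 3 needs a surrogate pair in UTF-16.
print(utf16_bytes("\U00030000"), "bytes for one supplementary-plane code point")
```

The hand-rolled decomposition should agree with
unicodedata.normalize("NFD", syllable), which is the safer thing to call
in real code.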
