> Of course, no compression format applied to jamos could even do as
> well as UTF-16 applied to syllables, i.e. 2 bytes per syllable.
This needs a bit of qualification. An arithmetic coder would do
better, for example, and so would even a simple scheme that assigned
short codes to the most frequent jamo sequences. Perhaps the above is
better phrased as "no simple byte-level compression format...".

Mark
__________________________________
http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]>
Sent: Sat, 2003 Nov 22 22:53
Subject: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

> Jungshik Shin <jshin at mailaps dot org> wrote:
>
> >> The file they used, called "arirang.txt," contains over 3.3 million
> >> Unicode characters and was apparently once part of their "Florida
> >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for
> >> reasons not known to me. I can supply it if you're interested.
> >
> > It'd be great if you could.
>
> Try
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
> first. If that doesn't work, I'll send you a copy. It's over 5
> megabytes, so I'd like to avoid that if possible.
>
> >> The statistics on this file are as follows:
> >>
> >>   UTF-16                 6,634,430 bytes
> >>   UTF-8                  7,637,601 bytes
> >>   SCSU                   6,414,319 bytes
> >>   BOCU-1                 5,897,258 bytes
> >>   Legacy encoding (*)    5,477,432 bytes
> >>
> >>   (*) KS C 5601, KS X 1001, or EUC-KR
> >
> > Sorry to pick on this (when I have to thank you). Even with the
> > coded character set vs. character encoding scheme distinction set
> > aside (that is, thinking purely in terms of character repertoire),
> > KS C 5601/KS X 1001 _alone_ cannot represent any Korean text unless
> > you're willing to live with double-width spaces, Latin letters,
> > numbers, and punctuation. (Since you wrote that the file apparently
> > has full stops and spaces in ASCII, it does include characters
> > outside KS X 1001.) On the other hand, EUC-KR (KS X 1001 + ISO
> > 646:KR/US-ASCII) can. Actually, I suspect the legacy encoding used
> > was Windows code page 949 (or JOHAB/Windows-1361?), because I can't
> > imagine there is not a single syllable outside the character
> > repertoire of KS X 1001 among over 2 million syllables.
>
> Sorry, I should have noticed on Atkin and Stansifer's data page
> (http://www.cs.fit.edu/~ryan/compress/) that the file is in EUC-KR.
> All I knew was that I was able to import it into SC UniPad using the
> option marked "KS C 5601 / KS X 1001, EUC-KR (Korean)".
>
> >> I used my own SCSU encoder to achieve these results, but it really
> >> wouldn't matter which was chosen -- Korean syllables can be encoded
> >> in SCSU *only* by using Unicode mode. It's not possible to set a
> >> window to the Korean syllable range.
> >
> > Now that you've told me you used NFC, isn't this situation similar
> > to that of Chinese text? How do BOCU and SCSU work for Chinese
> > text? Japanese text might do slightly better thanks to the kana,
> > but isn't likely to be much better.
>
> Well, *I* didn't use NFC for anything. That's just how the file came
> to me. And yes, the situation is exactly the same for Chinese text,
> except I suppose that with 20,000-some basic Unihan characters, plus
> Extension A and B, plus the compatibility guys starting at U+F900,
> one might not realistically expect any better than 16 bits per
> character. OTOH, when dealing with 11,172 Hangul syllables
> interspersed with Basic Latin, I imagine there is some room for
> improvement over UTF-16.
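To make the window arithmetic above concrete, here is a minimal sketch
(Python) of the Hangul syllable arithmetic defined in the Unicode
Standard; the constants are as specified there, but the function and
variable names are mine, not anything from this thread:

    # Hangul syllable <-> conjoining-jamo arithmetic per the Unicode
    # Standard, chapter 3.
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
    N_COUNT = V_COUNT * T_COUNT     # 588 syllables per leading consonant
    S_COUNT = L_COUNT * N_COUNT     # 11,172 precomposed syllables

    def to_jamos(syllable: str) -> str:
        """Decompose a precomposed syllable into 2 or 3 conjoining jamos."""
        i = ord(syllable) - S_BASE
        if not 0 <= i < S_COUNT:
            return syllable         # not a precomposed Hangul syllable
        l = chr(L_BASE + i // N_COUNT)
        v = chr(V_BASE + (i % N_COUNT) // T_COUNT)
        t = i % T_COUNT             # 0 means no trailing consonant
        return l + v + (chr(T_BASE + t) if t else "")

    WINDOW = 128                    # one SCSU dynamic window spans 128 slots
    print(S_COUNT / WINDOW)         # 87.28...: windows' worth of syllables
    print(0x11C2 - 0x1100 + 1)      # 195: span holding all modern jamos

    print(" ".join(hex(ord(c)) for c in to_jamos("\uD55C")))
    # 0x1112 0x1161 0x11ab (HAN = HIEUH + A + NIEUN)

Since the syllable block is some 87 times wider than a window, an SCSU
encoder has no choice but Unicode mode for precomposed text; decomposed,
everything lands in two adjacent windows, which is exactly the
switching described below.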
>
> I'm intrigued by the improved performance of BOCU-1 on Korean text,
> and I'm now interested in finding a way to achieve even better
> compression of Hangul syllables, using a strategy *not* much more
> complex than SCSU or BOCU and *not* involving huge reordering tables.
> Your assistance, and anyone else's, would be welcome. Googling for
> "Korean compression" or "Hang[e]ul compression" turns up practically
> nothing, so there is a chance to break some new ground here.
>
> John Cowan <cowan at mercury dot ccil dot org> responded to
> Jungshik's comment about kana:
>
> > The SCSU paper claims that Japanese does *much* better in SCSU than
> > UTF-16, thanks to the kana.
>
> The example in Section 9.3 would appear to substantiate that claim,
> as 116 Unicode characters (= 232 bytes of UTF-16) are compressed to
> 178 bytes of SCSU.
>
> Back to Jungshik:
>
> >> Only the large number of spaces and full stops in this file
> >> prevented SCSU from degenerating entirely to 2 bytes per
> >> character.
> >
> > That's why I asked. What I'm curious about is how SCSU and BOCU of
> > NFD (and of what Kent [2] and I think should have been NFD, with a
> > possible code point rearrangement of the jamo block to facilitate a
> > smaller window size for SCSU) would compare with uncompressed
> > UTF-16 of NFC (where SCSU/BOCU isn't much better than UTF-16). A
> > back-of-the-envelope calculation gives me 2.5 ~ 3 bytes per
> > syllable (without the code point rearrangement to put them within a
> > 64-character-long window [1]), so it's still worse than UTF-16.
> > However, that's not as bad as ~5 bytes (or more) per syllable
> > without SCSU/BOCU-1. I have to confess that I have only a very
> > cursory understanding of SCSU/BOCU-1.
>
> When this file is broken down into jamos (NFD), SCSU regains its
> supremacy:
>
>   UTF-8:     17,092,140 bytes
>   BOCU-1:     8,728,553 bytes
>   SCSU:       7,750,957 bytes
>
> And you are correct that SCSU (and, for that matter, BOCU-1)
> performance would have been better if the jamos used in modern Korean
> had been arranged to fit in a 128-character window (64 would not have
> been necessary). As it is, SCSU does have to do some switching
> between the two windows. Of course, no compression format applied to
> jamos could even do as well as UTF-16 applied to syllables, i.e. 2
> bytes per syllable.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
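The qualification at the top is also easy to test empirically: the
output size of an ideal order-0 (memoryless) arithmetic coder is the
order-0 entropy of the jamo stream, which can be compared directly
against 2 bytes per syllable. A minimal sketch in Python -- only the
file name and its EUC-KR encoding come from the thread; the rest is my
own scaffolding:

    import math
    import unicodedata
    from collections import Counter

    def order0_entropy_bits(text: str) -> float:
        """Total output bits of an ideal order-0 arithmetic coder."""
        counts = Counter(text)
        n = len(text)
        return sum(-c * math.log2(c / n) for c in counts.values())

    with open("arirang.txt", encoding="euc-kr") as f:  # corpus from the thread
        raw = f.read()

    nfc = unicodedata.normalize("NFC", raw)
    nfd = unicodedata.normalize("NFD", raw)

    # Every character in this corpus is in the BMP, so UTF-16 is 2 bytes each.
    print(f"UTF-16 of NFC:       {2 * len(nfc):,} bytes")
    print(f"order-0 bound (NFD): {order0_entropy_bits(nfd) / 8:,.0f} bytes")

Because jamo frequencies are heavily skewed, the order-0 bound should
already come in under 2 bytes per syllable, before any modeling of jamo
*sequences*; that is the sense in which "no compression format" above
is too strong.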