> Of course, no compression format applied to jamos could even do as
> well as UTF-16 applied to syllables, i.e. 2 bytes per syllable.
This needs a bit of qualification. An arithmetic coder would do
better, for example, and so would even a simple scheme that assigned
short codes to the most frequent jamo sequences. Perhaps the above is
better phrased as "no simple byte-level compression format...".

Mark
__________________________________
http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]>
Sent: Sat, 2003 Nov 22 22:53
Subject: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

> Jungshik Shin <jshin at mailaps dot org> wrote:
>
> >> The file they used, called "arirang.txt," contains over 3.3 million
> >> Unicode characters and was apparently once part of their "Florida
> >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for
> >> reasons not known to me. I can supply it if you're interested.
> >
> > It'd be great if you could.
>
> Try
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
> first. If that doesn't work, I'll send you a copy. It's over 5
> megabytes, so I'd like to avoid that if possible.
>
> >> The statistics on this file are as follows:
> >>
> >>   UTF-16                 6,634,430 bytes
> >>   UTF-8                  7,637,601 bytes
> >>   SCSU                   6,414,319 bytes
> >>   BOCU-1                 5,897,258 bytes
> >>   Legacy encoding (*)    5,477,432 bytes
> >>
> >>   (*) KS C 5601, KS X 1001, or EUC-KR
> >
> > Sorry to pick on this (when I have to thank you). Even with the
> > coded character set vs. character encoding scheme distinction set
> > aside (that is, thinking purely in terms of character repertoire),
> > KS C 5601/KS X 1001 _alone_ cannot represent any Korean text unless
> > you're willing to live with double-width spaces, Latin letters,
> > numbers, and punctuation. (Since you wrote that the file apparently
> > has full stops and spaces in ASCII, it does include characters
> > outside KS X 1001.) On the other hand, EUC-KR (KS X 1001 + ISO
> > 646:KR/US-ASCII) can. Actually, I suspect the legacy encoding used
> > was Windows code page 949 (or JOHAB/Windows-1361?), because I can't
> > imagine there is not a single syllable outside the character
> > repertoire of KS X 1001 among over 2 million syllables.
>
> Sorry, I should have noticed on Atkin and Stansifer's data page
> (http://www.cs.fit.edu/~ryan/compress/) that the file is in EUC-KR.
> All I knew was that I was able to import it into SC UniPad using the
> option marked "KS C 5601 / KS X 1001, EUC-KR (Korean)".
>
> >> I used my own SCSU encoder to achieve these results, but it really
> >> wouldn't matter which was chosen -- Korean syllables can be encoded
> >> in SCSU *only* by using Unicode mode. It's not possible to set a
> >> window to the Korean syllable range.
> >
> > Now that you've told me you used NFC, isn't this situation similar
> > to that of Chinese text? How do BOCU and SCSU work for Chinese
> > text? Japanese text might do slightly better thanks to the kana,
> > but isn't likely to be much better.
>
> Well, *I* didn't use NFC for anything. That's just how the file came
> to me. And yes, the situation is exactly the same for Chinese text,
> except I suppose that with 20,000-some basic Unihan characters, plus
> Extension A and B, plus the compatibility guys starting at U+F900,
> one might not realistically expect any better than 16 bits per
> character. OTOH, when dealing with 11,172 Hangul syllables
> interspersed with Basic Latin, I imagine there is some room for
> improvement over UTF-16.
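To make the window arithmetic above concrete, here is a minimal sketch
(Python) of the Hangul syllable arithmetic defined in the Unicode
Standard; the constants are as specified there, but the function and
variable names are mine, not anything from this thread:

    # Hangul syllable <-> conjoining-jamo arithmetic per the Unicode
    # Standard, chapter 3.
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
    N_COUNT = V_COUNT * T_COUNT     # 588 syllables per leading consonant
    S_COUNT = L_COUNT * N_COUNT     # 11,172 precomposed syllables

    def to_jamos(syllable: str) -> str:
        """Decompose a precomposed syllable into 2 or 3 conjoining jamos."""
        i = ord(syllable) - S_BASE
        if not 0 <= i < S_COUNT:
            return syllable         # not a precomposed Hangul syllable
        l = chr(L_BASE + i // N_COUNT)
        v = chr(V_BASE + (i % N_COUNT) // T_COUNT)
        t = i % T_COUNT             # 0 means no trailing consonant
        return l + v + (chr(T_BASE + t) if t else "")

    WINDOW = 128                    # one SCSU dynamic window spans 128 slots
    print(S_COUNT / WINDOW)         # 87.28...: windows' worth of syllables
    print(0x11C2 - 0x1100 + 1)      # 195: span holding all modern jamos

    print(" ".join(hex(ord(c)) for c in to_jamos("\uD55C")))
    # 0x1112 0x1161 0x11ab (HAN = HIEUH + A + NIEUN)

Since the syllable block is some 87 times wider than a window, an SCSU
encoder has no choice but Unicode mode for precomposed text; decomposed,
everything lands in two adjacent windows, which is exactly the
switching described below.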
>
> I'm intrigued by the improved performance of BOCU-1 on Korean text,
> and I'm now interested in finding a way to achieve even better
> compression of Hangul syllables, using a strategy *not* much more
> complex than SCSU or BOCU and *not* involving huge reordering tables.
> Your assistance, and anyone else's, would be welcome. Googling for
> "Korean compression" or "Hang[e]ul compression" turns up practically
> nothing, so there is a chance to break some new ground here.
>
> John Cowan <cowan at mercury dot ccil dot org> responded to
> Jungshik's comment about kana:
>
> > The SCSU paper claims that Japanese does *much* better in SCSU than
> > UTF-16, thanks to the kana.
>
> The example in Section 9.3 would appear to substantiate that claim,
> as 116 Unicode characters (= 232 bytes of UTF-16) are compressed to
> 178 bytes of SCSU.
>
> Back to Jungshik:
>
> >> Only the large number of spaces and full stops in this file
> >> prevented SCSU from degenerating entirely to 2 bytes per
> >> character.
> >
> > That's why I asked. What I'm curious about is how SCSU and BOCU of
> > NFD (and of what Kent [2] and I think should have been NFD, with a
> > possible code point rearrangement of the jamo block to facilitate a
> > smaller window size for SCSU) would compare with uncompressed
> > UTF-16 of NFC (where SCSU/BOCU isn't much better than UTF-16). A
> > back-of-the-envelope calculation gives me 2.5 ~ 3 bytes per
> > syllable (without the code point rearrangement to put them within a
> > 64-character-long window [1]), so it's still worse than UTF-16.
> > However, that's not as bad as ~5 bytes (or more) per syllable
> > without SCSU/BOCU-1. I have to confess that I have only a very
> > cursory understanding of SCSU/BOCU-1.
>
> When this file is broken down into jamos (NFD), SCSU regains its
> supremacy:
>
>   UTF-8:     17,092,140 bytes
>   BOCU-1:     8,728,553 bytes
>   SCSU:       7,750,957 bytes
>
> And you are correct that SCSU (and, for that matter, BOCU-1)
> performance would have been better if the jamos used in modern Korean
> had been arranged to fit in a 128-character window (64 would not have
> been necessary). As it is, SCSU does have to do some switching
> between the two windows. Of course, no compression format applied to
> jamos could even do as well as UTF-16 applied to syllables, i.e. 2
> bytes per syllable.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
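The qualification at the top is also easy to test empirically: the
output size of an ideal order-0 (memoryless) arithmetic coder is the
order-0 entropy of the jamo stream, which can be compared directly
against 2 bytes per syllable. A minimal sketch in Python -- only the
file name and its EUC-KR encoding come from the thread; the rest is my
own scaffolding:

    import math
    import unicodedata
    from collections import Counter

    def order0_entropy_bits(text: str) -> float:
        """Total output bits of an ideal order-0 arithmetic coder."""
        counts = Counter(text)
        n = len(text)
        return sum(-c * math.log2(c / n) for c in counts.values())

    with open("arirang.txt", encoding="euc-kr") as f:  # corpus from the thread
        raw = f.read()

    nfc = unicodedata.normalize("NFC", raw)
    nfd = unicodedata.normalize("NFD", raw)

    # Every character in this corpus is in the BMP, so UTF-16 is 2 bytes each.
    print(f"UTF-16 of NFC:       {2 * len(nfc):,} bytes")
    print(f"order-0 bound (NFD): {order0_entropy_bits(nfd) / 8:,.0f} bytes")

Because jamo frequencies are heavily skewed, the order-0 bound should
already come in under 2 bytes per syllable, before any modeling of jamo
*sequences*; that is the sense in which "no compression format" above
is too strong.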