Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
- Original Message - From: "Frank Yung-Fong Tang" <[EMAIL PROTECTED]>

> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR
>
> What is the size of gzip these? Just wondering:
> gzip of UTF-16
> gzip of UTF-8
> gzip of SCSU
> gzip of BOCU-1
> gzip of Legacy encoding

Based on the principles that underlie the gzip encoding, and on the fact that the UTF-8 encoding has many three-byte combinations while UTF-16, SCSU, BOCU-1, and the legacy encoding use two-byte combinations for the same characters, I expect that the *relative* sizes of the gzipped results will (within ignorable fluctuation) approximately track the relative sizes of the unzipped versions, with perhaps an extra penalty for UTF-8 because the 24-bit combinations interact worse with the gzip architecture than the 16-bit combinations do. But that's speculation.

From the work of Atkins et al. as reported by Doug Ewell, I would further expect that BW-type compression would give (practically) indistinguishable results for all five cases, as BW has been shown to be particularly insensitive to the encoding form, unlike Huffman or gzip, which work best with true 8-bit symbols.

A./
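The speculation above is easy to test on a small scale. Here is a minimal sketch, assuming Python with its standard zlib module, and using a short repeated Hangul sample rather than the actual arirang.txt corpus, that DEFLATE-compresses the same text in two encoding forms and prints the raw and compressed sizes:

```python
import zlib

# Assumption: a short, repetitive Hangul sample stands in for the real
# corpus; the absolute numbers will differ, the point is the relative sizes.
sample = ("\uc544\ub9ac\ub791 \uc544\ub9ac\ub791 \uc544\ub77c\ub9ac\uc694 "
          "\uc544\ub9ac\ub791 \uace0\uac1c\ub85c \ub118\uc5b4\uac04\ub2e4. ") * 200

for name, codec in [("UTF-8", "utf-8"), ("UTF-16", "utf-16-le")]:
    raw = sample.encode(codec)
    packed = zlib.compress(raw, 9)   # DEFLATE, the same algorithm gzip uses
    print(f"{name:6}  raw={len(raw):6}  deflate={len(packed):6}")
```

Whether the UTF-8 penalty survives compression on real text is exactly the open question in the thread; this only shows how one would measure it.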
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Frank Yung-Fong Tang wrote:

>> UTF-16                6,634,430 bytes
>> UTF-8                 7,637,601 bytes
>> SCSU                  6,414,319 bytes
>> BOCU-1                5,897,258 bytes
>> Legacy encoding (*)   5,477,432 bytes
>> (*) KS C 5601, KS X 1001, or EUC-KR
>
> What is the size of gzip these? Just wondering:
> gzip of UTF-16
> gzip of UTF-8
> gzip of SCSU
> gzip of BOCU-1
> gzip of Legacy encoding

I don't have gzip, but I can give you the PKZip sizes, which should be quite similar:

UTF-16   2,685,232 bytes
UTF-8    2,774,356 bytes
SCSU     2,756,470 bytes
BOCU-1   2,772,418 bytes
EUC-KR   2,518,201 bytes

Note that the largest of these is only 10.2% larger than the smallest.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
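The 10.2% figure can be checked directly from the numbers above. A trivial sketch (the sizes are just the PKZip results quoted in the message):

```python
# PKZip sizes quoted above, in bytes.
sizes = {
    "UTF-16": 2_685_232,
    "UTF-8":  2_774_356,
    "SCSU":   2_756_470,
    "BOCU-1": 2_772_418,
    "EUC-KR": 2_518_201,
}

# Largest relative to smallest: (max / min) - 1.
spread = max(sizes.values()) / min(sizes.values()) - 1
print(f"{spread:.1%}")  # prints 10.2%
```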
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Someone else originated that list.

Mark
__
http://www.macchiato.com

- Original Message - From: "Frank Yung-Fong Tang" <[EMAIL PROTECTED]> To: "Mark Davis" <[EMAIL PROTECTED]> Cc: "Doug Ewell" <[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>; "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]> Sent: Tue, 2003 Dec 02 15:03 Subject: Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

Mark Davis wrote:

> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR

What is the size of gzip these? Just wondering:
gzip of UTF-16
gzip of UTF-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding

--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM: yungfongta   mailto:[EMAIL PROTECTED]   Tel: 650-937-2913
Yahoo! Msg: frankyungfongtan
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Mark Davis wrote:

> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR

What is the size of gzip these? Just wondering:
gzip of UTF-16
gzip of UTF-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding

--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM: yungfongta   mailto:[EMAIL PROTECTED]   Tel: 650-937-2913
Yahoo! Msg: frankyungfongtan
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Philippe Verdy wrote:

> The question of Latin letters with two diacritics added in Latin
> Extended-B does not seem to respect this constraint, as it is not
> justified in the Vietnamese VISCII standard, which already does not
> contain characters with two diacritics but composes them
> with two characters in the limited CCS set.

Not true. If you like, I can send you a copy of the VISCII report showing not only the mappings but also their justification. The Viet-Std organization went to great lengths to avoid combining characters, even, as John said, to the point of encoding six graphic characters in the C-zero control area.

Perhaps you are thinking of Windows code page 1258, which includes many precomposed letters, but none in the Latin Extended-B block, and does require combining marks for vowels with two diacritics.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Philippe Verdy scripsit:

> The question of Latin letters with two diacritics added in Latin Extended-B
> does not seem to respect this constraint, as it is not justified in the
> Vietnamese VISCII standard, which already does not contain characters with two
> diacritics but composes them with two characters in the limited CCS set.

I'm not sure what standard you are referring to. There are three standards for Vietnamese text: VISCII 1.1 (de facto), TCVN 5712-1 (aka VSCII-1), and TCVN 5712-2 (aka VSCII-2). VISCII provides no combining characters, fills the C1 space with graphics, and even replaces certain C0 characters with graphics. 5712-1 provides combining characters and fills the C1 space with graphics. 5712-2 provides combining characters and leaves both C0 and C1 clear of graphics (and so is ISO 2022-compatible). But all of them provide at least some characters with double diacritics.

> I don't know why even ISO 10646 would have needed them, unless there's some
> Vietnamese DBCS standard that allows representing in a 94x94 matrix all
> letters with two diacritics as well as Han ideographs used in Vietnamese.

I very much doubt that any such encoding ever existed.

--
What is the sound of Perl? Is it not the sound of a [Ww]all that people have stopped banging their head against?  --Larry

John Cowan
[EMAIL PROTECTED]
http://www.ccil.org/~cowan
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
John Cowan writes:

> > You are, because the floodgates, while once open, have been closed by
> > normalization.
>
> Indeed, they were opened in Unicode 1.1, as a result of the merger with
> FDIS 10646; since then, only 46 characters with canonical decompositions
> have been added to Unicode (excepting compatibility ideographs, which
> are a special case).

In fact ISO 10646 is meant to allow an easy one-to-one mapping between existing standard coded character sets (CCS) and unified code points. Accepting precomposed characters is then a necessity when precomposed characters exist in a legacy CCS standard. But they are included only for compatibility (exactly as with compatibility ideographs).

The question of Latin letters with two diacritics added in Latin Extended-B does not seem to respect this constraint, as it is not justified in the Vietnamese VISCII standard, which already does not contain characters with two diacritics but composes them with two characters in the limited CCS set.

I don't know why even ISO 10646 would have needed them, unless there's some Vietnamese DBCS standard that allows representing in a 94x94 matrix all letters with two diacritics as well as Han ideographs used in Vietnamese. I looked within the IBM database of charsets (CCS+CES) and could not find a reference to such an EUC-style DBCS. So was it because there was an ongoing/unfinished DBCS standard for Vietnamese, working like GBK, SJIS, or KS C 5601?

__
<< ella for Spam Control >> has removed Spam messages and set aside Newsletters for me
You can use it too - and it's FREE! http://www.ellaforspam.com
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
At 08:23 -0500 2003-11-25, John Cowan wrote:

> Michael Everson scripsit:
>
> > Ridiculous. This happened centuries ago, and it is not "why" Ethiopic
> > was encoded as a syllabary. It was encoded as a syllabary because it
> > is a syllabary.
>
> Structurally it's an abugida, like Indic and UCAS.

I disagree. And I don't think Canadian Syllabics are an abugida. But let's leave this one alone, shall we?
--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Michael Everson scripsit:

> Ridiculous. This happened centuries ago, and it is not "why" Ethiopic
> was encoded as a syllabary. It was encoded as a syllabary because it
> is a syllabary.

Structurally it's an abugida, like Indic and UCAS.

> You are, because the floodgates, while once open, have been closed by
> normalization.

Indeed, they were opened in Unicode 1.1, as a result of the merger with FDIS 10646; since then, only 46 characters with canonical decompositions have been added to Unicode (excepting compatibility ideographs, which are a special case). Specifically, 16 were added in Unicode 2.0, 29 in Unicode 3.0, and just one in Unicode 3.2 (the slashed version of a symbol added at the same time).

--
"What has four pairs of pants, lives in Philadelphia, and it never rains but it pours?"  --Rufus T. Firefly

John Cowan
http://www.reutershealth.com
[EMAIL PROTECTED]
http://www.ccil.org/~cowan
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 25/11/2003 03:54, Michael Everson wrote:

> At 03:41 -0800 2003-11-25, Peter Kirk wrote:
>
> > ... But the floodgates have already been opened - not just Ethiopic but
> > Greek Extended, much of Latin Extended, the Korean syllables which
> > started this discussion, the small amount of precomposed Hebrew which we
> > already have, etc. People have tried to force them shut, and with good
> > reason. But don't accuse me of starting something new.
>
> You are, because the floodgates, while once open, have been closed by
> normalization.

Read what I wrote before:

> This approach would certainly have simplified pointed Hebrew a lot, ...
> But I guess it is too late for a change now!

I recognised clearly that it is too late to make this change now, although it might have been a good thing to do when the floodgates were open (although as Mark pointed out it would not necessarily have made things easier). I don't want to reopen them.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
At 03:41 -0800 2003-11-25, Peter Kirk wrote:

> After all, Ethiopic was encoded as a syllabary just because the vowel
> points happen to have become attached to the base characters.

Ridiculous. This happened centuries ago, and it is not "why" Ethiopic was encoded as a syllabary. It was encoded as a syllabary because it is a syllabary.

> But the floodgates have already been opened - not just Ethiopic but Greek
> Extended, much of Latin Extended, the Korean syllables which started this
> discussion, the small amount of precomposed Hebrew which we already have,
> etc. People have tried to force them shut, and with good reason. But
> don't accuse me of starting something new.

You are, because the floodgates, while once open, have been closed by normalization.
--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 24/11/2003 17:56, Christopher John Fynn wrote:

> "Peter Kirk" <[EMAIL PROTECTED]> wrote:
>
> > This approach would certainly have simplified pointed Hebrew a lot, so
> > much so that it could well be serious. After all, Ethiopic was encoded
> > as a syllabary just because the vowel points happen to have become
> > attached to the base characters. And we already have some precomposed
> > Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late
> > for a change now!
>
> Please don't even think of it - acceptance of any proposal for
> precomposed characters for one script would open the floodgates
> for similar proposals for other scripts.
>
> -- Christopher J. Fynn

But the floodgates have already been opened - not just Ethiopic but Greek Extended, much of Latin Extended, the Korean syllables which started this discussion, the small amount of precomposed Hebrew which we already have, etc. People have tried to force them shut, and with good reason. But don't accuse me of starting something new.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Christopher John Fynn wrote:

> "Peter Kirk" <[EMAIL PROTECTED]> wrote:
>
> > This approach would certainly have simplified pointed Hebrew a lot, so
> > much so that it could well be serious. After all, Ethiopic was encoded
> > as a syllabary just because the vowel points happen to have become
> > attached to the base characters. And we already have some precomposed
> > Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late
> > for a change now!
>
> Please don't even think of it - acceptance of any proposal for
> precomposed characters for one script would open the floodgates
> for similar proposals for other scripts.

Isn't that what happened to the Latin script, with floods of precomposed characters, notably letters with double accents that were not necessary to support and map correctly the VISCII standard?

The floodgates are already open, but the composition stability policy would require that most additional precomposed characters be excluded from normalized composition forms. Since the introduction of precomposed but excluded characters would not occur in any normalized text, it would be justified only to support a bijective mapping with another standard that allows a distinction between composed and decomposed characters...
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 11/24/03 20:56, Christopher John Fynn wrote:

> "Peter Kirk" <[EMAIL PROTECTED]> wrote:
>
> > This approach would certainly have simplified pointed Hebrew a lot, so
> > much so that it could well be serious. After all, Ethiopic was encoded
> > as a syllabary just because the vowel points happen to have become
> > attached to the base characters. And we already have some precomposed
> > Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late
> > for a change now!
>
> Please don't even think of it - acceptance of any proposal for
> precomposed characters for one script would open the floodgates
> for similar proposals for other scripts.

I really don't think this is a good model for Hebrew anyway. Besides, if you think the weird exceptions of Biblical typesetting are a pain with the current consonant+vowel model, imagine what a nightmare they'd be with precomposed syllables.

~mark
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
"Peter Kirk" <[EMAIL PROTECTED]> wrote:

> This approach would certainly have simplified pointed Hebrew a lot, so
> much so that it could well be serious. After all, Ethiopic was encoded
> as a syllabary just because the vowel points happen to have become
> attached to the base characters. And we already have some precomposed
> Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late for
> a change now!

Please don't even think of it - acceptance of any proposal for precomposed characters for one script would open the floodgates for similar proposals for other scripts.

-- Christopher J. Fynn
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Peter Kirk scripsit:

> This approach would certainly have simplified pointed Hebrew a lot, so
> much so that it could well be serious.

There are an awful lot of possibilities, and it's not clear that spinning them out a la Hangul really makes sense.

> After all, Ethiopic was encoded
> as a syllabary just because the vowel points happen to have become
> attached to the base characters.

Well, more because Ethiopic-script users think of the letters as part of a syllabary, though historically it's an abugida. The original design for Unicode Ethiopic used an alphabetic representation -- someone else can probably tell you more about the nitty-gritty of why it was rejected.

> And we already have some precomposed
> Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late for
> a change now!

It certainly is.

--
Go, and never darken my towels again!  --Rufus T. Firefly

John Cowan
www.ccil.org/~cowan
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Kent Karlsson wrote:

> Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV.

Sorry. I use the acronym TLV, which in French means "Type, Longueur, Valeur" (and is completely unrelated to Unicode or Hangul syllable types), so often that it gets confused with the English LVT for "Leading consonant, Vowel, Trailing consonant". Acronyms are so misleading...
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 24/11/2003 03:29, Kent Karlsson wrote:

> ... I wonder why Hangul would need compression over and above any other
> alphabetic script... It has already quite a lot of compression in the
> form of precomposed syllables. I think we better start a project for
> allocating precomposed "syllables" for many other scripts, ...
>
> No, this was not serious ;-)
>
> /kent k

This approach would certainly have simplified pointed Hebrew a lot, so much so that it could well be serious. After all, Ethiopic was encoded as a syllabary just because the vowel points happen to have become attached to the base characters. And we already have some precomposed Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late for a change now!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
...
> >> Of course, no compression format applied to jamos could
> >> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> >> syllable.

I wonder why Hangul would need compression over and above any other alphabetic script... It already has quite a lot of compression in the form of precomposed syllables.

I think we had better start a project for allocating precomposed "syllables" for many other scripts: precomposed Latin script syllables, precomposed Greek script syllables, precomposed Tamil script syllables (most of the Brahmi-derived scripts are especially disadvantaged, from a 'compression' viewpoint, by the virama characters), etc. That should take up much of the excess space in the unused planes (3-13, decimal). Unfortunately that means 4 bytes per non-Hangul syllable (before byte-oriented compression is done), but that could be compensated for by using an SCSU-like approach, just with bigger windows.

No, this was not serious ;-)

/kent k

PS: Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV.
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Mark Davis wrote:

> > Of course, no compression format applied to jamos could
> > even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> > syllable.
>
> This needs a bit of qualification. An arithmetic compression would do
> better, for example, or even just a compression that took the most
> frequent jamo sequences. Perhaps the above is better phrased as 'no
> simple byte-level compression format...'.

Yes, that's what I meant: a compression *format* like SCSU or BOCU-1, as opposed to a (general-purpose) compression *algorithm* like Huffman or LZ or arithmetic coding. The distinction makes sense in the context of my paper, but I probably should have explained it here.

BTW, the paper is awaiting final comments from one last reviewer.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
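The 2-bytes-per-syllable baseline that Mark and Doug are discussing is easy to see with Python's unicodedata module. A sketch, where the three-syllable sample ("Korean language" in Hangul) is just an illustration:

```python
import unicodedata

text = "\ud55c\uad6d\uc5b4"          # three precomposed Hangul syllables (NFC)
nfd = unicodedata.normalize("NFD", text)

# NFC: one 16-bit code unit (2 bytes) per syllable.
# NFD: 2-3 conjoining jamo (4-6 bytes) per syllable.
print(len(text), len(text.encode("utf-16-le")))  # 3 code units, 6 bytes
print(len(nfd), len(nfd.encode("utf-16-le")))    # 8 code units, 16 bytes
```

This is why a compression format working on jamo has to beat roughly 5 bytes per syllable of raw NFD just to match uncompressed UTF-16 NFC.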
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
> Of course, no compression format applied to jamos could
> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> syllable.

This needs a bit of qualification. An arithmetic compression would do better, for example, or even just a compression that took the most frequent jamo sequences. Perhaps the above is better phrased as 'no simple byte-level compression format...'.

Mark
__
http://www.macchiato.com

- Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Cc: "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]> Sent: Sat, 2003 Nov 22 22:53 Subject: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

> Jungshik Shin wrote:
>
> >> The file they used, called "arirang.txt," contains over 3.3 million
> >> Unicode characters and was apparently once part of their "Florida
> >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for
> >> reasons not known to me. I can supply it if you're interested.
> >
> > It'd be great if you could.
>
> Try
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
> first. If that doesn't work, I'll send you a copy. It's over 5
> megabytes, so I'd like to avoid that if possible.
>
> >> The statistics on this file are as follows:
> >>
> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR
> >
> > Sorry to pick on this (when I have to thank you). Even with the
> > coded character set vs. character encoding scheme distinction aside
> > (that is, if we just think in terms of character repertoire), KS C 5601/
> > KS X 1001 _alone_ cannot represent any Korean text unless you're
> > willing to live with double-width space, Latin letters, numbers and
> > punctuation (since you wrote that the file apparently has full stops and
> > spaces in ASCII, it does include characters outside KS X 1001). On the
> > other hand, EUC-KR (KS X 1001 + ISO 646:KR/US-ASCII) can. Actually, I
> > suspect the legacy encoding used was Windows codepage 949 (or JOHAB/
> > Windows-1361?) because I can't imagine there is not a single syllable
> > (that is, outside the character repertoire of KS X 1001) out of over 2
> > million syllables.
>
> Sorry, I should have noticed on Atkin and Stansifer's data page
> (http://www.cs.fit.edu/~ryan/compress/) that the file is in EUC-KR. All
> I knew was that I was able to import it into SC UniPad using the option
> marked "KS C 5601 / KS X 1001, EUC-KR (Korean)".
>
> >> I used my own SCSU encoder to achieve these results, but it really
> >> wouldn't matter which was chosen -- Korean syllables can be encoded
> >> in SCSU *only* by using Unicode mode. It's not possible to set a
> >> window to the Korean syllable range.
> >
> > Now that you told me you used NFC, isn't this condition similar to
> > Chinese text? How do BOCU and SCSU work for Chinese text? Japanese
> > text might do slightly better with Kana, but isn't likely to be much
> > better.
>
> Well, *I* didn't use NFC for anything. That's just how the file came to
> me. And yes, the situation is exactly the same for Chinese text, except
> I suppose that with 20,000-some basic Unihan characters, plus Extension
> A and B, plus the compatibility guys starting at U+F900, one might not
> realistically expect any better than 16 bits per character. OTOH, when
> dealing with 11,171 Hangul syllables interspersed with Basic Latin, I
> imagine there is some room for improvement over UTF-16.
>
> I'm intrigued by the improved performance of BOCU-1 on Korean text, and
> I'm now interested in finding a way to achieve even better compression
> of Hangul syllables, using a strategy *not* much more complex than SCSU
> or BOCU and *not* involving huge reordering tables. Your assistance,
> and anyone else's, would be welcome. Googling for "Korean compression"
> or "Hang[e]ul compression" turns up practically nothing, so there is a
> chance to break some new ground here.
>
> John Cowan responded to Jungshik's comment about Kana:
>
> > The SCSU paper claims that Japanese does *much* better in SCSU than
> > UTF-16, thanks to the kana.
>
> The example in Section 9.3 would appear to substantiate that claim, as
> 116 Unicode characters (= 232 bytes of UTF-16) are compressed to 178
> bytes of SCSU.
>
> Back to Jungshik:
>
> >> Only the large number of spaces and full
> >> stops in this file prevented SCSU from degenerating entirely to 2
> >> bytes per character.
> >
> > That's why I asked. What I'm curious about is how SCSU and BOCU
> > of NFD (and what I and Kent [2] think should have been NFD, with the
> > possible code point rearrangement of the Jamo block to facilitate a
> > smaller window size for SCSU) would compare with uncompressed UTF-16
> > of NFC (SCSU/BOCU isn't much better than UTF-16). The b