Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-12-03 Thread Asmus Freytag
- Original Message - From: "Frank Yung-Fong Tang" <[EMAIL PROTECTED]> > > >> UTF-16 6,634,430 bytes > > >> UTF-8 7,637,601 bytes > > >> SCSU 6,414,319 bytes > > >> BOCU-1 5,897,258 bytes > > >> Legacy encoding (*) 5,477,432 bytes > > >> (*) KS C 5601, KS X 1001, or

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-12-02 Thread Doug Ewell
Frank Yung-Fong Tang wrote: >> UTF-16 6,634,430 bytes >> UTF-8 7,637,601 bytes >> SCSU 6,414,319 bytes >> BOCU-1 5,897,258 bytes >> Legacy encoding (*) 5,477,432 bytes >> (*) KS C 5601, KS X 1001, or EUC-KR) > > What is the size of gzip these? Just wonder > gzip of UTF-16 > gzi
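
[Editor's sketch] The gzip question above can be explored with a few lines of Python. This is only an illustration under stated assumptions: the sample string is made up and is not the "arirang.txt" file from the thread, the helper name encoded_sizes is hypothetical, and SCSU and BOCU-1 are left out because the Python standard library has no codecs for them.

```python
import gzip

def encoded_sizes(text, encodings=("utf-16-le", "utf-8", "euc_kr")):
    """Return {encoding: (raw_bytes, gzipped_bytes)} for the given text."""
    sizes = {}
    for enc in encodings:
        raw = text.encode(enc)
        sizes[enc] = (len(raw), len(gzip.compress(raw)))
    return sizes

# Made-up, highly repetitive sample; a real corpus will compress far less well.
sample = "아리랑 아리랑 아라리요 " * 1000

for enc, (raw, packed) in encoded_sizes(sample).items():
    print(f"{enc:10s} raw={raw:>8,d} gzip={packed:>8,d}")
```

The raw sizes follow the same ordering as the figures quoted in the thread (UTF-8 largest, then UTF-16, then the legacy encoding). The gzip figures depend heavily on the input, so nothing should be concluded from this toy sample.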

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-12-02 Thread Mark Davis
<[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>; "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]> Sent: Tue, 2003 Dec 02 15:03 Subject: Re: Korean compression (was: Re: Ternary search trees for Unicode d

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-12-02 Thread Frank Yung-Fong Tang
Mark Davis wrote: > > >> UTF-16 6,634,430 bytes > > >> UTF-8 7,637,601 bytes > > >> SCSU 6,414,319 bytes > > >> BOCU-1 5,897,258 bytes > > >> Legacy encoding (*) 5,477,432 bytes > > >> (*) KS C 5601, KS X 1001, or EUC-KR) What is the size of gzip these? Just wonder gzip

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Doug Ewell
Philippe Verdy wrote: > The question of Latin letters with two diacritics added in Latin > Extension B does not seem to respect this constraint, as it is not > justified in the Vietnamese VISCII standard that already does not > contain characters with two diacritics, but already composes them > wit

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread John Cowan
Philippe Verdy scripsit: > The question of Latin letters with two diacritics added in Latin Extension B > does not seem to respect this constraint, as it is not justified in the > Vietnamese VISCII standard that already does not contain characters with two > diacritics, but already composes them wit

RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Philippe Verdy
John Cowan writes: > > You are, because the floodgates, while once open, have been closed by > > normalization. > > Indeed, they were opened in Unicode 1.1, as a result of the merger with > FDIS 10646; since then, only 46 characters with canonical decompositions > have been added to Unicode (exce

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Michael Everson
At 08:23 -0500 2003-11-25, John Cowan wrote: Michael Everson scripsit: Ridiculous. This happened centuries ago, and it is not "why" Ethiopic was encoded as a syllabary. It was encoded as a syllabary because it is a syllabary. Structurally it's an abugida, like Indic and UCAS. I disagree. And I

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread John Cowan
Michael Everson scripsit: > Ridiculous. This happened centuries ago, and it is not "why" Ethiopic > was encoded as a syllabary. It was encoded as a syllabary because it > is a syllabary. Structurally it's an abugida, like Indic and UCAS. > You are, because the floodgates, while once open, have

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Peter Kirk
On 25/11/2003 03:54, Michael Everson wrote: At 03:41 -0800 2003-11-25, Peter Kirk wrote: ... But the floodgates have already been opened - not just Ethiopic but Greek extended, much of Latin extended, the Korean syllables which started this discussion, the small amount of precomposed Hebrew wh

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Michael Everson
At 03:41 -0800 2003-11-25, Peter Kirk wrote: After all, Ethiopic was encoded as a syllabary just because the vowel points happen to have become attached to the base characters. Ridiculous. This happened centuries ago, and it is not "why" Ethiopic was encoded as a syllabary. It was encoded as a s

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Peter Kirk
On 24/11/2003 17:56, Christopher John Fynn wrote: "Peter Kirk" <[EMAIL PROTECTED]> wrote: This approach would certainly have simplified pointed Hebrew a lot, so much so that it could well be serious. After all, Ethiopic was encoded as a syllabary just because the vowel points happen to have be

RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-25 Thread Philippe Verdy
Christopher John Fynn wrote: > "Peter Kirk" <[EMAIL PROTECTED]> wrote: > > > This approach would certainly have simplified pointed Hebrew a lot, so > > much so that it could well be serious. After all, Ethiopic was encoded > > as a syllabary just because the vowel points happen to have become > >

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-24 Thread Mark E. Shoulson
On 11/24/03 20:56, Christopher John Fynn wrote: "Peter Kirk" <[EMAIL PROTECTED]> wrote: This approach would certainly have simplified pointed Hebrew a lot, so much so that it could well be serious. After all, Ethiopic was encoded as a syllabary just because the vowel points happen to have becom

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-24 Thread Christopher John Fynn
"Peter Kirk" <[EMAIL PROTECTED]> wrote: > This approach would certainly have simplified pointed Hebrew a lot, so > much so that it could well be serious. After all, Ethiopic was encoded > as a syllabary just because the vowel points happen to have become > attached to the base characters. And w

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-24 Thread John Cowan
Peter Kirk scripsit: > This approach would certainly have simplified pointed Hebrew a lot, so > much so that it could well be serious. There are an awful lot of possibilities, and it's not clear that spinning them out a la Hangul really makes sense. > After all, Ethiopic was encoded > as a syll

RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-24 Thread Philippe Verdy
Kent Karlsson wrote: > Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV. Sorry, I so often use the acronym TLV, which in French means "Type, Longueur, Valeur" (and is completely unrelated to Unicode or Hangul syllable types), that I often confuse it with the English LVT for "Leading
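
[Editor's sketch] The LVT structure mentioned above can be made concrete with the arithmetic decomposition of precomposed Hangul syllables defined in the Unicode Standard. The constants below are the standard ones; the function name decompose_syllable is hypothetical. Precomposed syllables only cover the LV and LVT cases, while the more general (L+)(V+)(T*) pattern applies to sequences of conjoining jamos.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total

def decompose_syllable(ch):
    """Split one precomposed syllable into L, V and optional T jamos."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch  # not a precomposed Hangul syllable; return unchanged
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail

# The result should agree with canonical decomposition (NFD).
print(decompose_syllable("한") == unicodedata.normalize("NFD", "한"))
```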

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-24 Thread Peter Kirk
On 24/11/2003 03:29, Kent Karlsson wrote: ... I wonder why Hangul would need compression over and above any other alphabetic script... It has already quite a lot of compression in the form of precomposed syllables. I think we better start a project for allocating precomposed "syllables" for many

RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-24 Thread Kent Karlsson
... > >> Of course, no compression format applied to jamos could > >> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per > >> syllable. I wonder why Hangul would need compression over and above any other alphabetic script... It has already quite a lot of compression in the form of p

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-23 Thread Doug Ewell
Mark Davis wrote: >> Of course, no compression format applied to jamos could >> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per >> syllable. > > This needs a bit of qualification. An arithmetic compression would do > better, for example, or even just a compression that took the m
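
[Editor's sketch] The "2 bytes per syllable" figure can be checked by comparing the UTF-16 size of a word stored as precomposed syllables (NFC) with the same word stored as conjoining jamos (NFD). The sample word and the helper name utf16_bytes are illustrative assumptions, and this covers plain UTF-16 only; it says nothing about how an arithmetic coder or another compressor would fare on jamos.

```python
import unicodedata

def utf16_bytes(text):
    """Size of the text in UTF-16 code units, in bytes, without a BOM."""
    return len(text.encode("utf-16-le"))

word = "한국어"
syllables = unicodedata.normalize("NFC", word)  # 3 precomposed syllables
jamos = unicodedata.normalize("NFD", word)      # 8 conjoining jamos

print(len(syllables), utf16_bytes(syllables))
print(len(jamos), utf16_bytes(jamos))
```

Each precomposed syllable takes one 16-bit code unit, while the decomposed form needs one code unit per jamo, so two or three per syllable.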

Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-23 Thread Mark Davis
ode Mailing List" <[EMAIL PROTECTED]> Cc: "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]> Sent: Sat, 2003 Nov 22 22:53 Subject: Korean compression (was: Re: Ternary search trees for Unicode dictionaries) > Jungshik Shin wrote:

Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-11-23 Thread Doug Ewell
Jungshik Shin wrote: >> The file they used, called "arirang.txt," contains over 3.3 million >> Unicode characters and was apparently once part of their "Florida >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for >> reasons not known to me. I can supply it if you're interested. >