Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
- Original Message - From: "Frank Yung-Fong Tang" <[EMAIL PROTECTED]>

> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR
>
> What is the size of gzip these? Just wondering:
> gzip of UTF-16
> gzip of UTF-8
> gzip of SCSU
> gzip of BOCU-1
> gzip of Legacy encoding

Based on the principles that underlie the gzip encoding, and on the fact that the UTF-8 encoding has many three-byte combinations while UTF-16, SCSU, BOCU-1, and the legacy encoding use two-byte combinations for the same characters, I expect that the *relative* sizes of the gzipped results will (within ignorable fluctuation) approximately track the relative sizes of the unzipped versions, with perhaps an extra penalty for UTF-8 because the 24-bit combinations interact worse with the gzip architecture than the 16-bit combinations do. But that's speculation.

From the work of Atkins et al. as reported by Doug Ewell, I would further expect that BW-type compression would give (practically) indistinguishable results for all five cases, as BW has been shown to be particularly insensitive to the encoding form, unlike Huffman or gzip, which work best with true 8-bit symbols.

A./
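The speculation above is easy to test on a small scale. Here is a minimal sketch, assuming Python with its standard zlib module, and using a short repeated Hangul sample rather than the actual arirang.txt corpus, that DEFLATE-compresses the same text in two encoding forms and prints the raw and compressed sizes:

```python
import zlib

# Assumption: a short, repetitive Hangul sample stands in for the real
# corpus; the absolute numbers will differ, the point is the relative sizes.
sample = ("\uc544\ub9ac\ub791 \uc544\ub9ac\ub791 \uc544\ub77c\ub9ac\uc694 "
          "\uc544\ub9ac\ub791 \uace0\uac1c\ub85c \ub118\uc5b4\uac04\ub2e4. ") * 200

for name, codec in [("UTF-8", "utf-8"), ("UTF-16", "utf-16-le")]:
    raw = sample.encode(codec)
    packed = zlib.compress(raw, 9)   # DEFLATE, the same algorithm gzip uses
    print(f"{name:6}  raw={len(raw):6}  deflate={len(packed):6}")
```

Whether the UTF-8 penalty survives compression on real text is exactly the open question in the thread; this only shows how one would measure it.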
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Frank Yung-Fong Tang wrote:

>> UTF-16                6,634,430 bytes
>> UTF-8                 7,637,601 bytes
>> SCSU                  6,414,319 bytes
>> BOCU-1                5,897,258 bytes
>> Legacy encoding (*)   5,477,432 bytes
>> (*) KS C 5601, KS X 1001, or EUC-KR
>
> What is the size of gzip these? Just wondering:
> gzip of UTF-16
> gzip of UTF-8
> gzip of SCSU
> gzip of BOCU-1
> gzip of Legacy encoding

I don't have gzip, but I can give you the PKZip sizes, which should be quite similar:

UTF-16   2,685,232 bytes
UTF-8    2,774,356 bytes
SCSU     2,756,470 bytes
BOCU-1   2,772,418 bytes
EUC-KR   2,518,201 bytes

Note that the largest of these is only 10.2% larger than the smallest.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
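The 10.2% figure can be checked directly from the numbers above. A trivial sketch (the sizes are just the PKZip results quoted in the message):

```python
# PKZip sizes quoted above, in bytes.
sizes = {
    "UTF-16": 2_685_232,
    "UTF-8":  2_774_356,
    "SCSU":   2_756_470,
    "BOCU-1": 2_772_418,
    "EUC-KR": 2_518_201,
}

# Largest relative to smallest: (max / min) - 1.
spread = max(sizes.values()) / min(sizes.values()) - 1
print(f"{spread:.1%}")  # prints 10.2%
```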
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Someone else originated that list.

Mark
__
http://www.macchiato.com

- Original Message - From: "Frank Yung-Fong Tang" <[EMAIL PROTECTED]> To: "Mark Davis" <[EMAIL PROTECTED]> Cc: "Doug Ewell" <[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>; "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]> Sent: Tue, 2003 Dec 02 15:03 Subject: Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

Mark Davis wrote:

> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR

What is the size of gzip these? Just wondering:
gzip of UTF-16
gzip of UTF-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding

--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM: yungfongta   mailto:[EMAIL PROTECTED]   Tel: 650-937-2913
Yahoo! Msg: frankyungfongtan
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Mark Davis wrote:

> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR

What is the size of gzip these? Just wondering:
gzip of UTF-16
gzip of UTF-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding

--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM: yungfongta   mailto:[EMAIL PROTECTED]   Tel: 650-937-2913
Yahoo! Msg: frankyungfongtan
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Philippe Verdy wrote:

> The question of Latin letters with two diacritics added in Latin
> Extended-B does not seem to respect this constraint, as it is not
> justified in the Vietnamese VISCII standard, which already does not
> contain characters with two diacritics but composes them
> with two characters in the limited CCS set.

Not true. If you like, I can send you a copy of the VISCII report showing not only the mappings but also their justification. The Viet-Std organization went to great lengths to avoid combining characters, even, as John said, to the point of encoding six graphic characters in the C-zero control area.

Perhaps you are thinking of Windows code page 1258, which includes many precomposed letters, but none in the Latin Extended-B block, and does require combining marks for vowels with two diacritics.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Philippe Verdy scripsit:

> The question of Latin letters with two diacritics added in Latin Extended-B
> does not seem to respect this constraint, as it is not justified in the
> Vietnamese VISCII standard, which already does not contain characters with two
> diacritics but composes them with two characters in the limited CCS set.

I'm not sure what standard you are referring to. There are three standards for Vietnamese text: VISCII 1.1 (de facto), TCVN 5712-1 (aka VSCII-1), and TCVN 5712-2 (aka VSCII-2). VISCII provides no combining characters, fills the C1 space with graphics, and even replaces certain C0 characters with graphics. 5712-1 provides combining characters and fills the C1 space with graphics. 5712-2 provides combining characters and leaves both C0 and C1 clear of graphics (and so is ISO 2022-compatible). But all of them provide at least some characters with double diacritics.

> I don't know why even ISO 10646 would have needed them, unless there's some
> Vietnamese DBCS standard that allows representing in a 94x94 matrix all
> letters with two diacritics as well as Han ideographs used in Vietnamese.

I very much doubt that any such encoding ever existed.

--
What is the sound of Perl? Is it not the sound of a [Ww]all that people have stopped banging their head against?  --Larry

John Cowan
[EMAIL PROTECTED]
http://www.ccil.org/~cowan
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
John Cowan writes:

> > You are, because the floodgates, while once open, have been closed by
> > normalization.
>
> Indeed, they were opened in Unicode 1.1, as a result of the merger with
> FDIS 10646; since then, only 46 characters with canonical decompositions
> have been added to Unicode (excepting compatibility ideographs, which
> are a special case).

In fact ISO 10646 is meant to allow an easy one-to-one mapping between existing standard coded character sets (CCS) and unified code points. Accepting precomposed characters is then a necessity when precomposed characters exist in a legacy CCS standard. But they are included only for compatibility (exactly as with compatibility ideographs).

The question of Latin letters with two diacritics added in Latin Extended-B does not seem to respect this constraint, as it is not justified in the Vietnamese VISCII standard, which already does not contain characters with two diacritics but composes them with two characters in the limited CCS set.

I don't know why even ISO 10646 would have needed them, unless there's some Vietnamese DBCS standard that allows representing in a 94x94 matrix all letters with two diacritics as well as Han ideographs used in Vietnamese. I looked within the IBM database of charsets (CCS+CES) and could not find a reference to such an EUC-style DBCS. So was it because there was an ongoing/unfinished DBCS standard for Vietnamese, working like GBK, SJIS, or KS C 5601?

__
<< ella for Spam Control >> has removed Spam messages and set aside Newsletters for me
You can use it too - and it's FREE! http://www.ellaforspam.com
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
At 08:23 -0500 2003-11-25, John Cowan wrote:

> Michael Everson scripsit:
>
> > Ridiculous. This happened centuries ago, and it is not "why" Ethiopic
> > was encoded as a syllabary. It was encoded as a syllabary because it
> > is a syllabary.
>
> Structurally it's an abugida, like Indic and UCAS.

I disagree. And I don't think Canadian Syllabics are an abugida. But let's leave this one alone, shall we?
--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Michael Everson scripsit:

> Ridiculous. This happened centuries ago, and it is not "why" Ethiopic
> was encoded as a syllabary. It was encoded as a syllabary because it
> is a syllabary.

Structurally it's an abugida, like Indic and UCAS.

> You are, because the floodgates, while once open, have been closed by
> normalization.

Indeed, they were opened in Unicode 1.1, as a result of the merger with FDIS 10646; since then, only 46 characters with canonical decompositions have been added to Unicode (excepting compatibility ideographs, which are a special case). Specifically, 16 were added in Unicode 2.0, 29 in Unicode 3.0, and just one in Unicode 3.2 (the slashed version of a symbol added at the same time).

--
"What has four pairs of pants, lives in Philadelphia, and it never rains but it pours?"  --Rufus T. Firefly

John Cowan
http://www.reutershealth.com
[EMAIL PROTECTED]
http://www.ccil.org/~cowan
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 25/11/2003 03:54, Michael Everson wrote:

> At 03:41 -0800 2003-11-25, Peter Kirk wrote:
>
> > ... But the floodgates have already been opened - not just Ethiopic but
> > Greek Extended, much of Latin Extended, the Korean syllables which
> > started this discussion, the small amount of precomposed Hebrew which we
> > already have, etc. People have tried to force them shut, and with good
> > reason. But don't accuse me of starting something new.
>
> You are, because the floodgates, while once open, have been closed by
> normalization.

Read what I wrote before:

> This approach would certainly have simplified pointed Hebrew a lot, ...
> But I guess it is too late for a change now!

I recognised clearly that it is too late to make this change now, although it might have been a good thing to do when the floodgates were open (although as Mark pointed out it would not necessarily have made things easier). I don't want to reopen them.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
At 03:41 -0800 2003-11-25, Peter Kirk wrote:

> After all, Ethiopic was encoded as a syllabary just because the vowel
> points happen to have become attached to the base characters.

Ridiculous. This happened centuries ago, and it is not "why" Ethiopic was encoded as a syllabary. It was encoded as a syllabary because it is a syllabary.

> But the floodgates have already been opened - not just Ethiopic but Greek
> Extended, much of Latin Extended, the Korean syllables which started this
> discussion, the small amount of precomposed Hebrew which we already have,
> etc. People have tried to force them shut, and with good reason. But
> don't accuse me of starting something new.

You are, because the floodgates, while once open, have been closed by normalization.
--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 24/11/2003 17:56, Christopher John Fynn wrote:

> "Peter Kirk" <[EMAIL PROTECTED]> wrote:
>
> > This approach would certainly have simplified pointed Hebrew a lot, so
> > much so that it could well be serious. After all, Ethiopic was encoded
> > as a syllabary just because the vowel points happen to have become
> > attached to the base characters. And we already have some precomposed
> > Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late
> > for a change now!
>
> Please don't even think of it - acceptance of any proposal for
> precomposed characters for one script would open the floodgates
> for similar proposals for other scripts.
>
> -- Christopher J. Fynn

But the floodgates have already been opened - not just Ethiopic but Greek Extended, much of Latin Extended, the Korean syllables which started this discussion, the small amount of precomposed Hebrew which we already have, etc. People have tried to force them shut, and with good reason. But don't accuse me of starting something new.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Christopher John Fynn wrote:

> "Peter Kirk" <[EMAIL PROTECTED]> wrote:
>
> > This approach would certainly have simplified pointed Hebrew a lot, so
> > much so that it could well be serious. After all, Ethiopic was encoded
> > as a syllabary just because the vowel points happen to have become
> > attached to the base characters. And we already have some precomposed
> > Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late
> > for a change now!
>
> Please don't even think of it - acceptance of any proposal for
> precomposed characters for one script would open the floodgates
> for similar proposals for other scripts.

Isn't that what happened to the Latin script, with floods of precomposed characters, notably letters with double accents that were not necessary to support and map correctly the VISCII standard?

The floodgates are already open, but the composition stability policy would require that most additional precomposed characters be excluded from normalized composition forms. Since the introduction of precomposed but excluded characters would not occur in any normalized text, it would be justified only to support a bijective mapping with another standard that allows a distinction between composed and decomposed characters...
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 11/24/03 20:56, Christopher John Fynn wrote:

> "Peter Kirk" <[EMAIL PROTECTED]> wrote:
>
> > This approach would certainly have simplified pointed Hebrew a lot, so
> > much so that it could well be serious. After all, Ethiopic was encoded
> > as a syllabary just because the vowel points happen to have become
> > attached to the base characters. And we already have some precomposed
> > Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late
> > for a change now!
>
> Please don't even think of it - acceptance of any proposal for
> precomposed characters for one script would open the floodgates
> for similar proposals for other scripts.

I really don't think this is a good model for Hebrew anyway. Besides, if you think the weird exceptions of Biblical typesetting are a pain with the current consonant+vowel model, imagine what a nightmare they'd be with precomposed syllables.

~mark
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
"Peter Kirk" <[EMAIL PROTECTED]> wrote:

> This approach would certainly have simplified pointed Hebrew a lot, so
> much so that it could well be serious. After all, Ethiopic was encoded
> as a syllabary just because the vowel points happen to have become
> attached to the base characters. And we already have some precomposed
> Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late for
> a change now!

Please don't even think of it - acceptance of any proposal for precomposed characters for one script would open the floodgates for similar proposals for other scripts.

-- Christopher J. Fynn
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Peter Kirk scripsit:

> This approach would certainly have simplified pointed Hebrew a lot, so
> much so that it could well be serious.

There are an awful lot of possibilities, and it's not clear that spinning them out a la Hangul really makes sense.

> After all, Ethiopic was encoded
> as a syllabary just because the vowel points happen to have become
> attached to the base characters.

Well, more because Ethiopic-script users think of the letters as part of a syllabary, though historically it's an abugida. The original design for Unicode Ethiopic used an alphabetic representation -- someone else can probably tell you more about the nitty-gritty of why it was rejected.

> And we already have some precomposed
> Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late for
> a change now!

It certainly is.

--
Go, and never darken my towels again!  --Rufus T. Firefly

John Cowan
www.ccil.org/~cowan
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Kent Karlsson wrote:

> Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV.

Sorry. I use the acronym TLV, which in French means "Type, Longueur, Valeur" (and is completely unrelated to Unicode or Hangul syllable types), so often that it gets confused with the English LVT for "Leading consonant, Vowel, Trailing consonant". Acronyms are so misleading...
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
On 24/11/2003 03:29, Kent Karlsson wrote:

> ... I wonder why Hangul would need compression over and above any other
> alphabetic script... It has already quite a lot of compression in the
> form of precomposed syllables. I think we better start a project for
> allocating precomposed "syllables" for many other scripts, ...
>
> No, this was not serious ;-)
>
> /kent k

This approach would certainly have simplified pointed Hebrew a lot, so much so that it could well be serious. After all, Ethiopic was encoded as a syllabary just because the vowel points happen to have become attached to the base characters. And we already have some precomposed Hebrew syllables, FB1D, FB1F, FB2E, FB2F. But I guess it is too late for a change now!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
...
> >> Of course, no compression format applied to jamos could
> >> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> >> syllable.

I wonder why Hangul would need compression over and above any other alphabetic script... It already has quite a lot of compression in the form of precomposed syllables.

I think we had better start a project for allocating precomposed "syllables" for many other scripts: precomposed Latin script syllables, precomposed Greek script syllables, precomposed Tamil script syllables (most of the Brahmi-derived scripts are especially disadvantaged, from a 'compression' viewpoint, by the virama characters), etc. That should take up much of the excess space in the unused planes (3-13, decimal). Unfortunately that means 4 bytes per non-Hangul syllable (before byte-oriented compression is done), but that could be compensated for by using an SCSU-like approach, just with bigger windows.

No, this was not serious ;-)

/kent k

PS: Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV.
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Mark Davis wrote:

> > Of course, no compression format applied to jamos could
> > even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> > syllable.
>
> This needs a bit of qualification. An arithmetic compression would do
> better, for example, or even just a compression that took the most
> frequent jamo sequences. Perhaps the above is better phrased as 'no
> simple byte-level compression format...'.

Yes, that's what I meant: a compression *format* like SCSU or BOCU-1, as opposed to a (general-purpose) compression *algorithm* like Huffman or LZ or arithmetic coding. The distinction makes sense in the context of my paper, but I probably should have explained it here.

BTW, the paper is awaiting final comments from one last reviewer.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
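The 2-bytes-per-syllable baseline that Mark and Doug are discussing is easy to see with Python's unicodedata module. A sketch, where the three-syllable sample ("Korean language" in Hangul) is just an illustration:

```python
import unicodedata

text = "\ud55c\uad6d\uc5b4"          # three precomposed Hangul syllables (NFC)
nfd = unicodedata.normalize("NFD", text)

# NFC: one 16-bit code unit (2 bytes) per syllable.
# NFD: 2-3 conjoining jamo (4-6 bytes) per syllable.
print(len(text), len(text.encode("utf-16-le")))  # 3 code units, 6 bytes
print(len(nfd), len(nfd.encode("utf-16-le")))    # 8 code units, 16 bytes
```

This is why a compression format working on jamo has to beat roughly 5 bytes per syllable of raw NFD just to match uncompressed UTF-16 NFC.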
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
> Of course, no compression format applied to jamos could
> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
> syllable.

This needs a bit of qualification. An arithmetic compression would do better, for example, or even just a compression that took the most frequent jamo sequences. Perhaps the above is better phrased as 'no simple byte-level compression format...'.

Mark
__
http://www.macchiato.com

- Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Cc: "Jungshik Shin" <[EMAIL PROTECTED]>; "John Cowan" <[EMAIL PROTECTED]> Sent: Sat, 2003 Nov 22 22:53 Subject: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

> Jungshik Shin wrote:
>
> >> The file they used, called "arirang.txt," contains over 3.3 million
> >> Unicode characters and was apparently once part of their "Florida
> >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for
> >> reasons not known to me. I can supply it if you're interested.
> >
> > It'd be great if you could.
>
> Try
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
> first. If that doesn't work, I'll send you a copy. It's over 5
> megabytes, so I'd like to avoid that if possible.
>
> >> The statistics on this file are as follows:
> >>
> >> UTF-16                6,634,430 bytes
> >> UTF-8                 7,637,601 bytes
> >> SCSU                  6,414,319 bytes
> >> BOCU-1                5,897,258 bytes
> >> Legacy encoding (*)   5,477,432 bytes
> >> (*) KS C 5601, KS X 1001, or EUC-KR
> >
> > Sorry to pick on this (when I have to thank you). Even with the
> > coded character set vs. character encoding scheme distinction aside
> > (that is, if we just think in terms of character repertoire), KS C 5601/
> > KS X 1001 _alone_ cannot represent any Korean text unless you're
> > willing to live with double-width space, Latin letters, numbers and
> > punctuation (since you wrote that the file apparently has full stops and
> > spaces in ASCII, it does include characters outside KS X 1001). On the
> > other hand, EUC-KR (KS X 1001 + ISO 646:KR/US-ASCII) can. Actually, I
> > suspect the legacy encoding used was Windows codepage 949 (or JOHAB/
> > Windows-1361?) because I can't imagine there is not a single syllable
> > (that is, outside the character repertoire of KS X 1001) out of over 2
> > million syllables.
>
> Sorry, I should have noticed on Atkin and Stansifer's data page
> (http://www.cs.fit.edu/~ryan/compress/) that the file is in EUC-KR. All
> I knew was that I was able to import it into SC UniPad using the option
> marked "KS C 5601 / KS X 1001, EUC-KR (Korean)".
>
> >> I used my own SCSU encoder to achieve these results, but it really
> >> wouldn't matter which was chosen -- Korean syllables can be encoded
> >> in SCSU *only* by using Unicode mode. It's not possible to set a
> >> window to the Korean syllable range.
> >
> > Now that you told me you used NFC, isn't this condition similar to
> > Chinese text? How do BOCU and SCSU work for Chinese text? Japanese
> > text might do slightly better with Kana, but isn't likely to be much
> > better.
>
> Well, *I* didn't use NFC for anything. That's just how the file came to
> me. And yes, the situation is exactly the same for Chinese text, except
> I suppose that with 20,000-some basic Unihan characters, plus Extension
> A and B, plus the compatibility guys starting at U+F900, one might not
> realistically expect any better than 16 bits per character. OTOH, when
> dealing with 11,171 Hangul syllables interspersed with Basic Latin, I
> imagine there is some room for improvement over UTF-16.
>
> I'm intrigued by the improved performance of BOCU-1 on Korean text, and
> I'm now interested in finding a way to achieve even better compression
> of Hangul syllables, using a strategy *not* much more complex than SCSU
> or BOCU and *not* involving huge reordering tables. Your assistance,
> and anyone else's, would be welcome. Googling for "Korean compression"
> or "Hang[e]ul compression" turns up practically nothing, so there is a
> chance to break some new ground here.
>
> John Cowan responded to Jungshik's comment about Kana:
>
> > The SCSU paper claims that Japanese does *much* better in SCSU than
> > UTF-16, thanks to the kana.
>
> The example in Section 9.3 would appear to substantiate that claim, as
> 116 Unicode characters (= 232 bytes of UTF-16) are compressed to 178
> bytes of SCSU.
>
> Back to Jungshik:
>
> >> Only the large number of spaces and full
> >> stops in this file prevented SCSU from degenerating entirely to 2
> >> bytes per character.
> >
> > That's why I asked. What I'm curious about is how SCSU and BOCU
> > of NFD (and what I and Kent [2] think should have been NFD, with the
> > possible code point rearrangement of the Jamo block to facilitate a
> > smaller window size for SCSU) would compare with uncompressed UTF-16
> > of NFC (SCSU/BOCU isn't much better than UTF-16). The b