Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

Asmus Freytag Wed, 03 Dec 2003 04:44:20 -0800

----- Original Message -----
From: "Frank Yung-Fong Tang" <[EMAIL PROTECTED]>


 > > >> UTF-16    6,634,430 bytes
 > > >> UTF-8    7,637,601 bytes
 > > >> SCSU    6,414,319 bytes
 > > >> BOCU-1    5,897,258 bytes
 > > >> Legacy encoding (*)    5,477,432 bytes
 > > >>     (*) KS C 5601, KS X 1001, or EUC-KR

What are the sizes of these after gzip? Just wondering:
gzip of UTF-16
gzip of UTF-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding

Based on the principles that underlie the gzip encoding, and on the fact that the UTF-8 encoding has many three-byte combinations, while UTF-16 / SCSU / BOCU-1 / Legacy have two-byte combinations for the same characters, I expect that the *relative* sizes of the gzipped results will (within ignorable fluctuation) approximately track the relative sizes of the un-zipped versions, with perhaps an extra penalty for UTF-8 because its 24-bit combinations interact worse with the gzip architecture than 16-bit combinations do. But that's speculation.
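To make "relative size" concrete, here is a small sketch that normalizes the uncompressed figures quoted above against the legacy encoding. The byte counts are copied from the quoted table; the function name relative_sizes and the choice of the legacy encoding as baseline are just my assumptions for illustration. If the guess above holds, the gzipped sizes should show roughly the same ratios.

```python
# Uncompressed sizes in bytes, copied from the figures quoted earlier in this message.
sizes = {
    "UTF-16": 6_634_430,
    "UTF-8": 7_637_601,
    "SCSU": 6_414_319,
    "BOCU-1": 5_897_258,
    "Legacy": 5_477_432,
}


def relative_sizes(table, baseline="Legacy"):
    """Return each size divided by the baseline entry's size.

    The baseline choice is an illustrative assumption, not part of the
    original measurements.
    """
    base = table[baseline]
    return {name: size / base for name, size in table.items()}


ratios = relative_sizes(sizes)

# Print smallest to largest, relative to the legacy encoding.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {ratio:.3f}")
```

Running the same normalization on measured gzip output sizes would show directly whether the ordering and spread are preserved.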

From the work of Atkins et al. as reported by Doug Ewell, I would further expect that BW-type (Burrows-Wheeler) compression would give practically indistinguishable results for all five cases, as BW has been shown to be particularly insensitive to encoding form, unlike Huffman coding or gzip, which work best with true 8-bit symbols.
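Both expectations are easy to check empirically. The following is a minimal sketch, not a measurement: it encodes one text in several encodings and compresses each with gzip and with bz2 (a Burrows-Wheeler based compressor) from the Python standard library. The SAMPLE string is a hypothetical, highly repetitive stand-in and is NOT the corpus discussed in this thread, so its numbers say nothing about the real data. Python's standard library has no SCSU or BOCU-1 codec, so those two cases are left out here.

```python
import bz2
import gzip

# Hypothetical stand-in text; replace with the real corpus for a meaningful test.
SAMPLE = "대한민국 서울에서 한국어 문서를 압축합니다. " * 2000

# Python codec names; SCSU and BOCU-1 are omitted because the standard
# library does not provide codecs for them.
ENCODINGS = ["utf-16-le", "utf-8", "euc_kr"]


def compressed_sizes(text, encodings=ENCODINGS):
    """Return raw, gzip and bz2 byte counts for text in each encoding."""
    rows = {}
    for enc in encodings:
        raw = text.encode(enc)
        rows[enc] = {
            "raw": len(raw),
            "gzip": len(gzip.compress(raw, compresslevel=9)),
            "bz2": len(bz2.compress(raw, compresslevel=9)),
        }
    return rows


if __name__ == "__main__":
    for enc, row in compressed_sizes(SAMPLE).items():
        print(f"{enc:10s} raw={row['raw']:>9,d} "
              f"gzip={row['gzip']:>9,d} bz2={row['bz2']:>9,d}")
```

To test against real data, read the corpus into a string and pass it to compressed_sizes in place of SAMPLE. If the expectation above is right, the bz2 column should be nearly flat across encodings, while the gzip column should vary more.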

A./
