In a message dated 2001-07-13 7:00:26 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  Sounds promising!  How well does SCSU gzip?

If gzip works anything like PKZIP, the answer is: very well indeed.  This is 
because (to use an explanation I have heard before) SCSU retargets Unicode 
text to an 8-bit model, meaning that for small alphabetic scripts (or 
medium-sized syllabic scripts like Ethiopic) most characters are represented 
in one byte, so the information arrives 8 bits at a time.  Many 
general-purpose compression algorithms are optimized for exactly that kind of 
data.
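
To make the 8-bit model concrete, here is a rough Python sketch (a toy, not a 
conforming SCSU encoder) of the single-byte window mechanism.  It assumes 
SCSU's default dynamic window 2, whose base is U+0400 (Cyrillic): after a 
single window-selection tag, each Cyrillic letter costs one byte, where UTF-8 
needs two.

    # Toy illustration of SCSU's single-byte window mode (not a full,
    # conforming encoder: no quoting, no Unicode mode, no window redefinition).
    SC2 = 0x12            # tag: switch to dynamic window 2
    WINDOW_BASE = 0x0400  # default position of dynamic window 2 (Cyrillic)

    def toy_scsu_cyrillic(text: str) -> bytes:
        out = bytearray([SC2])                # select the Cyrillic window once
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:                     # ASCII passes through as-is
                out.append(cp)
            elif WINDOW_BASE <= cp <= WINDOW_BASE + 0x7F:
                out.append(0x80 + (cp - WINDOW_BASE))  # one byte per letter
            else:
                raise ValueError("outside this toy's single window")
        return bytes(out)

    sample = "Москва"
    print(len(toy_scsu_cyrillic(sample)))     # 7 bytes (1 tag + 6 letters)
    print(len(sample.encode("utf-8")))        # 12 bytes (2 per letter)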

Recently I created a test file of every Unicode code point in order 
(excluding the surrogates, but including the noncharacters).  I will admit 
up front that this is a pathological test case and real-world data probably 
won't behave anywhere near the same way.  Having said that, here are the 
file sizes of this data expressed in UTF-8 and SCSU, raw and zipped (using 
PKZIP 4.0):

    Raw UTF-8       4,382,592
    Zipped UTF-8    2,264,152  (52% of raw UTF-8)
    Raw SCSU        1,179,688  (27% of raw UTF-8)
    Zipped SCSU       104,316  (9% of raw SCSU, < 5% of zipped UTF-8)

So PKZIP compressed this particular (non-real-world) UTF-8 data by only 48%, 
but compressed the equivalent SCSU data by a whopping 91%.  That's because 
SCSU puts the data in an 8-bit model, which brings out the best in PKZIP.  
Gzip may work the same.

Note that real-world data would probably be much more useful for making this 
comparison than my sequential-order data, which certainly favors SCSU as it 
minimizes window switches and creates repetitive patterns.  Also note that 
SCSU encoders differ, and the same data encoded with a different compressor 
might yield more or fewer than 1,179,688 bytes.  I used my own compressor.
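
For anyone who wants to try the gzip side of this at home, here is a rough 
sketch in Python (using the standard gzip module; exact sizes will differ 
somewhat from the PKZIP 4.0 figures above).  It rebuilds the same kind of 
sequential test file in UTF-8; the SCSU step needs a separate SCSU 
compressor, so it is not shown.

    # Sketch of the UTF-8 side of the experiment: every code point in order,
    # skipping the surrogates (U+D800..U+DFFF), which UTF-8 cannot encode.
    import gzip

    text = "".join(chr(cp) for cp in range(0x110000)
                   if not 0xD800 <= cp <= 0xDFFF)
    raw_utf8 = text.encode("utf-8")           # 4,382,592 bytes
    zipped = gzip.compress(raw_utf8)

    print("Raw UTF-8:     ", len(raw_utf8))
    print("Gzipped UTF-8: ", len(zipped),
          "({:.0%} of raw)".format(len(zipped) / len(raw_utf8)))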

-Doug Ewell
 Fullerton, California
