RE: Compression through normalization

Philippe Verdy Tue, 25 Nov 2003 16:34:52 -0800

Doug Ewell writes:
> * Philippe Verdy and Jill Ramonsky say YES, a compressor can
> normalize, because it knows it is operating on Unicode character data
> and can take advantage of Unicode properties.


I say YES only for compressors that are supposed to work on Unicode text
(this applies to BOCU-1 and SCSU, which are not intended to compress anything
other than Unicode text), but NO of course for general-purpose compressors
(like deflate in zip files).
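
To illustrate the distinction, here is a minimal Python sketch. It uses zlib
as a stand-in for a general-purpose compressor and unicodedata.normalize as a
stand-in for what a Unicode-aware compressor might do; neither SCSU nor
BOCU-1 is implemented here, and the variable names are only illustrative.

```python
import unicodedata
import zlib

# "e" followed by U+0301 COMBINING ACUTE ACCENT (a decomposed "é").
decomposed = "e\u0301"
original_bytes = decomposed.encode("utf-8")

# General-purpose compressor: the round trip must give back identical bytes.
roundtrip_bytes = zlib.decompress(zlib.compress(original_bytes))

# Hypothetical Unicode-aware step: normalize to NFC before encoding.
normalized = unicodedata.normalize("NFC", decomposed)
normalized_bytes = normalized.encode("utf-8")

# The bytes now differ, but the two strings remain canonically equivalent:
# both reduce to the same NFD sequence.
still_equivalent = (
    unicodedata.normalize("NFD", normalized)
    == unicodedata.normalize("NFD", decomposed)
)
```

The zlib round trip preserves every byte, while the normalizing step changes
the byte sequence and preserves only canonical equivalence of the text.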

I will say NO for encoding forms that are normally built to be directly
parsable code point by code point, in any direction and from random locations
in strings. So a UTF encoding scheme is not supposed to change the
normalization form.
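
As a small sketch of this point, the Python below round-trips a decomposed
string through three UTF encoding schemes and checks that the code point
sequence is untouched. The helper code_point_at is hypothetical and only shows
random access in UTF-32, where each code point occupies four bytes.

```python
# Decomposed text: "e" + U+0301, followed by plain ASCII letters.
text = "e\u0301tude"
expected = [ord(c) for c in text]

# Each UTF round trip must return exactly the same code points,
# so the normalization form of the text cannot change.
results = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    decoded = text.encode(codec).decode(codec)
    results[codec] = [ord(c) for c in decoded]

utf32 = text.encode("utf-32-le")

def code_point_at(data: bytes, index: int) -> int:
    """Read the code point at a given index directly from UTF-32LE bytes."""
    return int.from_bytes(data[4 * index:4 * index + 4], "little")

# Random access: the combining accent is still at index 1.
accent = code_point_at(utf32, 1)
```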

> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
> compressor really knows about is the byte stream, so it must be
> preserved byte-for-byte.

But SCSU and BOCU-1 do not operate at the byte stream level: they are not
valid on arbitrary streams of bytes, and are defined only in terms of streams
of code units. That's why I would not call SCSU and BOCU-1 compressors, but
rather encoding schemes (CES in the ISO 10646 terminology).

In fact, the BOCU-1 and SCSU encoding schemes each produce a file which has
its own charset (i.e. CCS+CES in the ISO terminology), and which can thus
also have its own label for MIME usage or in XML charset declarations.
This is not a limitation, as true compressors can still be applied if needed
on top of this encoding scheme, or transparently within transport layers (such
as the "Content-Transfer-Encoding:" in MIME and HTTP applications).
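
The layering can be sketched roughly as follows. This is only an illustration
under assumptions: the Python standard library ships no SCSU or BOCU-1 codec,
so "utf-8" stands in for the labelled charset, gzip stands in for the
transport-layer compressor, and the header names and the pack/unpack helpers
are made up for the example.

```python
import gzip

def pack(text: str, charset: str):
    """Encode with the labelled charset, then compress in a separate layer."""
    encoded = text.encode(charset)      # encoding scheme (charset label)
    body = gzip.compress(encoded)       # true compressor, applied on top
    headers = {"charset": charset, "transfer-layer": "gzip"}  # hypothetical names
    return headers, body

def unpack(headers: dict, body: bytes) -> str:
    """Undo the transport layer first, then decode using the charset label."""
    if headers["transfer-layer"] == "gzip":
        body = gzip.decompress(body)
    return body.decode(headers["charset"])

headers, body = pack("Compression through normalization", "utf-8")
restored = unpack(headers, body)
```

The receiver only needs the two labels to undo each layer in turn; the
compressor never has to know which charset it is carrying.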

> * I'm still not sure, but I'm leaning toward NO.



