Re: Compression through normalization

Doug Ewell Tue, 25 Nov 2003 17:03:54 -0800

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> I say YES only for compressors that are supposed to work on Unicode
> text (this applies to BOCU-1 and SCSU, which are not intended to
> compress anything other than Unicode text), but NO of course for
> general-purpose compressors (like deflate in zip files).


Of course.

> I will say NO for encoding forms that are normally built to be
> directly parsable code point by code point in any direction and from
> random locations in strings. So a UTF encoding scheme is not supposed
> to change the normalization form.

Of course not.  Or so I would imagine, anyway.  After all, if a process
(see Peter Kirk's question) that compresses Unicode text can silently
change the normalization form, then why not a process that stores and
retrieves Unicode text using, say, UTF-8?  But that sounds wrong to me,
although it's what C10 says.

>> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
>> compressor really knows about is the byte stream, so it must be
>> preserved byte-for-byte.
>
> But SCSU and BOCU-1 do not operate in the byte stream level, as their
> use is invalid on random streams of bytes, but only defined in terms
> of streams of code units...

That's right.  I tend to agree with the NO camp not because SCSU and
BOCU-1 are going to be applied to arbitrary binary data, but because the
*format* in which text is stored isn't normally expected to change the
contents.

Converting Unicode text from UTF-16LE to UTF-16BE, or UTF-16 to UTF-8,
changes the bits.  Everyone can see that.  But the *code points*
represented by those bits are not changed.  If the UTF-16BE sequence <00
61 03 01> (U+0061 U+0301) were converted to the UTF-8 sequence <C3 A1>
(U+00E1), that would be a change not only in the bits, but in the code
points as well.  This is where the question lies.
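
For anyone who wants to check the distinction, here is a minimal sketch
in Python.  It assumes only the standard library; the variable names are
mine and purely illustrative.  It transcodes the UTF-16BE bytes above to
UTF-8 twice, once plainly and once with an NFC normalization step slipped
in, and compares the resulting code points.

```python
import unicodedata

# UTF-16BE bytes from the example: U+0061 (a) followed by U+0301
# (combining acute accent).
utf16be = bytes([0x00, 0x61, 0x03, 0x01])

text = utf16be.decode("utf-16-be")
original_code_points = [ord(ch) for ch in text]

# Plain transcoding: the bytes change, the code points do not.
utf8_plain = text.encode("utf-8")
plain_code_points = [ord(ch) for ch in utf8_plain.decode("utf-8")]

# Transcoding with a normalization step (NFC) added: the two code
# points are composed into the single code point U+00E1.
composed = unicodedata.normalize("NFC", text)
utf8_nfc = composed.encode("utf-8")
nfc_code_points = [ord(ch) for ch in composed]

print(utf8_plain.hex())  # 61cc81
print(utf8_nfc.hex())    # c3a1
print(plain_code_points == original_code_points)  # True
print(nfc_code_points == original_code_points)    # False
```

The plain conversion gives <61 CC 81>, which decodes back to the same two
code points.  Only the version with the normalization step produces
<C3 A1>.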

> That's why I won't say that SCSU and BOCU-1 are really compressors,
> but rather really encoding schemes (CES in the ISO10646 terminology).

They are transfer encoding syntaxes (TES).  And I believe this
terminology is from Unicode, not 10646, though I could be wrong.

I would say encoders for SCSU and BOCU-1 are compressors.  They're just
not general-purpose compressors.

> In fact the result of BOCU-1 and SCSU encoding schemes can create a
> file which has its own charset (i.e. CCS+CES in the ISO terminology),
> and thus can also have its own label for MIME usage or in XML charset
> declarations. This is not a limitation, as true compressors can still
> be used if needed from this encoding scheme, or transparently within
> transport layers (such as the "Content-Transfer-Encoding:" in MIME and
> HTTP applications).

Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
general-purpose compression scheme.  Atkin and Stansifer's paper from
last year is all about that, and I spend a few pages on it in my paper
as well.  You can also re-Zip a Zip file, though, so I don't know what
that proves about the compression formats.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
