Mark said:

> The operations of compression followed by decompression can conformantly produce
> any text that is canonically equivalent to the original without purporting to
> modify the text. (How the internal compressed format is determined is completely
> arbitrary - it could NFD, compress, decompress, NFC; swap alternate bits; remap
> modern jamo and LV's to a contiguous range, BOCU-1 it; whatever). In practice,
> if a compressor does not produce codepoint-identical text, it should produce NFC
> (not just any canonically equivalent text), and should document that it does so.
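
To make that pipeline concrete, here is a minimal sketch in Python (mine, not Mark's; zlib merely stands in for whatever internal compressed format a real compressor might use, and round_trip is an illustrative name, not anyone's API). It follows the NFD, compress, decompress, NFC path and shows output that is canonically equivalent to the input without being codepoint-identical:

    # A rough illustration only: zlib stands in for the arbitrary internal
    # compressed format, and round_trip is a hypothetical name.
    import unicodedata
    import zlib

    def round_trip(text):
        """NFD, compress, decompress, then hand back NFC output."""
        nfd = unicodedata.normalize("NFD", text)
        compressed = zlib.compress(nfd.encode("utf-8"))
        decompressed = zlib.decompress(compressed).decode("utf-8")
        return unicodedata.normalize("NFC", decompressed)

    original = "Cafe\u0301"        # "Café" spelled with a combining acute accent
    result = round_trip(original)  # comes back as "Caf\u00e9", the precomposed spelling

    # Not codepoint-identical to the input...
    assert result != original
    # ...but canonically equivalent: both normalize to the same sequence.
    assert (unicodedata.normalize("NFC", original) ==
            unicodedata.normalize("NFC", result))

The final NFC step is exactly the documented recommendation above: the decompressor does not claim to reproduce the original code points, only a canonically equivalent, normalized form.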
Perhaps to help clear everyone's thinking here, it might help to further paraphrase what Mark said:

The operations of XXX followed by YYY can conformantly produce any text that is canonically equivalent to the original while purporting not to modify the interpretation of the text.

[For "operations of XXX followed by YYY" feel free to substitute anything you like. This is not about compression per se, but is the fundamental meaning of canonical equivalence. If the resultant output text is *canonically equivalent* to the original text, then the process has not modified the *interpretation* of the text. Note that I expanded Mark's formulation slightly -- his was still slightly too telegraphic. It may, on the other hand, have *changed* the text, of course. Canonical equivalents may be shorter or longer, and consist of different code point sequences.]

How the data format following operation XXX and preceding YYY is determined is completely arbitrary - it could be blargle or bleep or flassiwary; swap alternate bits; remap fleebert to whazzit; compress it; whatever.

In *practice* [note this is a recommendation, and not a conformance requirement], if a text operation produces canonically equivalent text which is not codepoint-identical, it *should* produce a normalized form of the text, and should document that it does so.

Does that help any? This really is not about compression at all -- it is about understanding what the conformance requirements of the standard are. Canonical equivalence is about not modifying the interpretation of the text. That is different from considerations about not changing the text, period.

If some process using text is sensitive to *any* change in the text whatsoever (CRC-checking, any form of digital signing, memory allocation), then, of course, *any* change to the text, including any normalization, will make a difference.

If some process using text is sensitive to the *interpretation* of the text, i.e. it is concerned about the content and meaning of the letters involved, then normalization to forms NFC or NFD, which only involve canonical equivalences, will *not* make a difference. Or to be more subtle about it, it might make a difference, but it is nonconformant to claim that a process which claims it does not make a difference is nonconformant. If you can parse that last sentence, then you are well on the way to understanding the Tao of Unicode.

--Ken
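
To make the two kinds of sensitivity concrete, a minimal sketch in Python (not part of the post above; crc32 and normalize are just convenient stand-ins for a byte-sensitive check and an interpretation-sensitive comparison). A byte-sensitive process sees the two spellings of é as different; a comparison under canonical equivalence does not:

    # A rough illustration only: crc32 stands in for any byte-sensitive check
    # (CRC, digital signature), normalize for any interpretation-sensitive one.
    import unicodedata
    import zlib

    precomposed = "\u00e9"    # é as a single code point
    decomposed = "e\u0301"    # e followed by a combining acute accent

    # Sensitive to *any* change: different code point sequences give
    # different bytes, so the check values differ.
    assert (zlib.crc32(precomposed.encode("utf-8")) !=
            zlib.crc32(decomposed.encode("utf-8")))

    # Sensitive only to *interpretation*: under canonical equivalence the two
    # spellings normalize to the same sequence, so no difference is seen.
    assert (unicodedata.normalize("NFD", precomposed) ==
            unicodedata.normalize("NFD", decomposed))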