Mark said:

> The operations of compression followed by decompression can conformantly produce
> any text that is canonically equivalent to the original without purporting to
> modify the text. (How the internal compressed format is determined is completely
> arbitrary - it could NFD, compress, decompress, NFC; swap alternate bits; remap
> modern jamo and LV's to a contiguous range, BOCU-1 it; whatever). In practice,
> if a compressor does not produce codepoint-identical text, it should produce NFC
> (not just any canonically equivalent text), and should document that it does so.
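
To make that pipeline concrete, here is a minimal sketch in Python (mine, not Mark's; zlib merely stands in for whatever internal compressed format a real compressor might use, and round_trip is an illustrative name, not anyone's API). It follows the NFD, compress, decompress, NFC path and shows output that is canonically equivalent to the input without being codepoint-identical:

    # A rough illustration only: zlib stands in for the arbitrary internal
    # compressed format, and round_trip is a hypothetical name.
    import unicodedata
    import zlib

    def round_trip(text):
        """NFD, compress, decompress, then hand back NFC output."""
        nfd = unicodedata.normalize("NFD", text)
        compressed = zlib.compress(nfd.encode("utf-8"))
        decompressed = zlib.decompress(compressed).decode("utf-8")
        return unicodedata.normalize("NFC", decompressed)

    original = "Cafe\u0301"        # "Café" spelled with a combining acute accent
    result = round_trip(original)  # comes back as "Caf\u00e9", the precomposed spelling

    # Not codepoint-identical to the input...
    assert result != original
    # ...but canonically equivalent: both normalize to the same sequence.
    assert (unicodedata.normalize("NFC", original) ==
            unicodedata.normalize("NFC", result))

The final NFC step is exactly the documented recommendation above: the decompressor does not claim to reproduce the original code points, only a canonically equivalent, normalized form.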
Perhaps to help clear everyone's thinking here, it might help to further paraphrase what Mark said:

The operations of XXX followed by YYY can conformantly produce any text that is canonically equivalent to the original while purporting not to modify the interpretation of the text.

[For "operations of XXX followed by YYY" feel free to substitute anything you like. This is not about compression per se, but is the fundamental meaning of canonical equivalence. If the resultant output text is *canonically equivalent* to the original text, then the process has not modified the *interpretation* of the text. Note that I expanded Mark's formulation slightly -- his was still slightly too telegraphic. It may, on the other hand, have *changed* the text, of course. Canonical equivalents may be shorter or longer, and consist of different code point sequences.]

How the data format following operation XXX and preceding YYY is determined is completely arbitrary - it could be blargle or bleep or flassiwary; swap alternate bits; remap fleebert to whazzit; compress it; whatever.

In *practice* [note this is a recommendation, and not a conformance requirement], if a text operation produces canonically equivalent text which is not codepoint-identical, it *should* produce a normalized form of the text, and should document that it does so.

Does that help any? This really is not about compression at all -- it is about understanding what the conformance requirements of the standard are. Canonical equivalence is about not modifying the interpretation of the text. That is different from considerations about not changing the text, period.

If some process using text is sensitive to *any* change in the text whatsoever (CRC-checking, any form of digital signing, memory allocation), then, of course, *any* change to the text, including any normalization, will make a difference.

If some process using text is sensitive to the *interpretation* of the text, i.e. it is concerned about the content and meaning of the letters involved, then normalization to forms NFC or NFD, which only involve canonical equivalences, will *not* make a difference. Or to be more subtle about it, it might make a difference, but it is nonconformant to claim that a process which claims it does not make a difference is nonconformant. If you can parse that last sentence, then you are well on the way to understanding the Tao of Unicode.

--Ken
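
To make the two kinds of sensitivity concrete, a minimal sketch in Python (not part of the post above; crc32 and normalize are just convenient stand-ins for a byte-sensitive check and an interpretation-sensitive comparison). A byte-sensitive process sees the two spellings of é as different; a comparison under canonical equivalence does not:

    # A rough illustration only: crc32 stands in for any byte-sensitive check
    # (CRC, digital signature), normalize for any interpretation-sensitive one.
    import unicodedata
    import zlib

    precomposed = "\u00e9"    # é as a single code point
    decomposed = "e\u0301"    # e followed by a combining acute accent

    # Sensitive to *any* change: different code point sequences give
    # different bytes, so the check values differ.
    assert (zlib.crc32(precomposed.encode("utf-8")) !=
            zlib.crc32(decomposed.encode("utf-8")))

    # Sensitive only to *interpretation*: under canonical equivalence the two
    # spellings normalize to the same sequence, so no difference is seen.
    assert (unicodedata.normalize("NFD", precomposed) ==
            unicodedata.normalize("NFD", decomposed))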