Just to clear up some possible misconceptions that I think may have developed:
This thread started when Philippe Verdy mentioned the possibility of converting certain sequences of Unicode characters to a *canonically equivalent sequence* to improve compression. An example was converting Korean text, encoded with individual jamos, to a precomposed syllable or a combination of LV syllables plus T jamos. This type of conversion seems to be permissible under conformance clause C10, which states (paraphrasing here) that a process may replace a given character sequence by a canonical-equivalent sequence, and still claim not to have changed the interpretation of that sequence.

My question was whether a Unicode text compressor could legitimately convert text to a different canonical-equivalent sequence for purposes of improving compression, without violating users' expectations of so-called "lossless" compression. Some list members pointed out that the checksum of the compressed-and-decompressed text would not match the original, and wondered about possible security concerns.

I've been waiting to see if any UTC members, or other experts in conformance or compression issues, had anything to say about this. So far, the only such response has been from Mark Davis, who said that "a compressor can normalize, if (a) when decompressing it produces NFC, and (b) it advertises that it normalizes."

To clarify what I am NOT looking for:

(1) I am interested in the applicability of C10 to EXISTING compression techniques, such as for SCSU or BOCU-1, or for general-purpose algorithms like Huffman and LZ. Any approach that requires existing *decompressors* to be modified in order to undo the new transformation is NOT of interest. That amounts to inventing a new compression scheme.

(2) I am NOT interested in inventing a new normalization form, or any variants on existing forms.
Any approach that involves compatibility equivalences, ignores the Composition Exclusions table, or creates equivalences that do not exist in the Unicode Character Database (such as "U+1109 + U+1109 = U+110A") is NOT of interest. That amounts to unilaterally extending C10, which may already be too liberal to be applied to compression.

Note that (1) and (2) are closely related.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
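
P.S. For anyone who wants to see the Korean example and the checksum concern concretely, here is a minimal Python sketch. It rests on assumptions of mine: zlib (DEFLATE, i.e. LZ77 plus Huffman) stands in for a general-purpose compressor, since SCSU and BOCU-1 are not in the standard library, and the helper names canonically_equivalent and roundtrip are hypothetical. It only illustrates the behavior in question and does not settle whether C10 permits it.

```python
import hashlib
import unicodedata
import zlib

# The syllable U+D55C written three canonically equivalent ways.
jamos = "\u1112\u1161\u11AB"      # L + V + T conjoining jamos
lv_plus_t = "\uD558\u11AB"        # precomposed LV syllable + T jamo
precomposed = "\uD55C"            # single precomposed LVT syllable


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def roundtrip(text: str, normalize_first: bool) -> str:
    """Compress and decompress with an unmodified decompressor.

    If normalize_first is True, the compressor side converts the text
    to NFC before compressing; the decompressor does nothing special.
    """
    if normalize_first:
        text = unicodedata.normalize("NFC", text)
    packed = zlib.compress(text.encode("utf-8"))
    return zlib.decompress(packed).decode("utf-8")


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


plain = roundtrip(jamos, normalize_first=False)
normalized = roundtrip(jamos, normalize_first=True)

print(plain == jamos)                                # byte-for-byte identical
print(normalized == jamos)                           # code points differ
print(canonically_equivalent(normalized, jamos))     # interpretation unchanged
print(sha256_hex(normalized) == sha256_hex(jamos))   # checksums differ
```

The normalizing round trip returns text that is canonically equivalent to the input but not identical to it, which is exactly why a checksum taken before compression no longer matches afterward.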