Just to clear up some possible misconceptions that I think may have
developed:

This thread started when Philippe Verdy mentioned the possibility of
converting certain sequences of Unicode characters to a *canonically
equivalent sequence* to improve compression.  An example was converting
Korean text encoded as individual jamos into precomposed syllables, or
into a combination of LV syllables plus T jamos.

This type of conversion seems to be permissible under conformance clause
C10, which states (paraphrasing here) that a process may replace a given
character sequence by a canonical-equivalent sequence, and still claim
not to have changed the interpretation of that sequence.
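
To make the Korean example concrete, here is a minimal sketch using
Python's standard unicodedata module.  The helper name
canonically_equivalent is mine, not anything from the standard; it
simply compares NFD forms, which is one common way to test canonical
equivalence.

```python
import unicodedata

# Three spellings of the same Korean syllable (HAN):
jamos = "\u1112\u1161\u11ab"   # L + V + T conjoining jamos
lv_plus_t = "\ud558\u11ab"     # LV syllable U+D558 + T jamo U+11AB
syllable = "\ud55c"            # precomposed LVT syllable U+D55C


def canonically_equivalent(a: str, b: str) -> bool:
    """Illustrative helper: equal NFD forms means canonically equivalent."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(canonically_equivalent(jamos, syllable))
print(canonically_equivalent(lv_plus_t, syllable))

# The precomposed form is shorter before any compressor even runs.
print(len(jamos.encode("utf-8")), len(syllable.encode("utf-8")))
```

All three spellings are different code point sequences, which is exactly
why the choice among them matters to a compressor.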

My question was whether a Unicode text compressor could legitimately
convert text to a different canonical-equivalent sequence for purposes
of improving compression, without violating users' expectations of
so-called "lossless" compression.  Some list members pointed out that
the checksum of the compressed-and-decompressed text would not match the
original, and wondered about possible security concerns.
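
The checksum concern can be sketched as follows.  This is only an
illustration under my own assumptions: zlib stands in for a
general-purpose compressor, SHA-256 for the checksum, and the function
names compress_normalizing and decompress are hypothetical.  The
decompressor is completely unmodified; only the compressor normalizes.

```python
import hashlib
import unicodedata
import zlib


def compress_normalizing(text: str) -> bytes:
    """Hypothetical compressor that converts to NFC before compressing."""
    nfc = unicodedata.normalize("NFC", text)
    return zlib.compress(nfc.encode("utf-8"))


def decompress(blob: bytes) -> str:
    """Ordinary decompressor with no knowledge of the normalization step."""
    return zlib.decompress(blob).decode("utf-8")


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# "Hangul" spelled with six conjoining jamos.
original = "\u1112\u1161\u11ab\u1100\u1173\u11af"
round_tripped = decompress(compress_normalizing(original))

print(round_tripped == original)                  # code points differ
print(digest(round_tripped) == digest(original))  # so checksums differ
print(unicodedata.normalize("NFD", round_tripped)
      == unicodedata.normalize("NFD", original))  # still canonically equivalent
```

The round trip preserves canonical equivalence but not the original
code points, so any byte-level checksum or signature over the text will
no longer match.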

I've been waiting to see if any UTC members, or other experts in
conformance or compression issues, had anything to say about this.  So
far, the only such response has been from Mark Davis, who said that "a
compressor can normalize, if (a) when decompressing it produces NFC, and
(b) it advertises that it normalizes."

To clarify what I am, and am NOT, looking for:

(1)  I am interested in the applicability of C10 to EXISTING compression
techniques, such as SCSU or BOCU-1, or general-purpose algorithms like
Huffman and LZ.  Any approach that requires existing *decompressors* to
be modified in order to undo the new transformation is NOT of interest.
That amounts to inventing a new compression scheme.

(2)  I am NOT interested in inventing a new normalization form, or any
variants on existing forms.  Any approach that involves compatibility
equivalences, ignores the Composition Exclusions table, or creates
equivalences that do not exist in the Unicode Character Database (such
as "U+1109 + U+1109 = U+110A") is NOT of interest.  That amounts to
unilaterally extending C10, which may already be too liberal to be
applied to compression.
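
The three exclusions in (2) can each be checked against the Unicode
Character Database.  The sketch below again uses Python's unicodedata
module; the variable names are mine, and the specific characters other
than U+1109 and U+110A are examples I picked for illustration.

```python
import unicodedata

# (a) Not an equivalence in the UCD: SIOS + SIOS versus SSANGSIOS.
two_sios = "\u1109\u1109"
ssangsios = "\u110a"
print(unicodedata.decomposition(ssangsios) == "")           # no decomposition
print(unicodedata.normalize("NFC", two_sios) == two_sios)   # NFC leaves it alone

# (b) Compatibility equivalence only: U+3131 HANGUL LETTER KIYEOK.
hangul_letter = "\u3131"
print(unicodedata.normalize("NFC", hangul_letter) == hangul_letter)
print(unicodedata.normalize("NFKC", hangul_letter) == "\u1100")

# (c) Composition exclusion: U+0958 DEVANAGARI LETTER QA decomposes
# canonically, but NFC never recomposes it.
qa = "\u0958"
print(unicodedata.normalize("NFC", qa) == "\u0915\u093c")
```

A compressor that merged the pair in (a), folded (b), or recomposed (c)
would be producing text that is not canonically equivalent under any
existing normalization form, or would be defining a form of its own.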

Note that (1) and (2) are closely related.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

