Doug asked:

> Mark indicated that a compression-decompression cycle should not only
> stick to canonical-equivalent sequences, which is what C10 requires, but
> should convert text only to NFC (if at all). Ken mentioned
> normalization "to forms NFC or NFD," but I'm not sure this was in the
> same context. (Can we find a consensus on this?)
I don't think either of our recommendations here is specific to compression issues. Basically, if a process tinkers around with changing sequences to their canonical equivalents, then it is advisable that the end result actually *be* in one of the normalization forms, either NFD or NFC, and that this be explicitly documented as what the process does. Otherwise, you are just tinkering and leaving the data in an indeterminate (although still canonically equivalent) state.

Mark recommended NFC in particular, since that is the "least marked" (*hehe*) normalization form, i.e., the one that you are most likely to encounter, and the one that most Internet or web processes are likely to prefer.

--Ken

P.S. On the other hand, if you asked him nicely, Mark might find the more marked form, NFD, to his liking, especially since it is likely to contain more combining marks. Mark is definitely in favor of markedness. I, on the other hand, am definitely in favor of kennings, but we have found little practical or architectural use for them in the Unicode character-sea.
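[For readers following the thread: the distinction between canonically equivalent sequences and the two normalization forms can be illustrated with Python's standard `unicodedata` module. This sketch is an editorial illustration, not part of the original exchange.]

```python
import unicodedata

# "e with acute" has two canonically equivalent encodings:
precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = "e\u0301"    # U+0065 + U+0301 COMBINING ACUTE ACCENT (NFD form)

# Different code point sequences, yet canonically equivalent text:
print(precomposed == decomposed)                               # False

# Normalizing both to the same form makes them compare equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed) # True
print(unicodedata.normalize("NFD", precomposed) == decomposed) # True
```

A process that converts among such equivalent sequences without landing in a declared form leaves data in exactly the "indeterminate but canonically equivalent" state described above.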