Doug asked:

> Mark indicated that a compression-decompression cycle should not only
> stick to canonical-equivalent sequences, which is what C10 requires, but
> should convert text only to NFC (if at all). Ken mentioned
> normalization "to forms NFC or NFD," but I'm not sure this was in the
> same context. (Can we find a consensus on this?)
I don't think either of our recommendations here is specific to compression issues. Basically, if a process tinkers around with changing sequences to their canonical equivalents, then it is advisable that the end result actually *be* in one of the normalization forms, either NFD or NFC, and that this be explicitly documented as what the process does. Otherwise, you are just tinkering and leaving the data in an indeterminate (although still canonically equivalent) state.

Mark recommended NFC in particular, since that is the "least marked" (*hehe*) normalization form, i.e., the one that you are most likely to encounter, and the one that most Internet or web processes are likely to prefer.

--Ken

P.S. On the other hand, if you asked him nicely, Mark might find the more marked form, NFD, to his liking, especially since it is likely to contain more combining marks. Mark is definitely in favor of markedness. I, on the other hand, am definitely in favor of kennings, but we have found little practical or architectural use for them in the Unicode character-sea.
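[For readers following the thread: the distinction between canonically equivalent sequences and the two normalization forms can be illustrated with Python's standard `unicodedata` module. This sketch is an editorial illustration, not part of the original exchange.]

```python
import unicodedata

# "e with acute" has two canonically equivalent encodings:
precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = "e\u0301"    # U+0065 + U+0301 COMBINING ACUTE ACCENT (NFD form)

# Different code point sequences, yet canonically equivalent text:
print(precomposed == decomposed)                               # False

# Normalizing both to the same form makes them compare equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed) # True
print(unicodedata.normalize("NFD", precomposed) == decomposed) # True
```

A process that converts among such equivalent sequences without landing in a declared form leaves data in exactly the "indeterminate but canonically equivalent" state described above.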