On 28 Jul 2003, at 16:49, Kenneth Whistler wrote:

> Part of the specification of the Unicode normalization algorithm
> is idempotency *across* versions, so that addition of new
> characters to the standard, which require extensions of the
> tables for decomposition, recomposition, and composition
> exclusion in the algorithm, does *not* result in a situation 
> where application of a later version of the normalization algorithm 
> results in change of *any* string normalized by an earlier version 
> of the algorithm.
> 
> The suggested changes in combining class values would break *that*
> specification.

   Is this really the case? It seems to me that if 2 letters that (in an 
earlier version of Unicode) had different combining classes were changed (by a 
later version) to have the same combining class, it would still be backwards 
compatible. The effect is the same as if the normalisation had not been done, 
and the principle of "be conservative in what you generate, but liberal in what
you accept" means that no-one should be assuming that content which they 
receive has been normalised.

   In other words, if you receive i-a in Hebrew, you may deduce that it is not 
normalised, and normalise it yourself; and you have to do that anyway, so there 
is no loss.
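
   A quick sketch in Python makes the receiver's side concrete (the exact
Hebrew sequence at issue doesn't matter here; patah and hiriq, whose combining
classes are 17 and 14 in the current tables, serve as stand-ins):

    import unicodedata   # is_normalized() needs Python 3.8+

    ALEF, PATAH, HIRIQ = "\u05D0", "\u05B7", "\u05B4"
    received = ALEF + PATAH + HIRIQ    # marks out of canonical order
                                       # (class 17 before class 14)

    if not unicodedata.is_normalized("NFD", received):
        received = unicodedata.normalize("NFD", received)

    # The normaliser re-orders the marks by combining class: hiriq first.
    assert received == ALEF + HIRIQ + PATAH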

   If what I'm saying is true, then it is always possible for new versions of 
Unicode to change combining classes, as long as the following rule is observed:

      ---any 2 distinct character sequences which map to 2 distinct normalised
            sequences must continue to do so in every later version, but

      ---if 2 distinct character sequences map to the same normalised character
             sequence in an earlier version of Unicode, they may map to
             distinct sequences in a later version.

   (Or, in other words, information that was retained must not be lost, but 
just because information was discarded by an earlier version does not mean that 
it will always be discarded.)
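
   That rule is easy to state as a test. A minimal sketch in Python (the two
normaliser callables and the sample corpus are hypothetical stand-ins for an
earlier and a later Unicode version):

    from itertools import combinations
    from typing import Callable, Iterable, List, Tuple

    def rule_violations(old_nf: Callable[[str], str],
                        new_nf: Callable[[str], str],
                        samples: Iterable[str]) -> List[Tuple[str, str]]:
        """Pairs the old normaliser kept distinct but the new one
        conflates: the one change the rule above forbids.  (Old-equal
        pairs becoming new-distinct is explicitly allowed.)"""
        return [(s, t)
                for s, t in combinations(list(samples), 2)
                if old_nf(s) != old_nf(t) and new_nf(s) == new_nf(t)]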

   As new characters are encoded in Unicode, *backwards* compatibility is 
assured, but not forwards. If your application assumes that an unencoded code 
point will remain unencoded for all time, then eventually it will get an 
unpleasant shock. That is acceptable, because the kinds of change that are 
allowed are known in advance. It is just this reasoning, applied to combining 
classes, that lets us 
conclude that *merging* classes is allowed, but that if 2 characters have the 
same class, they must have the same class forever.
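
   To see why merging is the safe direction, here is a toy version of the
canonical ordering step, parameterised by a combining-class table. (The
single-letter "characters" and the tables are invented for the demonstration;
14 and 17 echo hiriq and patah.)

    def canonical_order(s: str, ccc: dict) -> str:
        """Stable sort of each run of combining marks (class > 0) by
        class; starters (class 0) act as barriers."""
        chars = list(s)
        for i in range(1, len(chars)):
            c = chars[i]
            if ccc.get(c, 0) == 0:
                continue
            j = i
            while j > 0 and 0 < ccc.get(chars[j - 1], 0) > ccc.get(c, 0):
                chars[j] = chars[j - 1]
                j -= 1
            chars[j] = c
        return "".join(chars)

    OLD = {"i": 14, "a": 17}    # distinct classes, as today
    NEW = {"i": 14, "a": 14}    # the two classes merged

    print(canonical_order("xai", OLD))  # 'xia' -- old rules re-order a-i
    print(canonical_order("xai", NEW))  # 'xai' -- merged class keeps the order
    print(canonical_order("xia", NEW))  # 'xia' -- old normalised text unchanged

   The last line is the point: anything normalised under the old table is
already sorted under the merged table, so the new normaliser leaves it
untouched, while the a-i/i-a distinction is no longer destroyed.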

   This implies that if characters X and Y, with combining classes A and B, 
have a semantic difference between XY and YX which we discover only belatedly, 
then we may set the combining classes of both of them to some common value C 
(it doesn't matter which value we pick, as long as min(A,B) <= C <= max(A,B)), 
BUT we must also set the combining classes of *all other characters* whose 
class D lies in that same range to C. (Otherwise a string that was already in 
canonical order under the old classes could be re-ordered by the new ones, 
which is exactly the breakage we set out to avoid.)
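
   Mechanically, the remapping is a one-liner; a sketch (the table shape and
the name merge_classes are mine):

    def merge_classes(ccc: dict, lo: int, hi: int, c: int) -> dict:
        """Collapse every combining class in [lo, hi] to c.  Remapping
        the whole interval, not just the two offending classes, is what
        keeps text sorted under the old table sorted under the new."""
        assert lo <= c <= hi
        return {ch: (c if lo <= v <= hi else v) for ch, v in ccc.items()}

   Applied to the toy tables above, merge_classes(OLD, 14, 17, 14) yields NEW.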

   I don't see why anyone who accepts that Unicode is an extensible character 
set could object to such a change. And luckily, it's just what would solve the 
Hebrew normalisation problem.

        /|
 . . . (_|/ o n a t h a n
        /|
       (_/
