On 25/11/2003 10:03, John Cowan wrote:

> ...  And as for
> canonical equivalence, the most efficient way to compare strings for
> it is to normalize both of them in some way and then do a raw
> binary compare.  Since it adds efficiency to normalize only once,
> it is worthwhile to define a few normalization forms and urge
> people to produce text in one of them, so that receivers need not
> normalize but need only check for normalization, typically much cheaper.
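In concrete terms, the comparison described above might look like the following minimal sketch in Python, assuming NFC as the agreed normalisation form (the function name is only illustrative):

    import unicodedata

    def canonically_equal(a: str, b: str) -> bool:
        # Normalise both strings to the same form and compare the
        # results as raw code point sequences.
        return (unicodedata.normalize("NFC", a)
                == unicodedata.normalize("NFC", b))

    # U+00E9 and U+0065 U+0301 are different sequences but
    # canonically equivalent, so they compare equal.
    assert canonically_equal("\u00e9", "e\u0301")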



If receivers are expected to check for normalisation, they are presumably expected also to normalise when the check fails; if they do not, they are in conflict with conformance clause C9 - at least with the "ideally" of its last paragraph, and probably with the principle that "no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences".

The efficiency gain comes from the expectation that the great majority of received strings are already normalised; but the system must still be able to cope with the small proportion that are not. It follows that if combining classes are changed in such a way that the normalised form of certain rare or anomalous strings is no longer preserved, the system can cope with those strings too - they are no worse than any other non-normalised input. And so the argument from normalisation stability against changing combining classes also fails, at least where the changes affect rare or obscure characters, or combinations of characters, little used in existing texts. One example, if Doug will forgive me, is Hebrew points. There may well be others.
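A sketch of that receiver path, again in Python and again assuming NFC (unicodedata.is_normalized requires Python 3.8 or later; the function name is only illustrative):

    import unicodedata

    def accept(s: str) -> str:
        # Cheap check first: the great majority of incoming strings
        # are expected to be normalised already.
        if unicodedata.is_normalized("NFC", s):
            return s
        # Fall back to full normalisation for the small minority
        # that are not - including any strings whose normalised form
        # a combining-class change failed to preserve.
        return unicodedata.normalize("NFC", s)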

So, it seems that Unicode has bound itself by its stability policy to something which is both unnecessary and in fundamental conflict with its own conformance clause C10. I urge reconsideration of the policy.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




