Re: Back to the subject: Folding algorithm and canonical equivalence

Mark Davis Mon, 19 Jul 2004 14:28:13 -0700

You did point out an oversight; Asmus and I have been working on the issue.


âMark

----- Original Message ----- 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, July 19, 2004 13:21
Subject: Back to the subject: Folding algorithm and canonical equivalence


> There has been extensive discussion in this thread on the specifics of
> accent and diacritic folding. But no one has answered my point, repeated
> below, that there seems to be a conflict between the folding algorithm
> (rather than the details of specific foldings) and the principle of
> canonical equivalence. Specifically, it seems to breach the principle in
> Unicode Conformance Clause C9:
>
> > Ideally, an implementation would always interpret two
> > canonical-equivalent character
> > sequences identically. There are practical circumstances under which
> > implementations
> > may reasonably distinguish them.
>
> Are the authors of UTR #30 claiming that folding is one of those
> practical circumstances, or is this just an oversight?
>
> Peter Kirk
>
> On 17/07/2004 23:25, Peter Kirk wrote:
>
> > I was just reviewing the UTR #30 draft in response to Rick's notice
> > about it. And I believe I may have found a point in which the folding
> > algorithm as given may violate the principle of canonical equivalence.
> > But I would like some clarification from list members before providing
> > formal input on this point.
> >
> > Consider a sequence made up of a base character B and two combining
> > marks M1 and M2, in which the combining class of M1 is less than that
> > of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
> > representations of the same sequence, but only the former is in
> > canonical order. Suppose that a folding is defined including the
> > operation <B, M2> -> X, but no other relevant operations. When this
> > folding is applied, according to the folding algorithms defined in
> > sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the
> > sequence <B, M2, M1> will be folded to <X, M1> and will not be further
> > changed, but the sequence <B, M1, M2> will not be changed at all by
> > the folding because the sequence <B, M2> will never be found. (By
> > contrast, a folding operation <B, M1> -> Y will be applied to both
> > sequences, because the canonical decomposition step converts <B, M2,
> > M1> to <B, M1, M2> and the folding operation is re-applied and finds a
> > match the second time.) The implication is that folding of two
> > canonically equivalent strings gives different (and not canonically
> > equivalent) results.
> >
> > This is not a purely theoretical point. The Diacritic Folding as
> > specified in
> > http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
> > includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
> > but no general rule to delete DAGESH (or any other combining marks; I
> > think there needs to be such a rule, and I have already posted a
> > formal response saying that). Sequences like <BET, DAGESH, PATAH> are
> > very common in Hebrew text, and commonly written in this order which
> > is logically correct and preferred by current rendering technologies,
> > but the canonical order is in fact <BET, PATAH, DAGESH>; thus both
> > sequences will be found in data depending on whether or not it has
> > been normalised. The effect of applying Diacritic Folding exactly as
> > specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but
> > the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact
> > I consider that both should be folded to just BET, but that is not
> > what the current data file specifies.)
> >
> > I hope I have not totally misunderstood the folding algorithm here.
> > But it seems to me that what is missing in the algorithm is an initial
> > step of normalising the data. The introductory text to section 4 seems
> > to suggest that this has been avoided because folding may need to
> > preserve the distinction between NFC and NFD data - although the
> > algorithm as presented does not in fact do this. Since in practice the
> > input data is not necessarily in either NFC or NFD and there is no
> > easy way to detect which is being used, the only meaningful approach
> > is for the user of the folding to specify whether the output of the
> > folding should be NFC or NFD.
> >
> > Of course there might be a real requirement for a folding which, for
> > example, removes DAGESH when combined with BET (but not with other
> > base characters) irrespective of what other combining marks might
> > intervene. But such foldings would need a considerably more powerful
> > folding algorithm.
> >
>
>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>

Re: Back to the subject: Folding algorithm and canonical equivalence

Reply via email to