You did point out an oversight; Asmus and I have been working on the issue.
Mark

----- Original Message -----
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, July 19, 2004 13:21
Subject: Back to the subject: Folding algorithm and canonical equivalence

> There has been extensive discussion in this thread on the specifics of
> accent and diacritic folding. But no one has answered my point, repeated
> below, that there seems to be a conflict between the folding algorithm
> (rather than the details of specific foldings) and the principle of
> canonical equivalence. Specifically, it seems to breach the principle in
> Unicode Conformance Clause C9:
>
> > Ideally, an implementation would always interpret two
> > canonical-equivalent character sequences identically. There are
> > practical circumstances under which implementations may reasonably
> > distinguish them.
>
> Are the authors of UTR #30 claiming that folding is one of those
> practical circumstances, or is this just an oversight?
>
> Peter Kirk
>
> On 17/07/2004 23:25, Peter Kirk wrote:
>
> > I was just reviewing the UTR #30 draft in response to Rick's notice
> > about it. And I believe I may have found a point in which the folding
> > algorithm as given may violate the principle of canonical equivalence.
> > But I would like some clarification from list members before providing
> > formal input on this point.
> >
> > Consider a sequence made up of a base character B and two combining
> > marks M1 and M2, in which the combining class of M1 is less than that
> > of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
> > representations of the same sequence, but only the former is in
> > canonical order. Suppose that a folding is defined including the
> > operation <B, M2> -> X, but no other relevant operations.
> > When this folding is applied, according to the folding algorithms
> > defined in sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a)
> > the sequence <B, M2, M1> will be folded to <X, M1> and will not be
> > further changed, but the sequence <B, M1, M2> will not be changed at
> > all by the folding because the sequence <B, M2> will never be found.
> > (By contrast, a folding operation <B, M1> -> Y will be applied to both
> > sequences, because the canonical decomposition step converts
> > <B, M2, M1> to <B, M1, M2> and the folding operation is re-applied and
> > finds a match the second time.) The implication is that folding of two
> > canonically equivalent strings gives different (and not canonically
> > equivalent) results.
> >
> > This is not a purely theoretical point. The Diacritic Folding as
> > specified in
> > http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
> > includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
> > but no general rule to delete DAGESH (or any other combining marks; I
> > think there needs to be such a rule, and I have already posted a
> > formal response saying that). Sequences like <BET, DAGESH, PATAH> are
> > very common in Hebrew text, and commonly written in this order which
> > is logically correct and preferred by current rendering technologies,
> > but the canonical order is in fact <BET, PATAH, DAGESH>; thus both
> > sequences will be found in data depending on whether or not it has
> > been normalised. The effect of applying Diacritic Folding exactly as
> > specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but
> > the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact
> > I consider that both should be folded to just BET, but that is not
> > what the current data file specifies.)
> >
> > I hope I have not totally misunderstood the folding algorithm here.
> > But it seems to me that what is missing in the algorithm is an initial
> > step of normalising the data. The introductory text to section 4 seems
> > to suggest that this has been avoided because folding may need to
> > preserve the distinction between NFC and NFD data - although the
> > algorithm as presented does not in fact do this. Since in practice the
> > input data is not necessarily in either NFC or NFD and there is no
> > easy way to detect which is being used, the only meaningful approach
> > is for the user of the folding to specify whether the output of the
> > folding should be NFC or NFD.
> >
> > Of course there might be a real requirement for a folding which, for
> > example, removes DAGESH when combined with BET (but not with other
> > base characters) irrespective of what other combining marks might
> > intervene. But such foldings would need a considerably more powerful
> > folding algorithm.
>
> --
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
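
For anyone who wants to reproduce the mismatch described in the quoted message, here is a minimal Python sketch. It is not the UTR #30 algorithm itself. The function names are hypothetical, the rule table holds only the single 05D1 05BC -> 05D1 operation quoted above, and the "fold, canonically decompose, repeat until stable" loop is an assumption based on Peter's description of sections 4.1.1 and 4.1.2. The last function is one possible reading of the "more powerful" folding he mentions, not anything the draft specifies.

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"

# Single rule taken from the example in the quoted message:
# 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET.
RULES = {BET + DAGESH: BET}


def fold_as_described(text, rules=RULES):
    """Apply the rules, canonically decompose, and repeat until stable.

    Assumption: this mirrors the loop Peter describes; it is a sketch,
    not the normative text of the draft.
    """
    while True:
        folded = text
        for source, target in rules.items():
            folded = folded.replace(source, target)
        folded = unicodedata.normalize("NFD", folded)
        if folded == text:
            return folded
        text = folded


def fold_normalised_first(text, rules=RULES):
    """Same loop, but with the initial normalisation step Peter proposes."""
    return fold_as_described(unicodedata.normalize("NFD", text), rules)


def drop_dagesh_on_bet(text):
    """Hypothetical context-aware folding: delete DAGESH whenever the
    current base character is BET, whatever marks intervene."""
    out = []
    base = None
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            base = ch  # a new base character starts a new sequence
        elif base == BET and ch == DAGESH:
            continue  # drop the DAGESH attached to BET
        out.append(ch)
    return "".join(out)


typed_order = BET + DAGESH + PATAH      # common in Hebrew data
canonical_order = BET + PATAH + DAGESH  # what NFD produces

# The two inputs are canonically equivalent ...
print(unicodedata.normalize("NFD", typed_order) == canonical_order)
# ... but the plain loop folds only one of them.
print(fold_as_described(typed_order) == fold_as_described(canonical_order))
# Normalising first gives one answer for both inputs.
print(fold_normalised_first(typed_order) == fold_normalised_first(canonical_order))
# The context-aware variant also agrees on both inputs.
print(drop_dagesh_on_bet(typed_order) == drop_dagesh_on_bet(canonical_order))
```

Note that with normalisation as a first step the two inputs agree only because neither is folded: in canonical order the PATAH sits between BET and DAGESH, so the <BET, DAGESH> rule never matches. That is the gap the context-aware variant is meant to illustrate.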

