Re: Folding algorithm and canonical equivalence

Peter Kirk Sun, 18 Jul 2004 05:44:56 -0700

On 18/07/2004 08:52, Asmus Freytag wrote:

At 11:15 PM 7/17/2004, John Cowan wrote:
I agree that in the TR#30 context, the Right Thing is to remove the
character pair mappings altogether, and all of the single-character
mappings that have canonical decompositions.
In other words, in your opinion, the reasonable thing to do would be for someone to do the AccentFolding as defined in the TR, and then do a DiacriticFolding, to fold the cases where even in NFD accents don't exist as separate characters.

This is not quite what I had in mind, but only because when I look more closely at AccentFolding as defined I see a problem with it. It is specified as affecting only "Latin/Greek/Cyrillic characters with canonical decomposition". But this is inadequate because there are many cases of Latin/Greek/Cyrillic characters (and most cases of Hebrew ones) where an accent should be removed even though there is no precomposed form encoded and so no canonical decomposition. This definition needs to be extended to deletion of all accents, i.e. probably all non-spacing combining marks, regardless of whether there is a canonical decomposition, at least when the base character is Latin/Greek/Cyrillic/Hebrew (and probably also at least Arabic and Syriac, in which combining marks function much as in Hebrew).

Such an extended AccentFolding would then function as a good base for a broader DiacriticFolding.
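As a minimal sketch of the extended AccentFolding described above (not the TR#30 definition), the following Python strips every nonspacing mark (gc=Mn) that follows a Latin, Greek, Cyrillic or Hebrew base, whether or not a precomposed form exists. The function names and the script test by character-name prefix are assumptions made for illustration: Python's unicodedata module has no Script property, so the name prefix is only a rough stand-in.

```python
import unicodedata

# Assumption: a character's script is approximated by the first word of its
# Unicode name. This is a rough stand-in for the real Script property.
IN_SCOPE_PREFIXES = ("LATIN", "GREEK", "CYRILLIC", "HEBREW")


def base_in_scope(ch):
    """Return True if ch looks like a Latin/Greek/Cyrillic/Hebrew character."""
    return unicodedata.name(ch, "").startswith(IN_SCOPE_PREFIXES)


def extended_accent_fold(text):
    """Hypothetical folding: drop all gc=Mn marks after an in-scope base."""
    out = []
    strip = False  # whether marks attached to the current base are removed
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if not strip:
                out.append(ch)  # mark on an out-of-scope base: keep it
        else:
            strip = base_in_scope(ch)
            out.append(ch)
    # Recompose whatever was left untouched.
    return unicodedata.normalize("NFC", "".join(out))


print(extended_accent_fold("\u00e9"))   # precomposed e with acute
print(extended_accent_fold("x\u0304"))  # x + combining macron, no precomposed form
```

This sketch makes no script-specific exceptions, so it also removes the breve from Cyrillic U+0439; any exceptions would have to be listed explicitly.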

That's certainly reasonable and not the only case where it's interesting to have chained foldings.

Jony is arguing to extend AccentFolding to Hebrew (fold to unpointed). His suggestion is to fold *all* combining marks used with Hebrew in that case. I want to double check that he really means all combining marks in the Hebrew block, or just some of them.

AccentFolding can't just fold all gc=Mn, since that would include quite a few that are script specific as well as the marks for Symbols, for which different folding rules might need to apply in some context. So I think I'll use as the set of accents to remove all the ones that show up as part of decompositions, ...

This restriction will end up with some ridiculous results if applied to a language in which only some of the regular letters are supported as precomposed forms: the identical mark will be stripped from some base characters but not from others.
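To make the inconsistency concrete, here is a small Python sketch. It assumes a folding that is applied only to characters that have a canonical decomposition, as in the TR's wording quoted earlier; the function name fold_precomposed_only is hypothetical, and the macron examples are chosen only because a precomposed a with macron is encoded while no precomposed m with macron is.

```python
import unicodedata


def fold_precomposed_only(text):
    """Hypothetical restricted folding: strip marks only from characters
    that have a canonical decomposition of their own."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomp = unicodedata.decomposition(ch)
        # Compatibility decompositions are tagged with "<...>"; skip those.
        if decomp and not decomp.startswith("<"):
            nfd = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in nfd
                               if unicodedata.category(c) != "Mn"))
        else:
            out.append(ch)
    return "".join(out)


# Same mark (U+0304 COMBINING MACRON), two different outcomes:
a_folded = fold_precomposed_only("a\u0304")  # NFC gives U+0101, which decomposes
m_folded = fold_precomposed_only("m\u0304")  # no precomposed form exists
print(a_folded == "a", m_folded == "m\u0304")
```

Under this assumption the macron disappears from the a but stays on the m, which is the kind of result the paragraph above objects to.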

I would suggest that if "different folding rules might need to apply in some context", a different folding should be applied rather than trying to overload an existing folding whose function is supposed to be to remove accents or diacritics. If it removes some accents or diacritics from some base characters, but does not remove all from all, users will simply reject the folding as unreliable.

I accept that there might be some script-specific cases in which particular accents should not be removed. The breve in Cyrillic i kratkoe might be an example; but then this might be rather too language-specific as well. But these should be clearly defined and justified exceptions, rather than their possible existence being a reason to restrict the general applicability of accent and diacritic folding.

... plus as many Hebrew accents as Jony can confirm.
(another alternative would be to make the Hebrew folding a separate definition, to allow people to apply one, but not the other.)
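For reference, the set described in the quoted text, the marks that show up as part of decompositions, can be enumerated from the Unicode Character Database. The sketch below is one possible reading of that description, assuming "decompositions" means canonical decomposition mappings and "accents" means gc=Mn characters; the function name is hypothetical, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def marks_in_canonical_decompositions():
    """Collect every gc=Mn character that occurs in some character's
    canonical decomposition mapping."""
    marks = set()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters without a mapping and compatibility mappings,
        # which start with a "<tag>".
        if not decomp or decomp.startswith("<"):
            continue
        for part in decomp.split():
            c = chr(int(part, 16))
            if unicodedata.category(c) == "Mn":
                marks.add(c)
    return marks


marks = marks_in_canonical_decompositions()
print(len(marks), unicodedata.unidata_version)
```

A mark such as U+0301 COMBINING ACUTE ACCENT is in this set, while a mark that never occurs in a precomposed character, such as the Hebrew accent U+0591, is not, so it would have to be added by hand.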

I'll make another Draft of DiacriticFolding.txt with the canonical decomp derivables removed. A./

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
