Re: Folding algorithm and canonical equivalence

Peter Kirk Sun, 18 Jul 2004 16:06:29 -0700

On 18/07/2004 22:15, Asmus Freytag wrote:

At 05:25 AM 7/18/2004, Peter Kirk wrote:
I accept that there might be some script-specific cases in which particular accents should not be removed. The breve in Cyrillic i kratkoe might be an example; but then this might be rather too language-specific as well. But these should be clearly defined and justified exceptions, rather than their possible existence being a reason to restrict the general applicability of accent and diacritic folding.
I was thinking rather more of Khmer, where some characters that are considered letters are given gc=Mn. In that case, folding would be very inappropriate.

So the answer has to be to limit the removal of diacritical marks in AccentFolding to those that are truly *accents*. That's a subset of gc=Mn. There are two options for a starting set. One is to select all 'accents' (note, not base forms) that occur in some precomposed character, and then add additional ones on a case-by-case basis (e.g. stroke overlay).

Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter will be part of 4.1), and make some principled additions / deletions.
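A minimal sketch of this second starting set, assuming Python's standard unicodedata module. The names ACCENT_BLOCKS, is_accent and accent_fold are hypothetical, and the "principled additions / deletions" are not modelled. It decomposes to NFD, drops only gc=Mn characters from the two blocks named above, and recomposes, so a Khmer mark with gc=Mn is left alone while the breve of Cyrillic i kratkoe is removed.

```python
import unicodedata

# Assumed starting set: the two blocks named in the message.
# Any later additions or deletions would be applied on top of this.
ACCENT_BLOCKS = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))


def is_accent(ch):
    """True if ch has gc=Mn and lies in one of the assumed accent blocks."""
    cp = ord(ch)
    return unicodedata.category(ch) == "Mn" and any(
        lo <= cp <= hi for lo, hi in ACCENT_BLOCKS
    )


def accent_fold(text):
    """Decompose, drop the selected accents, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not is_accent(ch))
    return unicodedata.normalize("NFC", kept)
```

Because the test is applied after decomposition, an accent is removed whether or not a precomposed form exists. Marks that have no canonical decomposition, such as a stroke overlay built into the letter, are untouched by this sketch.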

This sounds good to me. Among the additions should be all Hebrew combining marks unless this is done separately.

All script-specific non-spacing marks for Indic scripts, etc., should not be part of 'AccentFolding', in my opinion.

.. when I look more closely at AccentFolding as defined I see a problem with it. It is specified as affecting only "Latin/Greek/Cyrillic characters with canonical decomposition". But this is inadequate because there are many cases of Latin/Greek/Cyrillic characters (and most cases of Hebrew ones) where an accent should be removed even though there is no precomposed form encoded and so canonical decomposition
Correct. Whatever the set of combining marks is, we then need to define a set of base characters. We could simply use sc=Latin + sc=Greek + sc=Cyrillic as a starting set, to treat all accented characters equally.
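As a rough illustration of restricting the folding by base character: Python's standard library does not expose the Script property, so the sketch below approximates sc=Latin/Greek/Cyrillic by the first word of the character name. That approximation, the names FOLDABLE_SCRIPTS, is_foldable_base and fold_accents_on_bases, and the choice of the 0300 block as the mark set are all assumptions made for the example.

```python
import unicodedata

# Stand-in for sc=Latin + sc=Greek + sc=Cyrillic (approximated via names).
FOLDABLE_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def is_foldable_base(ch):
    """Approximate the script test using the first word of the name."""
    return unicodedata.name(ch, "").split(" ")[0] in FOLDABLE_SCRIPTS


def fold_accents_on_bases(text):
    """Drop 0300-block marks only when they follow a foldable base."""
    out = []
    base_ok = False
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        if category == "Mn" and 0x0300 <= ord(ch) <= 0x036F:
            if base_ok:
                continue  # accent on a Latin/Greek/Cyrillic base: remove
        elif not category.startswith("M"):
            # A new base character decides the fate of the marks after it.
            base_ok = is_foldable_base(ch)
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

With this shape, an accent on a Latin, Greek or Cyrillic base is removed even when no precomposed form is encoded, while the same combining mark after a Hebrew letter is kept.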
What about other scripts:
If you feel that Hebrew folding to unpointed is something that should happen every time other accents are folded, we can add Hebrew (or we can make a separate folding, HebrewMarksFolding, that people can invoke optionally). I tend to prefer the latter. Since for Hebrew (the languages), a folding to unpointed might be one of the foldings that someone might want to apply permanently, it should be separately named and defined, on the principle that the foldings should be building blocks.

Agreed that it should be separate, but I would also see it as included as a subset within the regular accent folding.
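To make the "building blocks" idea concrete, here is a hedged sketch in which one hypothetical factory, make_mark_folding, produces both a standalone Hebrew folding and a general accent folding that includes the Hebrew marks as a subset. The range U+0591..U+05C7 filtered by gc=Mn is an assumption about what a HebrewMarksFolding would cover, not a definition from the draft.

```python
import unicodedata


def make_mark_folding(*ranges):
    """Build a folding that removes gc=Mn characters in the given ranges."""
    def fold(text):
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(
            ch for ch in decomposed
            if not (
                unicodedata.category(ch) == "Mn"
                and any(lo <= ord(ch) <= hi for lo, hi in ranges)
            )
        )
        return unicodedata.normalize("NFC", kept)
    return fold


GENERIC_ACCENTS = (0x0300, 0x036F)  # assumed generic accent block
HEBREW_MARKS = (0x0591, 0x05C7)     # assumed span of Hebrew points and accents

# Separately named building block, usable on its own.
hebrew_marks_folding = make_mark_folding(HEBREW_MARKS)

# General folding that contains the Hebrew folding as a subset.
accent_folding = make_mark_folding(GENERIC_ACCENTS, HEBREW_MARKS)
```

Here hebrew_marks_folding leaves Latin accents alone, so it can be applied permanently to Hebrew text, while accent_folding strips both kinds of mark. Hebrew punctuation in the same range, such as maqaf, is kept because it does not have gc=Mn.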


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
