By this logic, I cannot see why you lump Latin/Greek/Cyrillic together. I think there should be a single diacritics removal folding, which should be tailorable.
Jony

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Asmus Freytag
> Sent: Monday, July 19, 2004 12:16 AM
> To: Peter Kirk
> Cc: John Cowan; Unicode List; jony Rosenne
> Subject: Re: Folding algorithm and canonical equivalence
>
>
> At 05:25 AM 7/18/2004, Peter Kirk wrote:
> >I accept that there might be some script-specific cases in which
> >particular accents should not be removed. The breve in Cyrillic i kratkoe
> >might be an example; but then this might be rather too language-specific
> >as well. But these should be clearly defined and justified exceptions,
> >rather than their possible existence being a reason to restrict the
> >general applicability of accent and diacritic folding.
>
> I was thinking rather more of Khmer, where some characters that are
> considered letters are given gc=Mn. In that case, folding would be very
> inappropriate.
>
> So the answer has to be to limit the removal of diacritical marks in
> AccentFolding to those that are truly *accents*. That's a subset of gc=Mn.
> There are two options for a starting set:
> select all 'accents' (note, not baseforms) that occur in some precomposed
> character. And then add additional ones on a case by case basis (e.g.
> stroke overlay).
>
> Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter will be
> part of 4.1), and make some principled additions / deletions.
>
> All script-specific non-spacing marks for Indic scripts etc. should not be
> part of 'AccentFolding', in my opinion.
>
> >.. when I look more closely at AccentFolding as defined I see a problem
> >with it. It is specified as affecting only "Latin/Greek/Cyrillic
> >characters with canonical decomposition". But this is inadequate because
> >there are many cases of Latin/Greek/Cyrillic characters (and most cases of
> >Hebrew ones) where an accent should be removed even though there is no
> >precomposed form encoded and so canonical decomposition
>
> Correct. Whatever the set of combining marks is, we then need to define a
> set of base characters. We could simply use sc=Latin + sc=Greek +
> sc=Cyrillic as a starting set, to treat all accented characters equally.
>
> What about other scripts:
>
> If you feel that Hebrew folding to unpointed is something that should
> happen every time other accents are folded, we can add Hebrew (or we can
> make a separate folding, HebrewMarksFolding,
> that people can invoke optionally). I tend to prefer the latter. Since for
> Hebrew (the languages), a folding to unpointed might be one of the foldings
> that someone might want to apply permanently, it should be separately named
> and defined, on the principle that the foldings should be building blocks.
>
> A./
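
To make the proposal above concrete, here is a minimal Python sketch of the second option: take the gc=Mn characters of the 0300 and 1DC0 blocks as the accent set, and remove them only after Latin, Greek, or Cyrillic base characters, whether or not a precomposed form exists. It is an illustration under stated assumptions, not the definition of AccentFolding. The function names, the `keep` tailoring parameter, and the use of character names as a stand-in for the Script property are all hypothetical choices made for this example. The output is recomposed to NFC, which is also an assumption.

```python
import unicodedata

# Assumption: the accent set is every gc=Mn character in the
# U+0300..U+036F and U+1DC0..U+1DFF blocks, with no additions or deletions.
ACCENT_RANGES = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))

# Assumption: the standard library does not expose the Script (sc) property,
# so the script is approximated from the start of the character name.
BASE_NAME_PREFIXES = ("LATIN ", "GREEK ", "CYRILLIC ")

# Assumption: Hebrew points and accents are the gc=Mn characters in this range.
HEBREW_MARK_RANGE = (0x0591, 0x05C7)


def is_accent(ch):
    """True if ch is a nonspacing mark inside one of the accent blocks."""
    cp = ord(ch)
    return unicodedata.category(ch) == "Mn" and any(
        lo <= cp <= hi for lo, hi in ACCENT_RANGES
    )


def is_foldable_base(ch):
    """True if ch looks like a Latin, Greek, or Cyrillic character."""
    return unicodedata.name(ch, "").startswith(BASE_NAME_PREFIXES)


def accent_fold(text, keep=frozenset()):
    """Drop accents that follow Latin/Greek/Cyrillic base characters.

    keep is a hypothetical tailoring hook: a set of (base, mark) pairs
    that must survive, such as Cyrillic i plus breve for i kratkoe.
    """
    out = []
    base = ""
    for ch in unicodedata.normalize("NFD", text):
        if is_accent(ch):
            if base and is_foldable_base(base) and (base, ch) not in keep:
                continue  # drop this accent
        else:
            base = ch  # any non-accent character starts a new sequence
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


def hebrew_marks_fold(text):
    """Separate building block: fold pointed Hebrew to unpointed."""
    lo, hi = HEBREW_MARK_RANGE
    kept = [
        ch
        for ch in unicodedata.normalize("NFD", text)
        if not (lo <= ord(ch) <= hi and unicodedata.category(ch) == "Mn")
    ]
    return unicodedata.normalize("NFC", "".join(kept))
```

With these definitions, `accent_fold("café")` gives `"cafe"`, and `accent_fold("й", keep={("и", "\u0306")})` leaves i kratkoe intact. Khmer and Hebrew marks are untouched by `accent_fold` because they lie outside the two accent blocks. Hebrew text is folded only when `hebrew_marks_fold` is invoked explicitly, which keeps the two foldings as independent building blocks.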