RE: Folding algorithm and canonical equivalence

Jony Rosenne Sun, 18 Jul 2004 20:25:03 -0700

By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.

I think there should be a single diacritics removal folding, which should be
tailorable.


Jony

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Asmus Freytag
> Sent: Monday, July 19, 2004 12:16 AM
> To: Peter Kirk
> Cc: John Cowan; Unicode List; jony Rosenne
> Subject: Re: Folding algorithm and canonical equivalence
> 
> 
> At 05:25 AM 7/18/2004, Peter Kirk wrote:
> >I accept that there might be some script-specific cases in which
> >particular accents should not be removed. The breve in Cyrillic i
> >kratkoe might be an example; but then this might be rather too
> >language-specific as well. But these should be clearly defined and
> >justified exceptions, rather than their possible existence being a
> >reason to restrict the general applicability of accent and diacritic
> >folding.
> 
> I was thinking rather more of Khmer, where some characters that are
> considered letters are given gc=Mn. In that case, folding would be
> very inappropriate.
> 
> So the answer has to be to limit the removal of diacritical marks in
> AccentFolding to those that are truly *accents*. That's a subset of
> gc=Mn. There are two options for a starting set:
> Select all 'accents' (note, not baseforms) that occur in some
> precomposed character, and then add additional ones on a case-by-case
> basis (e.g. stroke overlay).
> 
> Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter
> will be part of 4.1), and make some principled additions / deletions.
> 
> All script-specific non-spacing marks (for Indic scripts, etc.) should
> not be part of 'AccentFolding', in my opinion.
> 
> >.. when I look more closely at AccentFolding as defined I see a
> >problem with it. It is specified as affecting only
> >"Latin/Greek/Cyrillic characters with canonical decomposition". But
> >this is inadequate because there are many cases of
> >Latin/Greek/Cyrillic characters (and most cases of Hebrew ones) where
> >an accent should be removed even though there is no precomposed form
> >encoded and so canonical decomposition
> 
> Correct. Whatever the set of combining marks is, we then need to
> define a set of base characters. We could simply use sc=Latin +
> sc=Greek + sc=Cyrillic as a starting set, to treat all accented
> characters equally.
> 
> What about other scripts:
> 
> If you feel that Hebrew folding to unpointed is something that should
> happen every time other accents are folded, we can add Hebrew (or we
> can make a separate folding, HebrewMarksFolding, that people can
> invoke optionally). I tend to prefer the latter. Since for Hebrew (the
> languages), a folding to unpointed might be one of the foldings that
> someone might want to apply permanently, it should be separately
> named and defined, on the principle that the foldings should be
> building blocks.
> 
> A./ 
> 
> 
> 
>
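For concreteness, here is a minimal Python sketch of the kind of folding
discussed in the quoted message. It is not a definition of AccentFolding,
only an illustration under several assumptions: the starting set of
"accents" is taken to be the gc=Mn characters in the 0300 and 1DC0 blocks;
the base-character test approximates sc=Latin/Greek/Cyrillic by looking at
the character name, because Python's unicodedata module does not expose the
Script property; and the names accent_fold, is_accent, is_foldable_base and
the keep parameter are hypothetical, introduced here only to show how a
tailoring hook could work.

```python
import unicodedata

# Assumed starting set: gc=Mn code points in the Combining Diacritical
# Marks block (U+0300..U+036F) and its Supplement (U+1DC0..U+1DFF).
ACCENT_RANGES = ((0x0300, 0x036F), (0x1DC0, 0x1DFF))

# Approximation of sc=Latin / sc=Greek / sc=Cyrillic via name prefixes;
# unicodedata has no Script property, so this is only a stand-in.
BASE_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def is_accent(ch):
    """True if ch is a nonspacing mark inside the assumed accent ranges."""
    cp = ord(ch)
    return (unicodedata.category(ch) == "Mn"
            and any(lo <= cp <= hi for lo, hi in ACCENT_RANGES))


def is_foldable_base(ch, scripts=BASE_SCRIPTS):
    """True if the character name starts with one of the script prefixes."""
    return unicodedata.name(ch, "").startswith(scripts)


def accent_fold(text, scripts=BASE_SCRIPTS, keep=frozenset()):
    """Remove accents from base characters of the given scripts.

    keep is a set of (base, mark) pairs that must survive folding;
    this is the hypothetical tailoring hook.
    """
    out = []
    base = None  # most recent character that is not a mark
    for ch in unicodedata.normalize("NFD", text):
        if is_accent(ch):
            if (base is not None
                    and is_foldable_base(base, scripts)
                    and (base, ch) not in keep):
                continue  # drop the accent
        elif not unicodedata.category(ch).startswith("M"):
            base = ch  # script-specific marks never become the base
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

Because the marks are matched after decomposition, an accent is removed
whether or not a precomposed form exists: accent_fold("x\u0301") gives "x"
just as accent_fold("caf\u00e9") gives "cafe". Hebrew points and Khmer
vowel signs fall outside the assumed accent set and are left alone. A
tailoring that preserves i kratkoe would pass
keep={("\u0438", "\u0306")} (the upper-case pair would need its own
entry). Letters such as U+00F8, which have no canonical decomposition,
are untouched by this sketch; that corresponds to the stroke-overlay
additions mentioned above. Note also that the result is returned in NFC,
so input that was not already normalized comes back normalized.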
