I agree with Michael -- diacritic folding is a useful folding to add, independent of the UCA.
Also, Peter's remark that "And it is already covered by the Unicode collation algorithm and default table..." is incorrect. The UCA generally follows our decompositions in determining many primary weights, and we do not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I have felt from the beginning that it was a mistake not to be consistent in our decompositions -- but that is water under the bridge.]

If you look at John's suggested file for diacritic folding (http://www.ccil.org/~cowan/DiacriticFolding.txt), quite a number of its mappings are not reflected in the UCA. Below is a filter of those characters in his file that either:

(a) are not the same as folding to NFD and removing combining marks (marked "!nfd+remove_marks"), or
(b) are not primary equivalents in the UCA (marked "!uca").

There is a proposal being worked on to change the UCA primary weights, e.g., to give the same primary weights to O and O WITH STROKE, but as of this point the UCA does not fold the following cases marked "!uca". (Note that for O and O WITH STROKE this would be the *default* UCA weight; the CLDR already tailors O WITH STROKE above Z for a number of languages.)
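To make criterion (a) concrete before the list itself, here is a minimal sketch using only Python's standard `unicodedata` module. The helper names `nfd_remove_marks` and `fold_diacritics` and the small `EXTRA_FOLDS` table are hypothetical, introduced only for illustration; the four table entries are copied from the list in this message. It is not a full implementation of John's file, and it does not touch the UCA side of the comparison at all.

```python
import unicodedata

# Hypothetical supplementary table: a few mappings copied from the list in
# this message, for characters that have no canonical decomposition.
EXTRA_FOLDS = {
    "\u00d8": "O",  # LATIN CAPITAL LETTER O WITH STROKE
    "\u00f8": "o",  # LATIN SMALL LETTER O WITH STROKE
    "\u0141": "L",  # LATIN CAPITAL LETTER L WITH STROKE
    "\u0142": "l",  # LATIN SMALL LETTER L WITH STROKE
}


def nfd_remove_marks(text):
    """Normalize to NFD, then drop every combining mark (category M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def fold_diacritics(text):
    """NFD + mark removal, followed by the explicit table lookup."""
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in nfd_remove_marks(text))


# Characters with a canonical decomposition are handled by NFD alone.
print(nfd_remove_marks("Dvo\u0159\u00e1k"))      # Dvorak
# U+00D8 has no decomposition, so NFD + mark removal leaves it unchanged.
print(nfd_remove_marks("\u00d8") == "\u00d8")    # True
# With the explicit table, the stroke letter is folded as well.
print(fold_diacritics("S\u00f8ren"))             # Soren
```

The point of the sketch is only that decomposition-based folding and table-based folding differ for exactly the kind of characters flagged "!nfd+remove_marks" below; a real folding would need the complete table.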
============
0181; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH HOOK
0182; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH TOPBAR
0187; 0043; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER C WITH HOOK
0110; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH STROKE
018A; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH HOOK
018B; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH TOPBAR
0191; 0046; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER F WITH HOOK
0193; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH HOOK
01E4; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH STROKE
0126; 0048; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER H WITH STROKE
0197; 0049; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER I WITH STROKE
0198; 004B; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER K WITH HOOK
0141; 004C; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER L WITH STROKE
019D; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LEFT HOOK
0220; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
00D8; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE
019F; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH MIDDLE TILDE
01FE; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
01A4; 0050; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER P WITH HOOK
0166; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH STROKE
01AC; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH HOOK
01AE; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
01B2; 0056; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER V WITH HOOK
01B3; 0059; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Y WITH HOOK
01B5; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH STROKE
0224; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH HOOK
1E9A; 0061; !nfd+remove_marks; !uca #LATIN SMALL LETTER A WITH RIGHT HALF RING
0180; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH STROKE
0183; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH TOPBAR
0253; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH HOOK
0188; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH HOOK
0255; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH CURL
0111; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH STROKE
018C; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TOPBAR
0221; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH CURL
0256; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TAIL
0257; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH HOOK
0192; 0066; !nfd+remove_marks; !uca #LATIN SMALL LETTER F WITH HOOK
01E5; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH STROKE
0260; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH HOOK
0127; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH STROKE
0266; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH HOOK
0268; 0069; !nfd+remove_marks; !uca #LATIN SMALL LETTER I WITH STROKE
029D; 006A; !nfd+remove_marks; !uca #LATIN SMALL LETTER J WITH CROSSED-TAIL
0199; 006B; !nfd+remove_marks; !uca #LATIN SMALL LETTER K WITH HOOK
0140; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE DOT
0142; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH STROKE
019A; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BAR
0234; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH CURL
026B; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE TILDE
026C; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BELT
026D; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH RETROFLEX HOOK
0271; 006D; !nfd+remove_marks; !uca #LATIN SMALL LETTER M WITH HOOK
019E; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LONG RIGHT LEG
0235; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH CURL
0272; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LEFT HOOK
0273; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH RETROFLEX HOOK
00F8; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE
01FF; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE AND ACUTE
01A5; 0070; !nfd+remove_marks; !uca #LATIN SMALL LETTER P WITH HOOK
02A0; 0071; !nfd+remove_marks; !uca #LATIN SMALL LETTER Q WITH HOOK
027C; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH LONG LEG
027D; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH TAIL
0282; 0073; !nfd+remove_marks; !uca #LATIN SMALL LETTER S WITH HOOK
0167; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH STROKE
01AB; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH PALATAL HOOK
01AD; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH HOOK
0236; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH CURL
0288; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH RETROFLEX HOOK
028B; 0076; !nfd+remove_marks; !uca #LATIN SMALL LETTER V WITH HOOK
01B4; 0079; !nfd+remove_marks; !uca #LATIN SMALL LETTER Y WITH HOOK
01B6; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH STROKE
0225; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH HOOK
0290; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH RETROFLEX HOOK
0291; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH CURL
025A; 0259; !nfd+remove_marks; !uca #LATIN SMALL LETTER SCHWA WITH HOOK
0286; 0283; !nfd+remove_marks; !uca #LATIN SMALL LETTER ESH WITH CURL
01BA; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH TAIL
0293; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH CURL
04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER HA WITH DESCENDER
04B3; 0425; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER HA WITH DESCENDER
04F4; 0427; ; !uca #CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
04F8; 042B; ; !uca #CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
04EC; 042D; ; !uca #CYRILLIC CAPITAL LETTER E WITH DIAERESIS
04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE
04D3; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH DIAERESIS
0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN
0493; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH STROKE
0495; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
04D7; 0435; ; !uca #CYRILLIC SMALL LETTER IE WITH BREVE
0497; 0436; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZHE WITH DESCENDER
04DD; 0436; ; !uca #CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
0499; 0437; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZE WITH DESCENDER
04DF; 0437; ; !uca #CYRILLIC SMALL LETTER ZE WITH DIAERESIS
04E5; 0438; ; !uca #CYRILLIC SMALL LETTER I WITH DIAERESIS
048B; 0439; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER SHORT I WITH TAIL
049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS

Mark

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, July 09, 2004 07:40
Subject: Re: Looking for transcription or transliteration standards latin->arabic

> From: "Peter Kirk" <[EMAIL PROTECTED]>
>
> > But Kaplan is referring to something quite different, optionally
> > ignoring diacritics in search operations. This is indeed desirable, so
> > that a single search can match both Dvorak and Dvořák for example, and
> > so that the one doing the search does not need to remember exactly which
> > diacritics are used in the name. And it is already covered by the
> > Unicode collation algorithm and default table, in which diacritics are
> > distinguished only at the second level and so folded by a top level only
> > collation.
>
> (a) If this were true and it were the only need, then case folding would
> also just be "a UCA issue", yet case folding is in the document.
>
> (b) Not everyone uses the UCA who uses Unicode (most of the corporate
> member companies in Unicode -- including IBM -- had alternate collation
> methods that existed prior to the UCA and which to this day support more
> languages, in their databases and operating systems)
>
> (c) Since the operation (diacritic folding) is a valid one that
> implementations may want to do and the UCA is a UTS and thus not required
> for Unicode conformance, it is a sensible folding operation to define.
>
> Does diacritic folding destroy information provided by the distinctions that
> diacritics provide? Of course it does. But then again, the same can be said
> of all foldings. This does not diminish their potential usefulness in
> specific tasks/operations.
>
>
> MichKa [MS]
> NLS Collation/Locale/Keyboard Development
> Globalization Infrastructure and Font Technologies
> Windows International Division

