Ah, that explains it. You had filed this against ICU, not UCA; that explains why I couldn't find it in the Unicode reports.
A. Final.

> 1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
> or absence of Dagesh is a Secondary difference, while Final/non-Final is a
> Tertiary difference. This is relevant only for letters Kaf and Pe. My
> gut feeling says that Final/non-Final should have precedence over
> Dagesh/no-Dagesh.
> Note that the number of actual cases where this would make a difference is
> probably *very* small.

So there are two issues for final vs non-final: strength and ordering.

A1. Ordering is easy to change; in ICU or UCA we could put the final values before the independent letters. In ICU they are just rules, while in UCA they follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table. The easiest in UCA would be to give the 5 independent forms that have finals the value <isolated>.

Note: there is one minor fallout in ICU: we optimize the sortkey compression of tertiary values of NONE; if we change the ordering then each instance of the <isolated> letters will mean about a 2-3 byte increase in sort-key sizes.

A2. For Strength, it is not as clear cut. If Final vs non-Final is more important than dagesh, etc., the easiest thing is to make it a primary difference; but that would make

Zayin Yod PeFinal

sort before all words

Zayin Yod Pe XXX

But I'm guessing that is probably not desired for Hebrew.

In ICU we could make Final vs non-Final be a secondary difference, and have Dagesh, etc. be tertiary differences. The disadvantage is that people tend to expect the 2nd level to be 'accent-like', and there might be more inconsistencies in practice than you would gain by having the current situation.

In Unicode, the UCA has more production restrictions as per http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it would be a bit harder to make that change.
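
To make the ordering change in A1 concrete, here is a minimal sketch of a three-level comparison in Python. It is not how ICU or the UCA tables are implemented. The primary weights are made-up placeholders, 0x20 is an assumed common secondary weight, and 0x1A is an assumed stand-in for the <isolated> tertiary value; only 0x02 (non-final) and 0x19 (final) come from the weights mentioned in the quoted message.

```python
# Hebrew letters used in the "zip" / "zif" example.
ZAYIN, YOD, PE, FINAL_PE = "\u05D6", "\u05D9", "\u05E4", "\u05E3"

# Hypothetical primary weights: only their relative order matters, and
# Pe and Final Pe deliberately share one primary weight.
PRIMARY = {ZAYIN: 7, YOD: 10, PE: 17, FINAL_PE: 17}

COMMON_SECONDARY = 0x20  # assumed common secondary weight
COMMON_TERTIARY = 0x02   # tertiary weight of plain (non-final) letters

# Current situation: only the final form carries a special tertiary weight.
CURRENT_TERTIARY = {FINAL_PE: 0x19}
# Sketch of A1: the independent form gets a higher tertiary value
# (0x1A is an assumed placeholder for <isolated>).
PROPOSED_TERTIARY = {FINAL_PE: 0x19, PE: 0x1A}


def sort_key(word, tertiary):
    """Build a (primary, secondary, tertiary) key; tuples compare level by level."""
    primary = [PRIMARY[ch] for ch in word]
    secondary = [COMMON_SECONDARY for _ in word]
    tert = [tertiary.get(ch, COMMON_TERTIARY) for ch in word]
    return (primary, secondary, tert)


zip_word = ZAYIN + YOD + PE        # transliterated "zip"
zif_word = ZAYIN + YOD + FINAL_PE  # "zif" (bristle)

current_order = sorted([zif_word, zip_word],
                       key=lambda w: sort_key(w, CURRENT_TERTIARY))
proposed_order = sorted([zip_word, zif_word],
                        key=lambda w: sort_key(w, PROPOSED_TERTIARY))
```

Under these assumptions `current_order` puts the non-final spelling first, while `proposed_order` puts the Final Pe spelling first. Both words still have identical primary and secondary keys, so the strength of the difference is unchanged.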
So if SII would like this change, I'd recommend that we make the ordering change in UCA (which will then affect ICU), but not make a strength change (it would have to be extremely exotic for that to make a difference).

Cf. http://www.unicode.org/charts/collation/chart_Hebrew.html

B. Dagesh

> 2) There is something strange in the combinations of Shin with Dagesh and
> dots: for all other letters, the form without Dagesh sorts before the form
> with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
> combinations with Dagesh. I cannot imagine a justification for that.

We currently have the following in UCA (from UCA 4.0.0d1 (beta)):

05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA

To make this change, we would move Dagesh to after SIN DOT. Question: should it also go after VARIKA or not?
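
As a rough illustration of why moving Dagesh after SIN DOT would change the result, here is a small Python sketch that compares simplified secondary keys. It rests on assumptions that are mine and not part of the UCA data above: the base letter is given an assumed secondary weight of 0x0020, the marks are given rank-based weights that only preserve their relative order, and Dagesh is assumed to precede the Shin dot after canonical reordering (NFD), so its weight is compared first. VARIKA is left out because its position is the open question.

```python
import unicodedata

SHIN = "\u05E9"
DAGESH, RAFE, SHIN_DOT, SIN_DOT = "\u05BC", "\u05BF", "\u05C1", "\u05C2"

BASE_SECONDARY = 0x0020  # assumed secondary weight of a base letter


def rank_weights(order, start=0x00BD):
    """Assign ascending illustrative secondary weights; only the order matters."""
    return {mark: start + i for i, mark in enumerate(order)}


# Relative order of the four marks as in the quoted table.
CURRENT = rank_weights([DAGESH, RAFE, SHIN_DOT, SIN_DOT])
# Hypothetical order with Dagesh moved to after SIN DOT.
PROPOSED = rank_weights([RAFE, SHIN_DOT, SIN_DOT, DAGESH])


def secondary_key(text, weights):
    """Simplified secondary key: one weight per code point after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return [weights.get(ch, BASE_SECONDARY) for ch in decomposed]


shin_with_dot = SHIN + SHIN_DOT
shin_with_dagesh_and_dot = SHIN + DAGESH + SHIN_DOT


def order(weights):
    """Sort the two Shin forms by their simplified secondary keys."""
    return sorted([shin_with_dot, shin_with_dagesh_and_dot],
                  key=lambda t: secondary_key(t, weights))


current_order = order(CURRENT)
proposed_order = order(PROPOSED)
```

With the current relative order the form with Dagesh sorts first, because the Dagesh weight is lower than the Shin dot weight it is compared against. With Dagesh moved after SIN DOT, the form without Dagesh sorts first, which matches the behaviour of the other letters.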
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Matitiahu Allouche" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, August 19, 2003 01:21
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

> Hello, Mark!
>
> There must be some hole in your email archive :-), since you yourself
> expressed your personal take on the issues. On 04/05/03 (probably 4th of
> May rather than 5th of April) you wrote me:
> <QUOTE>
> From: Mark [EMAIL PROTECTED] on 04/05/2003 03:22
> To: Matitiahu Allouche/Israel/[EMAIL PROTECTED]
> cc: Israel Gidali/Israel/[EMAIL PROTECTED]
>
> From: Mark Davis/Cupertino/[EMAIL PROTECTED]
> Subject: Bug on Hebrew Collation
> Importance: Urgent
>
>
> I am working through some collation bugs, and had a question about:
>
> http://www.jtcsv.com/cgibin/icu-bugs/collation?id=1489;user=guest
>
> Mati, your comments look reasonable. I am, however, a little nervous since
> as far as I know, the Israeli government committee had input into the
> basic table for ISO 14651, which is reflected in the UCA. (We don't modify
> it for Hebrew). Can you confirm with them that these tailorings should be
> made?
>
> Mark
> </QUOTE>
>
> I did not formally submit anything to the UTC, though, so I may be
> responsible for my own misfortune. At that time, I had 4 remarks. It
> seems that 2 of them have been implemented, and the 2 others have not.
>
> I have second thoughts about the tertiary weight allocated to final
> letters (0019) as compared to that allocated to non-final letters (0002).
> That means that final letters are collated *after* the corresponding
> non-final letters. This goes against accepted Hebrew usage. In normal
> cases, the non-final letter will be followed by some more letters, so that
> there will be a primary difference, but exotic cases will be sorted
> improperly. An example that comes to mind is transliteration of
> non-Hebrew words. For instance a "zip" file will be transliterated as
> "Zayin Yod Pe" (Google gives 2840 hits for this orthograph). There is a
> Hebrew word pronounced "zif" (meaning "bristle") which is written
> identically except that the last letter is a Final Pe. I expect the "zip"
> file to be collated *after* the "bristle", but this will not happen with
> the current collation table.
>
> I would feel more comfortable if:
> a) Final letters had a smaller weight than the corresponding non-final
> letters (for some level >1).
> b) The level associated with final/non-final was more significant than the
> level associated with diacritics (Dagesh and/or other Hebrew points).
> It is not that I have so many really convincing examples that would be
> broken with the current collation definition, but I think that having
> weights which reflect the linguistic guidelines is more likely to
> successfully handle the cases that we have not considered.
>
> Shalom (Regards), Mati
> Bidi Architect
> Globalization Center Of Competency - Bidirectional Scripts
> IBM Israel
> Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52
> 554160
>
>
> To: Matitiahu Allouche/Israel/[EMAIL PROTECTED]
> cc: <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
> Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
>
>
> I'm sorry that you haven't gotten responses before. I have searched
> through my email archive, and can't find anything like the message, and
> I don't think it was brought up to the UTC formally.
>
> The first one seems odd, and as you say, it would seem to only affect a
> vanishingly small number of characters; since these are final characters,
> one presumes there would be subsequent characters that would form a
> larger difference anyway.
>
> Mark