I forgot the most important point of all: The goal for UCA 4.0 is to top it up to the Unicode 4.0 repertoire. The timeframe for that is quite short -- it was to have been done some time ago -- and we don't want to make any changes that we would want to pull out later when we work with SC22/WG20. So we will only make "safe and obvious" changes in this version.
Of course, you should still continue to work on any more extensive comments for a later version, so that they are prepared well in advance; after all, all of these issues are on collation features that have been in since 3.1 and before!

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: "Matitiahu Allouche" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "Joan Wardell" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, August 19, 2003 14:55
Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

> On 19/08/2003 14:23, Mark Davis wrote:
>
> >Three points.
> >
> >First, while we try to make the UCA collation table (DUCET) as reasonable as
> >possible for the main languages of a given script, it is not guaranteed to
> >produce the correct sorting for any particular language. The UCA *is* designed
> >so that it provides a default base ordering for all of Unicode, and individual
> >languages can be given tailorings of the DUCET that handle the specifics of
> >their string comparison requirements.
> >
> >Thus if there are changes that improve the handling of the UCA for the major
> >languages using a given script, and do not destabilize others, those are
> >candidates for change in a version. For example, if it turned out that a
> >particular Tamil character (or sequence of characters!) was not sorted correctly
> >according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
> >then it would be a candidate, and should be submitted on the form.
> >
> >
> Understood. On this basis, the DUCET sorting for the Hebrew block should
> be based on the requirements for modern Hebrew, with Yiddish, Ladino etc
> also being taken into account.
>
> >Second, we do and should favor modern language communities when making
> >incompatible tradeoffs. So if we have the choice between making French sort
> >correctly without tailoring, or have Latin sort correctly without tailoring, we
> >should choose the modern community. The Latin community can always use a
> >tailored UCA, in any event.
> >
> >
> Understood. I accept the primacy of the modern language in this case.
> There may be some issues on which the modern language has no
> preference, especially for characters only used in older Hebrew, and in
> such cases it would make sense to follow the preferences of ancient
> Hebrew scholars. If it becomes necessary to use a tailored UCA for
> biblical work, so be it, but I would prefer not to. We have come close
> to having to use a separate set of vowels for biblical Hebrew simply
> because decisions were rushed and then frozen on the basis of modern
> Hebrew requirements. I don't want any danger of falling into the same
> kind of trap with collation.
>
> >Third, there is often a serious confusion between sorting weight and canonical
> >ordering. The fact that a grave accent precedes a cedilla in canonical order is
> >*completely independent of* whatever collation weights each of them has, either
> >in a tailoring or in the DUCET. The only substantive issue is how each of these
> >sorts separately or in combination. And making the combination (sequence) of
> >grave and cedilla sort before grave, after grave, before cedilla, or after
> >cedilla are all possible; all of those can be handled by the UCA as
> >contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
> >information.
> >
> >
> Yes, I understand that the collation weights are quite independent of
> the canonical combining classes. But collation does become trickier
> when the canonical ordering is not the expected one, because of the
> assumption that collation is based on the order of the string i.e. based
> on the first character, then the second etc.
>
> Well, I am glad that contractions provide a way around that problem. So
> perhaps we ought to be looking at using them for Hebrew in DUCET. I
> guess we should consider defining contractions for each case of
> <consonant, dagesh> which differ from the consonant at the second level
> only, perhaps also the same for rafe, and similarly for each combination
> of shin, shin/sin dot and dagesh. The problem comes that the vowels
> intrude between the consonant and the dagesh, and meteg comes before
> shin/sin dot, so there is a potential need for a rather large number of
> contractions, especially if we consider a shin with a right meteg which
> might come out as:
>
> <shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot |
> sin dot}, masora circle>
>
> with the CGJ inhibiting complete canonical reordering, and the shin/sin
> dot must be contracted with the shin.
>
> Perhaps we need to specify that dagesh and shin/sin dot must always come
> BEFORE any CGJ in such combinations so that they don't get separated too
> far from the base character. In fact I think I will change my document
> to specify that.
>
> PS Is there a problem with the Unicode Hebrew list? Nothing seems to
> have appeared on it today, including my previous posting on this thread
> and Mark's reply to it.
>
> --
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
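
For anyone who wants to see the third point in the quoted exchange concretely, here is a minimal Python sketch using only the standard unicodedata module. It shows that canonical ordering is driven by combining classes alone (the grave/cedilla case, then a shin with vowel, dagesh and shin dot, with and without CGJ), and then uses a toy contraction table to show that a weight for the combined sequence can be chosen independently of that order. The helper names (classes, nfd, toy_secondary_key) and the numbers in TOY_WEIGHTS are invented for illustration; they are not DUCET values, and the lookup is only a rough stand-in for the matching the UCA actually specifies.

```python
import unicodedata

GRAVE, CEDILLA = "\u0300", "\u0327"        # combining grave, combining cedilla
SHIN, PATAH = "\u05E9", "\u05B7"           # Hebrew letter shin, point patah
DAGESH, METEG = "\u05BC", "\u05BD"         # Hebrew points dagesh, meteg
SHIN_DOT, CGJ = "\u05C1", "\u034F"         # shin dot, combining grapheme joiner


def classes(s):
    """Canonical combining class of each code point in s."""
    return [unicodedata.combining(c) for c in s]


def nfd(s):
    """Canonical decomposition, which also applies canonical reordering."""
    return unicodedata.normalize("NFD", s)


# Canonical order puts the cedilla (class 202) ahead of the grave (class 230),
# whatever order was typed. This says nothing about collation weights.
typed = "a" + GRAVE + CEDILLA
print(classes(typed), classes(nfd(typed)))

# Hebrew: the vowel is reordered between the consonant and the dagesh.
print(classes(nfd(SHIN + SHIN_DOT + DAGESH + PATAH)))

# CGJ has class 0, so marks on either side of it are not reordered across it.
with_cgj = SHIN + DAGESH + METEG + CGJ + PATAH + SHIN_DOT
print(nfd(with_cgj) == with_cgj)

# Hypothetical weights, NOT real DUCET values. The contraction is keyed in
# canonical (NFD) order because the lookup below runs on the NFD form.
TOY_WEIGHTS = {
    CEDILLA + GRAVE: 5,   # contraction: made to sort before either mark alone
    GRAVE: 10,
    CEDILLA: 20,
}


def toy_secondary_key(s):
    """Longest-match lookup over the NFD form; unlisted characters are skipped."""
    s = nfd(s)
    key, i = [], 0
    while i < len(s):
        for length in (2, 1):
            chunk = s[i:i + length]
            if len(chunk) == length and chunk in TOY_WEIGHTS:
                key.append(TOY_WEIGHTS[chunk])
                i += length
                break
        else:
            i += 1  # no weight listed for this character in the toy table
    return key


for text in ("a" + GRAVE, "a" + CEDILLA, "a" + GRAVE + CEDILLA):
    print(classes(nfd(text)), toy_secondary_key(text))
```

Changing the 5 in TOY_WEIGHTS to 15 or 25 moves the combination after grave or after cedilla without touching the canonical order, which is the independence Mark describes. The cost Peter points out shows up in the table itself: every mark sequence that needs its own weight has to be listed, in the order normalization leaves it.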