Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

Mark Davis Tue, 19 Aug 2003 16:02:00 -0700

I forgot the most important point of all:

The goal for UCA 4.0 is to top it up to the Unicode 4.0 repertoire. The
timeframe for that is quite short -- it was to have been done some time ago -- 
and we don't want to make any changes that we would want to pull out later when
we work with SC22/WG20. So we will only make "safe and obvious" changes in this
version.


Of course, you should still continue to work on any more extensive comments for
a later version, so that they are prepared well in advance; after all, all of
these issues are on collation features that have been in since 3.1 and before!

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message ----- 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: "Matitiahu Allouche" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "Joan Wardell" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Tuesday, August 19, 2003 14:55
Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)


> On 19/08/2003 14:23, Mark Davis wrote:
>
> >Three points.
> >
> >First, while we try to make the UCA collation table (DUCET) as reasonable as
> >possible for the main languages of a given script, it is not guaranteed to
> >produce the correct sorting for any particular language. The UCA *is* designed
> >so that it provides a default base ordering for all of Unicode, and individual
> >languages can be given tailorings of the DUCET that handle the specifics of
> >their string comparison requirements.
> >
> >Thus if there are changes that improve the handling of the UCA for the major
> >languages using a given script, and do not destabilize others, those are
> >candidates for change in a version. For example, if it turned out that a
> >particular Tamil character (or sequence of characters!) was not sorted correctly
> >according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
> >then it would be a candidate, and should be submitted on the form.
> >
> >
> Understood. On this basis, the DUCET sorting for the Hebrew block should
> be based on the requirements for modern Hebrew, with Yiddish, Ladino etc
> also being taken into account.
>
> >Second, we do and should favor modern language communities when making
> >incompatible tradeoffs. So if we have the choice between making French sort
> >correctly without tailoring, or have Latin sort correctly without tailoring, we
> >should choose the modern community. The Latin community can always use a
> >tailored UCA, in any event.
> >
> >
> Understood. I accept the primacy of the modern language in this case.
> There may be some issues on which the modern language has no
> preference, especially for characters only used in older Hebrew, and in
> such cases it would make sense to follow the preferences of ancient
> Hebrew scholars. If it becomes necessary to use a tailored UCA for
> biblical work, so be it, but I would prefer not to. We have come close
> to having to use a separate set of vowels for biblical Hebrew simply
> because decisions were rushed and then frozen on the basis of modern
> Hebrew requirements. I don't want any danger of falling into the same
> kind of trap with collation.
>
> >Third, there is often a serious confusion between sorting weight and canonical
> >ordering. The fact that a grave accent precedes a cedilla in canonical order is
> >*completely independent of* whatever collation weights each of them has, either
> >in a tailoring or in the DUCET. The only substantive issue is how each of these
> >sorts separately or in combination. And making the combination (sequence) of
> >grave and cedilla sort before grave, after grave, before cedilla, or after
> >cedilla are all possible; all of those can be handled by the UCA as
> >contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
> >information.
> >
> >
> Yes, I understand that the collation weights are quite independent of
> the canonical combining classes. But collation does become trickier
> when the canonical ordering is not the expected one, because of the
> assumption that collation is based on the order of the string i.e. based
> on the first character, then the second etc.
>
> Well, I am glad that contractions provide a way around that problem. So
> perhaps we ought to be looking at using them for Hebrew in DUCET. I
> guess we should consider defining contractions for each case of
> <consonant, dagesh> which differ from the consonant at the second level
> only, perhaps also the same for rafe, and similarly for each combination
> of shin, shin/sin dot and dagesh. The problem is that the vowels
> intrude between the consonant and the dagesh, and meteg comes before
> shin/sin dot, so there is a potential need for a rather large number of
> contractions, especially if we consider a shin with a right meteg which
> might come out as:
>
> <shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot |
> sin dot}, masora circle>
>
> with the CGJ inhibiting complete canonical reordering, and the shin/sin
> dot must be contracted with the shin.
>
> Perhaps we need to specify that dagesh and shin/sin dot must always come
> BEFORE any CGJ in such combinations so that they don't get separated too
> far from the base character. In fact I think I will change my document
> to specify that.
>
> PS Is there a problem with the Unicode Hebrew list? Nothing seems to
> have appeared on it today, including my previous posting on this thread
> and Mark's reply to it.
>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>
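
The reordering Peter describes in the quoted message can be checked with a
minimal sketch, assuming Python's standard unicodedata module. Qamats stands in
here for "any one of 11 vowels", and the helper names are illustrative only,
not part of the UCA or of anyone's proposal. It shows canonical ordering moving
the vowel ahead of dagesh, meteg and shin dot, and a CGJ (combining class 0)
preventing marks from being reordered across it.

```python
import unicodedata

# Code points from the Hebrew block, plus COMBINING GRAPHEME JOINER.
SHIN = "\u05E9"
QAMATS = "\u05B8"    # a vowel point, canonical combining class 18
DAGESH = "\u05BC"    # class 21
METEG = "\u05BD"     # class 22
SHIN_DOT = "\u05C1"  # class 24
CGJ = "\u034F"       # class 0, so it blocks canonical reordering


def combining_classes(s):
    """Return the canonical combining class of each character in s."""
    return [unicodedata.combining(ch) for ch in s]


def canonical_order(s):
    """Return s in canonical order (NFD); these marks have no decompositions."""
    return unicodedata.normalize("NFD", s)


# Without CGJ, the vowel is moved in front of dagesh, meteg and shin dot.
plain = SHIN + DAGESH + METEG + QAMATS + SHIN_DOT
reordered = canonical_order(plain)
print(combining_classes(plain), "->", combining_classes(reordered))

# With CGJ after the meteg, nothing crosses the CGJ, so the order is kept.
with_cgj = SHIN + DAGESH + METEG + CGJ + QAMATS + SHIN_DOT
print(canonical_order(with_cgj) == with_cgj)
```

None of this touches collation weights; it only shows which mark sequences a
contraction table for the Hebrew block would actually encounter once strings
are in canonical order.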
