On Wednesday, July 23, 2003 3:00 AM, Rick McGowan <[EMAIL PROTECTED]> wrote:
> Peter Kirk wrote:
>
> > And then if (and I know it's a big if) the UTC agrees in principle
> > to allow a change to these combining classes, [...]
>
> A solution with CGJ has been proposed, which is very general and can
> be applied to this and other such situations.

We are still waiting for a decision documenting it specifically. It could be documented in UTR#30 as an optional mapping transformation for Hebrew, and in UCA, added on top of normalization: a sort of extra normalization checker for Hebrew that would still produce a string in a normalization form.

If there is agreement about what the best combining classes should have been, then it becomes possible to compare automatically the result of an NF normalization with the "corrected" order, and to insert a CGJ automatically wherever one is needed to keep that corrected order through NF transforms.

It could also be used automatically as a preprocessing step in text renderers, in a way similar to what layout engines currently do for Brahmic vowel signs, or for Thai (TIS620 based), which already uses a visual encoding order instead of the logical order used in other scripts.

So if the correct combining classes are already known, such an algorithm is possible, and texts do not need to be reencoded immediately, even if they lack the CGJ characters and are transmitted in NF form.

The only problem is that CGJ can already be used to change the combining order in any script (not only Hebrew), and a text could already be encoded with CGJ to force the order of combining sequences. Can this use of CGJ be checked and corrected so that it is removed automatically when it is not necessary? That is, is it valid to first filter all CGJ characters from a string, then check the combining order, correct it, and insert only those CGJ characters that are necessary? What would be the linguistic impact of such an interpretation?

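As a rough illustration of the filter-then-reinsert idea above, here is a minimal Python sketch. It is not something proposed by the UTC or elsewhere in this thread: the CORRECTED_CCC table and the function names are hypothetical, and the single entry in the table (dagesh sorting before the vowel points) is only a placeholder for whatever "corrected" classes might eventually be agreed on. The real combining classes are read from Python's standard unicodedata module. Note that step 1 strips every existing CGJ, which is exactly the step whose validity is questioned above.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER (canonical combining class 0)

# HYPOTHETICAL "corrected" combining classes, for illustration only.
# These values come from no standard; marks not listed here fall back
# to their real canonical combining class.
CORRECTED_CCC = {
    "\u05BC": 5,  # HEBREW POINT DAGESH: assumed to sort before the vowels
}


def corrected_class(ch: str) -> int:
    """Return the assumed corrected class, or the real one if unlisted."""
    return CORRECTED_CCC.get(ch, unicodedata.combining(ch))


def protect_order(text: str) -> str:
    """Put marks in the corrected order and insert CGJ only where needed.

    The result is built so that canonical reordering leaves it alone.
    """
    # Step 1: drop every existing CGJ (the debatable filtering step).
    stripped = text.replace(CGJ, "")
    # Step 2: decompose, so marks arrive in standard canonical order.
    nfd = unicodedata.normalize("NFD", stripped)

    out = []
    i = 0
    while i < len(nfd):
        if unicodedata.combining(nfd[i]) == 0:
            out.append(nfd[i])  # starter: copy through unchanged
            i += 1
            continue
        # Collect one run of marks with a non-zero real combining class.
        j = i
        while j < len(nfd) and unicodedata.combining(nfd[j]) != 0:
            j += 1
        # Step 3: stable sort of the run by the corrected classes.
        marks = sorted(nfd[i:j], key=corrected_class)
        # Step 4: wherever the real classes would swap two neighbours,
        # put a CGJ between them to block the reordering.
        prev = None
        for mark in marks:
            if prev is not None and (
                unicodedata.combining(prev) > unicodedata.combining(mark)
            ):
                out.append(CGJ)
            out.append(mark)
            prev = mark
        i = j
    return "".join(out)
```

With this placeholder table, <bet, patah, dagesh> (U+05D1 U+05B7 U+05BC) comes out as <bet, dagesh, CGJ, patah>, and normalizing that result again leaves it unchanged. On the other hand, "a" followed by CGJ and U+0301 comes out without its CGJ, so the output is not canonically equivalent to the input.
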
One case of unnecessary use of CGJ would be a sequence like <base, CGJ, diacritic>, where <base> is a character with a zero combining class. Applying the previous idea would transform it into <base, diacritic>, i.e. the "unnecessary" CGJ would be removed. This clearly breaks the identity of NF forms, as Unicode NF forms are not supposed to remove any combining character (that is why this transformation looks more like a UTR#30 mapping, or a UCA tailoring rule).

-- Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.