On Wednesday, July 23, 2003 3:00 AM, Rick McGowan <[EMAIL PROTECTED]> wrote:

> Peter Kirk wrote:
> 
> > And then if (and I know it's a big if) the UTC agrees in principle
> > to allow a change to these combining classes, [...]
> 
> A solution with CGJ has been proposed, which is very general and can
> be applied to this and other such situations.

I am still waiting for a decision that documents it specifically. It could be 
documented in UTR#30 as an optional mapping transformation for Hebrew, and in UCA, 
applied on top of normalization: a sort of extra normalization check for Hebrew, 
whose output would still be a string in a normalization form.

If there is agreement on what the best combining classes should have been, it 
becomes possible to compare the result of an NF normalization with the "corrected" 
order automatically, and to insert a CGJ wherever one is needed to preserve that 
corrected order through NF transforms.
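
The CGJ insertion step could look roughly like the sketch below. It assumes Python's 
standard unicodedata module and an input string whose marks are already in the 
intended order; the function name protect_order is made up for this example. A CGJ 
is inserted before every mark that canonical reordering would otherwise move ahead 
of its predecessor.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0

def protect_order(text):
    """Insert CGJ wherever canonical reordering would swap adjacent marks.

    Hypothetical helper: `text` is assumed to hold its combining marks in
    the intended order already.
    """
    out = []
    prev_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # A mark with a lower non-zero class than its predecessor would be
        # moved in front of it by NFC/NFD; a class-0 CGJ blocks that move.
        if ccc != 0 and prev_ccc > ccc:
            out.append(CGJ)
        out.append(ch)
        prev_ccc = ccc
    return "".join(out)

# LAMED, PATAH (class 17), HIRIQ (class 14): normalization swaps the points.
intended = "\u05DC\u05B7\u05B4"
assert unicodedata.normalize("NFD", intended) == "\u05DC\u05B4\u05B7"

# With the CGJ in place, the intended order survives normalization.
protected = protect_order(intended)
assert protected == "\u05DC\u05B7\u034F\u05B4"
assert unicodedata.normalize("NFD", protected) == protected
```

The sketch only looks at the standard combining classes, so it never adds a CGJ 
where the normalized order and the intended order already agree.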

It could also be applied automatically as a preprocessing step in text renderers, 
much as layout engines currently do for Brahmic vowel signs, or for Thai (TIS620 
based), which already uses a visual encoding order instead of the logical order 
used in other scripts.

So if the correct combining classes are already known, such an algorithm is possible, 
and texts do not need to be reencoded immediately, even if they lack the CGJ 
insertions and are transmitted in an NF form.

The only problem is that CGJ can already be used to change the combining order in 
any script (not only Hebrew), so a text may already be encoded with CGJ to force the 
order of its combining sequences. Can this use of the CGJ character be checked and 
corrected, so that a CGJ is removed automatically when it is not necessary?

That is, is it valid to first filter all CGJ characters out of a string, then check 
the combining order, correct it, and reinsert only the CGJs that are necessary? What 
would be the linguistic impact of such an interpretation?
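
To make the question concrete, here is a rough sketch of that filter, reorder and 
reinsert pipeline in Python. The PREFERRED_CLASS table is invented purely for 
illustration (it ranks dagesh before the vowel points) and is not an agreed set of 
"corrected" classes; the function names are likewise hypothetical.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# Hypothetical "corrected" classes, invented for this example only:
# DAGESH (U+05BC, standard class 21) is ranked before the vowel points.
PREFERRED_CLASS = {"\u05BC": 5}

def preferred(ch):
    # Fall back to the standard class when no correction is listed.
    return PREFERRED_CLASS.get(ch, unicodedata.combining(ch))

def renormalize_with_cgj(text):
    # 1. Filter out every existing CGJ, then bring the rest into NFD.
    stripped = unicodedata.normalize("NFD", text.replace(CGJ, ""))
    # 2. Reorder each run of combining marks by preferred class
    #    (sorted() is stable, so equal classes keep their relative order).
    chars, run = [], []
    for ch in stripped:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            chars.extend(sorted(run, key=preferred))
            run = []
            chars.append(ch)
    chars.extend(sorted(run, key=preferred))
    # 3. Reinsert CGJ only where the standard classes descend, i.e. where
    #    normalization would otherwise undo the preferred order.
    out, prev = [], 0
    for ch in chars:
        ccc = unicodedata.combining(ch)
        if ccc and prev > ccc:
            out.append(CGJ)
        out.append(ch)
        prev = ccc
    return "".join(out)

# BET, PATAH, DAGESH in NF order becomes BET, DAGESH, CGJ, PATAH.
fixed = renormalize_with_cgj("\u05D1\u05B7\u05BC")
assert fixed == "\u05D1\u05BC\u034F\u05B7"
assert unicodedata.normalize("NFD", fixed) == fixed  # stable under NF
assert renormalize_with_cgj(fixed) == fixed          # idempotent
```

Step 1 is exactly the doubtful part: it discards every CGJ the author typed, 
whether or not it was there to control the combining order.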

One case of unnecessary CGJ use would be a sequence like <base, CGJ, diacritic>, 
where <base> is a character with a zero combining class. Applying the previous idea 
would transform it into <base, diacritic>, i.e. the "unnecessary" CGJ would be 
removed. This clearly breaks NF identity, as Unicode NF forms are not supposed to 
remove any combining character (which is why this transformation looks more like a 
UTR#30 mapping, or a UCA tailoring rule).
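
A small check of this point, using Python's standard unicodedata module (the 
choice of "a" and COMBINING ACUTE ACCENT as base and diacritic is arbitrary): none of 
the four normalization forms drops the CGJ, so a process that removes it produces a 
string that is not canonically equivalent to the original.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

with_cgj = "a" + CGJ + "\u0301"  # <base, CGJ, diacritic>
without_cgj = "a\u0301"          # <base, diacritic>

# No normalization form removes the CGJ.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert CGJ in unicodedata.normalize(form, with_cgj)

# The two sequences normalize differently, so they are not equivalent:
# without the CGJ, NFC composes the pair into U+00E1.
assert unicodedata.normalize("NFC", without_cgj) == "\u00E1"
assert unicodedata.normalize("NFC", with_cgj) == with_cgj
```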

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

