On Thursday, August 07, 2003 8:06 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:
> On 06/08/2003 15:47, Philippe Verdy wrote:
>
> > On Wednesday, August 06, 2003 11:48 PM, Peter Kirk
> > <[EMAIL PROTECTED]> wrote:
> >
> > > OK, what kind of markup should I use, in any well-known markup
> > > language, to ensure that an isolated diacritic is centred in the
> > > space between the words before and after it?
> >
> > In plain text, I think that this encoding:
> > ...endOfWord1, SPACE, SPACE, diacritic, SPACE,
> > startOfWord2...
> > is what you need, as it creates the following combining sequences:
> > <...endOfWord1>, <SPACE>, <SPACE, diacritic>, <SPACE>,
> > <startOfWord2...>
>
> Thank you, Philippe. This is where we started. But I noted that some
> current implementations render the space diacritic combination as a
> full width space with the diacritic not centred over it. I suggested
> that this was wrong, that the diacritic should be centred. Doug
> suggested I used markup outside the scope of Unicode.
>
> > ...
> >
> > Another similar case would be the use of a isolated nukta (which
> > normally modifies a following base character): the sequence
> > <nukta, SPACE> is a single combining sequence with a break
> > opportunity. So a sequence like <nukta, SPACE, acute accent>
> > would be unbreakable but would include a break opportunity at its
> > end, unless it is followed by a NBSP.
> > And the sequence <nukta, NBSP, acute accent> would also be
> > unbreakable either in the middle or on both ends.
>
> Tell me more about these nuktas which modify a FOLLOWING base
> character. This is just what I have been told is illegal,
> non-conformant or something. But if this is allowed for nuktas, why
> shouldn't it be allowed for Hebrew holam?

Sorry, I should have checked my code to see exactly which character combines with the following base character.
In fact there's already a special treatment for nukta, which gets internally swapped in front of its base character for glyph processing, and this was a source of confusion for me. (Yes, nuktas have CC=7 and combine with the previous base character, but only in the standard Unicode encoding sequence, not in all legacy codepages, and not in some other text processes that put the nukta in front.)

In fact, I may have been thinking of the Candrabindu, which is combining with CC=230 (above?), except in the Devanagari, Bengali, Gujarati and Oriya scripts, where it is combining but with CC=0 like a base character, and in Telugu and Gurmukhi (Adak Bindi), where it is Mc instead of Mn, i.e. a spacing mark. This reflects a different usage of the Candrabindu in ISCII, and it is a source of difficulty when transcoding from ISCII to Unicode... And I'm not sure whether CC=230 for the Tibetan Candrabindu is really accurate given its specific combining model.

The treatment of Anusvara (or Tibetan JeSuNgaRo, or Gurmukhi Bindi, or Sinhala Anusvaraya) as a combining character with CC=0 is also script-specific, as it is either Mc or Mn. The same may be said of the Visarga signs (or Sinhala Visargaya). Such special treatment is not needed for the Viramas (CC=9), as a Virama more or less behaves like a standard vowel sign, i.e. a regular diacritic.

There is a lot of legacy text for Indian scripts coded in ISCII, whose unified model Unicode treats more or less specially, with its own difficulties. (We can ignore the ISCII font controls, or we can handle the other ISCII control signs the way ISO 2022 handles script-switching controls, which are not encoded in Unicode.) Despite what the Unicode reference documents in its chapter on Brahmic scripts, there is little help there to avoid these confusions, notably because the same chapter covers scripts that have been encoded with distinct character models (notably Thai and Lao).
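The combining classes and general categories mentioned above can be checked directly rather than quoted from memory. What follows is only a minimal sketch, assuming Python and its standard `unicodedata` module; the list of code points is an illustrative selection, and the values printed come from whatever Unicode version is bundled with the interpreter, which may differ from Unicode 3 for some characters.

```python
import unicodedata

# Illustrative selection of code points discussed in this message
# (an assumption of this sketch, not an exhaustive list).
SAMPLES = [
    "\u093C",  # Devanagari nukta
    "\u094D",  # Devanagari virama
    "\u0901",  # Devanagari candrabindu
    "\u0902",  # Devanagari anusvara
    "\u0903",  # Devanagari visarga
    "\u0310",  # generic combining candrabindu
    "\u0F82",  # a Tibetan candrabindu-like sign
]


def describe(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


def report(samples=SAMPLES):
    """Map each character's Unicode name to its (category, combining class)."""
    return {unicodedata.name(ch, "<unnamed>"): describe(ch) for ch in samples}


if __name__ == "__main__":
    for name, (cat, ccc) in report().items():
        print(f"U+ {name:32} gc={cat} ccc={ccc}")
```

Running this lists the category (Mn, Mc, ...) next to the combining class for each sign, which makes the script-specific differences easy to compare side by side; the nukta and the virama report combining classes 7 and 9, as stated above.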
For now the current text in Unicode 3 does not help much to disambiguate things. I hope that this chapter about Indic scripts will be greatly enhanced to cover actual usage, and that Thai and Lao will be discussed separately from the other Indic scripts.

For now, I think that the ISCII and TIS620 standards are much more precise and helpful than the Unicode reference for the scripts they cover, which they do in a different way, with lots of conversion caveats left unexplained. At first read this chapter seems to make prominent reference to ISCII and TIS620, but there are some "quirks" where both references seem to contradict the actual usage of combining sequences. New Unicode properties should be added and specified for these cases, even if combining classes cannot be changed for stability reasons, and neither can the normalized forms: some sequences are considered canonically equivalent, or distinct, when in reality they combine the same way and one form is considered "normal" while the others are non-standard or defective according to the original ISCII or TIS620 standard.

-- Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.