Re: BOM as WJ?

Peter Kirk Thu, 20 Nov 2003 06:58:48 -0800

On 19/11/2003 17:44, Philippe Verdy wrote:

...

This trick doesn't work if any of the CC's are in combining class zero.


Of course, but which combining character of combining class 0 needs to
combine with NBSP in a way that affects renderers?

Are you thinking of sequences like <NBSP,CGJ>?

Or about issues when rendering <07A6;THAANA ABAFILI;Mn;0;NSM;;;;;N;;;;;>
after <NBSP>
which of course would be handled only as <WJ,SP,WJ,THAANA ABAFILI> ?

Or about: <0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;> after
<NBSP>
rendered as if it was <WJ,SP,WJ,CANDRABINDU> ?

Or about <0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;> after <NBSP>
which is this time a "Mc" character ?

Or about all the Indic vowels which do not seem to be really combining on
NBSP but would be rendered as a space followed by a defective isolated form
of the vowel (so without vowel glyphs reordering around the space) ?

Just curious...

I wasn't thinking of any specific combining character. But I was thinking of the general principle that if one wants to display an isolated diacritic glyph, which is possible in principle, at least in paradigm lists (and code charts!), for any of the characters you list above, the recommended way of doing so is to apply them to SP or NBSP. Unfortunately there are many problems and undesirable side effects of this recommendation.
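
As a rough check of the properties at issue, here is a minimal Python sketch using the standard unicodedata module. It prints the general category and canonical combining class of NBSP, CGJ and the three marks listed above. The helper name describe is only for illustration, and the values reported come from whatever Unicode version the Python build ships with.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, canonical combining class)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))

# NBSP, CGJ, THAANA ABAFILI, DEVANAGARI SIGN CANDRABINDU, DEVANAGARI SIGN VISARGA
for ch in ("\u00A0", "\u034F", "\u07A6", "\u0901", "\u0903"):
    print(describe(ch), unicodedata.name(ch))
```

All of the marks here report combining class 0, so canonical reordering cannot move anything across them.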

If we just say that <NBSP> behaves in all cases in renderers as if it were <WJ,SP,WJ>, where WJ is reordered with a pseudo-combining class 256, it solves many problems with the interpretation of NBSP, and it looks as if NBSP were a space letter. However, NBSP is not a "Lo" character but really a "Zs" whitespace, and thus justifiable out of the end margin. Also, NBSP does not prohibit word breaks but only line breaks, so it is more as if it were in fact <LJ,SP,LJ>, where LJ is a line-joiner, distinct also from ZWJ (zero-width joiner), which is used to hint ligatures but does not prohibit any break.

Well, WJ itself is actually LJ, because, astonishingly, it does not prohibit word breaks (see UAX29). The same applies to ZWNBSP, ZWJ, and ZWNJ. As format characters these are ignored when finding word breaks. The implication is that <A,B,WJ,C,D> is a single word, but <A,B,WJ,SPACE,WJ,C,D> and <A,B,WJ,$,WJ,C,D> are both two words, despite the obvious attempt to use WJ to force these to be understood as one word (and despite the existence of alphabets in which "$" is considered alphabetic).
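
The effect described above can be illustrated with a toy word splitter. This is only a sketch under simplifying assumptions, not the UAX29 algorithm: it drops format (Cf) characters before looking for boundaries and then treats each maximal run of alphabetic characters as a word. The function name toy_words is hypothetical.

```python
import unicodedata

WJ = "\u2060"  # WORD JOINER, general category Cf

def toy_words(text):
    """Split text into runs of letters, ignoring format (Cf) characters."""
    # Simplification: format characters are removed up front, so they can
    # neither create nor prevent a boundary.
    visible = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    words, current = [], []
    for ch in visible:
        if ch.isalpha():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(toy_words("AB" + WJ + "CD"))             # one word
print(toy_words("AB" + WJ + " " + WJ + "CD"))  # two words
print(toy_words("AB" + WJ + "$" + WJ + "CD"))  # two words
```

Because WJ is invisible to the splitter, wrapping SPACE or "$" in WJ changes nothing: the boundary is found exactly where it would be without the joiners.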

As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ are not listed, and so, as Cf characters, are ignored in the line breaking algorithm. I note also that the combining mark CGJ is listed as GL and so is not CM. The descriptive text of rules LB7a-c implies that CM = combining mark, whereas this is not in fact true; some combining marks are not CM and some CM are not combining marks. In rule LB7b the term "combining character sequence" is used, contrary to its regular defined meaning, for a sequence of CM characters and the preceding non-CM character.
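
The general categories involved can be checked with the short Python sketch below. It assumes only the standard unicodedata module, which, as far as I know, exposes the general category but not the line breaking class, so the class assignments quoted above (GL for CGJ, and so on) still have to be read from the UAX14 data. The helper names are illustrative.

```python
import unicodedata

def is_combining_mark(ch):
    """True if the general category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")

def is_format(ch):
    """True if the general category is Cf."""
    return unicodedata.category(ch) == "Cf"

CGJ, WJ, ZWNJ, ZWJ = "\u034F", "\u2060", "\u200C", "\u200D"

# CGJ is a combining mark by general category, whatever its line breaking class.
print("CGJ is a combining mark:", is_combining_mark(CGJ))
# WJ, ZWNJ and ZWJ are all format characters.
print("Cf:", [is_format(ch) for ch in (WJ, ZWNJ, ZWJ)])
```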

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
