Re: BOM as WJ?
At 05:52 AM 11/20/2003, Philippe Verdy wrote:

We need a comprehensive new technical report that lists all the exceptions to the general category system, as these line-breaking or word-breaking or grapheme cluster breaking properties are orthogonal to the basic GC system and to the combining class system.

No we don't. The GC is quite limited. It can at best capture the 'primary' classification of a character. For many characters, esp. in category Cf, all it knows is that the character has some behavior that could be interesting, but it is silent on what that behavior is. The same is largely true for all the P* and Z* classes, where the rules for line and word breaking are more fine-grained.

We have two UAXs that deal in detail with these two subjects. Adding a third UAX on top does not solve a thing. The expectation that you can derive useful knowledge of text and line boundary detection from just GC and CC is misguided. You need additional information.

A./
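As a rough illustration of the point about GC being silent on behavior, the sketch below looks up the General Category of the characters discussed in this thread using Python's standard `unicodedata` module. The dictionary of labels and the helper name `gc_table` are made up for this example. The joiners all come back as Cf and both spaces as Zs, so GC alone does not say which of them prohibit a line break.

```python
import unicodedata

# Labels are illustrative; the code points are the ones named in the thread.
CHARS = {
    "WJ": "\u2060",          # WORD JOINER
    "ZWNBSP/BOM": "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
    "ZWJ": "\u200d",         # ZERO WIDTH JOINER
    "ZWNJ": "\u200c",        # ZERO WIDTH NON-JOINER
    "NBSP": "\u00a0",        # NO-BREAK SPACE
    "SP": "\u0020",          # SPACE
}

def gc_table():
    """Map each label to its General Category (hypothetical helper)."""
    return {label: unicodedata.category(ch) for label, ch in CHARS.items()}

if __name__ == "__main__":
    for label, gc in gc_table().items():
        print(f"{label:12s} {gc}")
```

The line-breaking and word-breaking properties themselves are not exposed by `unicodedata`; they live in the data files of UAX #14 and UAX #29.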
Re: BOM as WJ?
At 05:44 AM 11/19/2003, Philippe Verdy wrote:

However, a couple of paragraphs up, the definition for No-Break Space says: U+00A0 [No-Break Space] behaves like the following coded character sequence: U+FEFF [Zero Width No-Break Space] + U+0020 [Space] + U+FEFF [Zero Width No-Break Space]. Is this something that has slipped by the editors? Or am I missing something?

The U+FEFF most certainly should have been replaced by WJ in this paragraph. The text is still correct, as FEFF must forever retain its ZWNBSP semantics for backwards compatibility, but it flies in the face of our attempt to discourage its use in favor of WJ.

A./
Re: BOM as WJ?
On 19/11/2003 17:44, Philippe Verdy wrote:

...

This trick doesn't work if any of the CC's are in combining class zero.

Of course, but which combining character of combining class 0 needs to combine with NBSP in a way that affects renderers? Are you thinking of sequences like NBSP,CGJ? Or of issues when rendering 07A6;THAANA ABAFILI;Mn;0;NSM;N; after NBSP, which of course would be handled only as WJ,SP,WJ,THAANA ABAFILI? Or of 0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;N; after NBSP, rendered as if it were WJ,SP,WJ,CANDRABINDU? Or of 0903;DEVANAGARI SIGN VISARGA;Mc;0;L;N; after NBSP, which this time is an Mc character? Or of all the Indic vowels, which do not seem to really combine with NBSP but would be rendered as a space followed by a defective isolated form of the vowel (so without vowel glyphs reordering around the space)? Just curious...

I wasn't thinking of any specific combining character. But I was thinking of the general principle that if one wants to display an isolated diacritic glyph, which is possible in principle, at least in paradigm lists (and code charts!), for any of the characters you list above, the recommended way of doing so is to apply them to SP or NBSP. Unfortunately there are many problems and undesirable side effects of this recommendation.

If we just say that NBSP behaves in all cases in renderers as if it were WJ,SP,WJ, where WJ is reordered with a pseudo-combining class 256, it solves many problems with the interpretation of NBSP, and it looks as if NBSP were a space letter. However, NBSP is not a Lo character but really a Zs whitespace and thus justifiable out of the end margin (also, NBSP does not prohibit word breaks but only line breaks), so it is more as if it were in fact: LJ,SP,LJ where LJ is a line-joiner, distinct also from ZWJ (zero-width joiner), which is used to hint ligatures but does not prohibit any break.

Well, WJ itself is actually LJ, because, astonishingly, it does not prohibit word breaks (see UAX29). Similarly ZWNBS, ZWJ, and ZWNJ. As format characters these are ignored when finding word breaks. The implication is that A,B,WJ,C,D is a single word, but A,B,WJ,SPACE,WJ,C,D and A,B,WJ,$,WJ,C,D are both two words despite the obvious attempt to use WJ to force these to be understood as one word (and despite the existence of alphabets in which $ is considered alphabetic).

As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ are not listed, and so as Cf characters are ignored in the line breaking algorithm. I note also that the combining mark CGJ is listed as GL and so is not CM. The descriptive text of rules LB7a-c implies that CM = combining mark whereas this is not in fact true; some combining marks are not CM and some CM are not combining marks. In rule LB7b the term combining character sequence is used, contrary to its regular defined meaning, for a sequence of CM characters and the preceding non-CM character.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: BOM as WJ?
From: Peter Kirk [EMAIL PROTECTED]

As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ are not listed, and so as Cf characters are ignored in the line breaking algorithm. I note also that the combining mark CGJ is listed as GL and so is not CM. The descriptive text of rules LB7a-c implies that CM = combining mark whereas this is not in fact true; some combining marks are not CM and some CM are not combining marks. In rule LB7b the term combining character sequence is used, contrary to its regular defined meaning, for a sequence of CM characters and the preceding non-CM character.

This is further evidence that even Unicode's exact terminology must be used with extreme care, as there are many exceptions, even in _standard_ technical reports such as the UAXs. If it were possible, I would suggest performing an audit of the terminology and classification of all character categories, including in the UTSs. For now it is just too complicated to comply with each UTR (or even only with the UAXs and UTSs), as one needs to check simultaneously many, sometimes conflicting, properties used by various technical reports.

We need a comprehensive new technical report that lists all the exceptions to the general category system, as these line-breaking or word-breaking or grapheme cluster breaking properties are orthogonal to the basic GC system and to the combining class system.
BOM as WJ?
In the online 4.0 book, chapter 15 http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf the definition for Word Joiner says: Until Unicode 3.1.1, U+FEFF was the only code point with word joining semantics, but because it is more commonly used as byte order mark, the use of U+2060 [word joiner] to indicate word joining is strongly preferred for any new text. However, a couple of paragraphs up, the definition for No-Break Space says: U+00A0 [No-Break Space] behaves like the following coded character sequence: U+FEFF [Zero Width No-Break Space] + U+0020 [Space] + U+FEFF [Zero Width No-Break Space]. Is this something that has slipped by the editors? Or am I missing something? Pim Blokland
Re: BOM as WJ?
From: Pim Blokland [EMAIL PROTECTED]

However, a couple of paragraphs up, the definition for No-Break Space says: U+00A0 [No-Break Space] behaves like the following coded character sequence: U+FEFF [Zero Width No-Break Space] + U+0020 [Space] + U+FEFF [Zero Width No-Break Space]. Is this something that has slipped by the editors? Or am I missing something?

The key words in that sentence are "behaves like". That is different from saying it is equivalent: the statement does not say that NBSP is decomposable; it just illustrates the non-breaking behavior of NBSP on both sides, and that it is to be represented as if it were a normal space. But it's true that NBSP is used to join words, and so a better analogy would be to say: U+00A0 [No-Break Space] behaves like the following coded character sequence: U+2060 [Word Joiner] + U+0020 [Space] + U+2060 [Word Joiner].

I think that the wording of this sentence was not modified as it should have been. But this does not constitute a breach in the standard, as the sentence is mostly informative.

Of course, coding a text with ZWNBSP,SP,ZWNBSP instead of just NBSP would create possible collisions with the current BOM. But it is not invalid to use the 3-character sequence in the middle of the text. For UTF encoding schemes that forbid the use of BOM, ZWNBSP (U+FEFF) must still be interpreted exactly like the newer WORD JOINER. There will be no problem with BOM interpretation if a text instead uses WJ,SP,WJ, even at the beginning of the text, which is equally valid (even if a WJ at the first position of a text looks strange).

But there's an opportunity now to use indenting spaces at the beginning of lines, which may be rendered in paragraphs by keeping the spacing, if the first WJ is removed from the sequence and successive WJ are collated into a single one: SP,WJ,SP,WJ,SP,WJ would then be encoding _roughly_ (not equivalently...) the same rendered text as: ZWNJ,NBSP,NBSP,NBSP
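The possible collision with the BOM can be sketched with Python's standard `utf-8-sig` codec, which drops one leading U+FEFF on decoding. The helper name `roundtrip_utf8_sig` is made up for this example, and the choice of codec is an assumption about how a receiving process might treat the data; it is not something the standard mandates.

```python
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, also used as the BOM
WJ = "\u2060"      # WORD JOINER
SP = "\u0020"

def roundtrip_utf8_sig(text):
    """Encode as plain UTF-8, then decode with a BOM-stripping codec.

    Hypothetical helper: it models a receiver that treats a leading
    U+FEFF as a signature rather than as text.
    """
    return text.encode("utf-8").decode("utf-8-sig")

if __name__ == "__main__":
    old_style = ZWNBSP + SP + ZWNBSP + "abc"
    new_style = WJ + SP + WJ + "abc"
    # The leading ZWNBSP is taken for a BOM and lost; the WJ survives.
    print([hex(ord(c)) for c in roundtrip_utf8_sig(old_style)])
    print([hex(ord(c)) for c in roundtrip_utf8_sig(new_style)])
```

Only the first U+FEFF is affected; the one in the middle of the text passes through unchanged, consistent with the remark that the sequence is not invalid mid-text.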
Re: BOM as WJ?
On 19/11/2003 01:49, Pim Blokland wrote:

In the online 4.0 book, chapter 15 http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf the definition for Word Joiner says: Until Unicode 3.1.1, U+FEFF was the only code point with word joining semantics, but because it is more commonly used as byte order mark, the use of U+2060 [word joiner] to indicate word joining is strongly preferred for any new text.

Perhaps this depends what is meant by word joining semantics. I would presume this to imply that a word boundary is not permitted at this point, but in fact on the current definitions in UAX29 (http://www.unicode.org/reports/tr29/tr29-5.html) ZWNBS, WJ and NBSP are all treated as word boundary characters.

However, a couple of paragraphs up, the definition for No-Break Space says: U+00A0 [No-Break Space] behaves like the following coded character sequence: U+FEFF [Zero Width No-Break Space] + U+0020 [Space] + U+FEFF [Zero Width No-Break Space]. Is this something that has slipped by the editors? Or am I missing something? Pim Blokland

Does this equivalence hold when combining characters are applied to the NBSP? Is the sequence NBSP, CC (recommended for spacing diacritics, where CC is any sequence of combining characters) equivalent to ZWNBS, SP, ZWNBS, CC? Or should the equivalence be to ZWNBS, SP, CC, ZWNBS? Is it legal to combine combining characters with ZWNBS, or WJ, and how should this be rendered?

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: BOM as WJ?
From: Peter Kirk [EMAIL PROTECTED]

Does this equivalence hold when combining characters are applied to the NBSP? Is the sequence NBSP, CC (recommended for spacing diacritics, where CC is any sequence of combining characters) equivalent to ZWNBS, SP, ZWNBS, CC? Or should the equivalence be to ZWNBS, SP, CC, ZWNBS? Is it legal to combine combining characters with ZWNBS, or WJ, and how should this be rendered?

This is not an equivalence: although NBSP should be treated as if it were WJ,SP,WJ when it occurs in isolation, this does not apply when it is followed by a combining character (CC). So, NBSP,CC must not be treated as if it were: WJ,SP,WJ,CC but rather as: WJ,SP,CC,WJ Note the inversion here.

Note also that all these sequences are NOT canonically equivalent, meaning that it is impossible to define a formal equivalence between NBSP and WJ,SP,WJ.
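The lack of canonical equivalence can be checked with the standard `unicodedata` module. This is only a sketch; the helper name `same_under` is made up for this example. NBSP carries a compatibility (not canonical) decomposition to SPACE, and WJ has no decomposition at all, so no normalization form maps NBSP and WJ,SP,WJ onto each other.

```python
import unicodedata

NBSP = "\u00a0"
WJ = "\u2060"
SP = "\u0020"

def same_under(form, a, b):
    """True if a and b normalize to the same string under the given form.

    Hypothetical helper; `form` is one of "NFC", "NFD", "NFKC", "NFKD".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

if __name__ == "__main__":
    print(unicodedata.decomposition(NBSP))  # compatibility mapping of NBSP
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, same_under(form, NBSP, WJ + SP + WJ))
```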
Re: BOM as WJ?
From: Philippe Verdy [EMAIL PROTECTED]

So, NBSP,CC must not be treated as if it was: WJ,SP,WJ,CC but really rather as: WJ,SP,CC,WJ Note here the inversion.

The inversion here acts as if WJ were a combining character of combining class 256 (i.e. with a class higher than the combining class of all other Mn combining characters) and a canonical reordering were applied to the sequence. Of course this is not a standard normalization form, but using this pseudo combining class may help renderers treat the last two coded strings (in my quote above) equivalently.

This works even in the case where there are multiple diacritics (noted CC1 and CC2 below): NBSP,CC1,CC2 is then treated as if it were: WJ,SP,WJ,CC1,CC2 and the pseudo-normalization then gives: WJ,SP,CC1,CC2,WJ or: WJ,SP,CC2,CC1,WJ (depending on the canonical reordering of CC1 and CC2, i.e. on their relative combining classes)
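A minimal sketch of the pseudo-normalization described in this message, under stated assumptions: the class value 256 is not a real Unicode property value, and the function names `pseudo_class`, `expand_nbsp` and `pseudo_reorder` are made up for this example. The reordering imitates canonical ordering, that is, a stable sort by combining class within each run of characters whose class is non-zero.

```python
import unicodedata

WJ = "\u2060"
SP = "\u0020"
NBSP = "\u00a0"
PSEUDO_CLASS_WJ = 256  # hypothetical value, above every real combining class

def pseudo_class(ch):
    """Real combining class, except that WJ is given the pseudo class."""
    return PSEUDO_CLASS_WJ if ch == WJ else unicodedata.combining(ch)

def expand_nbsp(text):
    """Replace each NBSP by WJ,SP,WJ (a rendering model, not an equivalence)."""
    return text.replace(NBSP, WJ + SP + WJ)

def pseudo_reorder(text):
    """Stable-sort each run of non-zero-class characters by pseudo class."""
    out, run = [], []
    for ch in text:
        if pseudo_class(ch) == 0:
            out.extend(sorted(run, key=pseudo_class))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=pseudo_class))
    return "".join(out)

if __name__ == "__main__":
    # CC1 = U+0301 (class 230), CC2 = U+0323 (class 220)
    result = pseudo_reorder(expand_nbsp(NBSP + "\u0301\u0323"))
    print([f"U+{ord(c):04X}" for c in result])
```

In this sketch a character of combining class zero ends the current run, so a WJ placed before such a character stays where it is.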
Re: BOM as WJ?
On 19/11/2003 16:26, Philippe Verdy wrote:

From: Philippe Verdy [EMAIL PROTECTED] So, NBSP,CC must not be treated as if it was: WJ,SP,WJ,CC but really rather as: WJ,SP,CC,WJ Note here the inversion.

The inversion here acts as if WJ was a combining character of combining class 256 (i.e. with a class higher than the combining class of all other Mn combining characters) and a canonical reordering was applied to the sequence. Of course this is not a standard normalization form, but using this pseudo combining class may help render the last two coded strings (in my quote above) equivalently in renderers.

This works even in the case where there are multiple diacritics (noted CC1 and CC2 below): NBSP,CC1,CC2 is then treated as if it was: WJ,SP,WJ,CC1,CC2 and then the pseudo-normalization had given: WJ,SP,CC1,CC2,WJ or: WJ,SP,CC2,CC1,WJ (depending on the canonical reordering of CC1 and CC2, i.e. of their relative combining class)

This trick doesn't work if any of the CC's are in combining class zero.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: BOM as WJ?
From: Peter Kirk [EMAIL PROTECTED]

Of course this is not a standard normalization form, but using this pseudo combining class may help render the last two coded strings (in my quote above) equivalently in renderers. This works even in the case where there are multiple diacritics (noted CC1 and CC2 below): NBSP,CC1,CC2 is then treated as if it was: WJ,SP,WJ,CC1,CC2 and then the pseudo-normalization had given: WJ,SP,CC1,CC2,WJ or: WJ,SP,CC2,CC1,WJ (depending on the canonical reordering of CC1 and CC2, i.e. of their relative combining class)

This trick doesn't work if any of the CC's are in combining class zero.

Of course, but which combining character of combining class 0 needs to combine with NBSP in a way that affects renderers? Are you thinking of sequences like NBSP,CGJ? Or of issues when rendering 07A6;THAANA ABAFILI;Mn;0;NSM;N; after NBSP, which of course would be handled only as WJ,SP,WJ,THAANA ABAFILI? Or of 0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;N; after NBSP, rendered as if it were WJ,SP,WJ,CANDRABINDU? Or of 0903;DEVANAGARI SIGN VISARGA;Mc;0;L;N; after NBSP, which this time is an Mc character? Or of all the Indic vowels, which do not seem to really combine with NBSP but would be rendered as a space followed by a defective isolated form of the vowel (so without vowel glyphs reordering around the space)? Just curious...

If we just say that NBSP behaves in all cases in renderers as if it were WJ,SP,WJ, where WJ is reordered with a pseudo-combining class 256, it solves many problems with the interpretation of NBSP, and it looks as if NBSP were a space letter. However, NBSP is not a Lo character but really a Zs whitespace and thus justifiable out of the end margin (also, NBSP does not prohibit word breaks but only line breaks), so it is more as if it were in fact: LJ,SP,LJ where LJ is a line-joiner, distinct also from ZWJ (zero-width joiner), which is used to hint ligatures but does not prohibit any break.
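For reference, the properties quoted above for the class-zero marks can be read back from Python's standard `unicodedata` module. This is a small sketch; the helper name `describe` and the list `CODE_POINTS` are made up for this example, and U+034F (CGJ) is included because it is mentioned in the same paragraph.

```python
import unicodedata

# Code points named in the message above, plus CGJ (U+034F).
CODE_POINTS = [0x07A6, 0x0901, 0x0903, 0x034F]

def describe(cp):
    """Return (name, General Category, canonical combining class)."""
    ch = chr(cp)
    return (unicodedata.name(ch), unicodedata.category(ch),
            unicodedata.combining(ch))

if __name__ == "__main__":
    for cp in CODE_POINTS:
        name, gc, ccc = describe(cp)
        print(f"U+{cp:04X} {name}: gc={gc} ccc={ccc}")
```

All four have combining class 0, so none of them would take part in a reordering that is driven by combining class.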