From: "Pim Blokland" <[EMAIL PROTECTED]>

> However, a couple of paragraphs up, the definition for No-Break
> Space says:
>
> > U+00A0 [No-Break Space] behaves like the following coded
> > character sequence: U+FEFF [Zero Width No-Break Space] +
> > U+0020 [Space] + U+FEFF [Zero Width No-Break Space].
>
> Is this something that has slipped by the editors? Or am I missing
> something?

The key phrase in that sentence is "behaves like". That is different from saying the two are equivalent: the statement does not say that NBSP is decomposable. It only illustrates the non-breaking behavior of NBSP on both sides, and that it is to be represented as if it were a normal space.

But it's true that NBSP is used to join words, and so a better analogy would be to say:

> U+00A0 [No-Break Space] behaves like the following coded
> character sequence: U+2060 [Word Joiner] +
> U+0020 [Space] + U+2060 [Word Joiner].

I think that the wording of this sentence was not updated as it should have been. But this does not constitute a breach in the standard, as the sentence is mostly informative.

Of course, coding a text with <ZWNBSP,SP,ZWNBSP> instead of just <NBSP> would create possible collisions with the current BOM use of U+FEFF when the sequence starts the text. But it is not invalid to use the 3-character sequence in the middle of the text. For UTF encoding schemes that forbid the use of a BOM, ZWNBSP (U+FEFF) must still be interpreted exactly like the newer WORD JOINER.

There will be no problem with BOM interpretation if a text uses <WJ,SP,WJ> instead, even at the beginning of the text, which is equally valid (even if a WJ in the first position of a text looks strange).

But there's now an opportunity to use indenting spaces at the beginning of lines, which may be rendered in paragraphs with the spacing kept, if the first WJ is removed from the sequence and successive WJs are collapsed into a single one:

<SP,WJ,SP,WJ,SP,WJ>

would then encode _roughly_ (not equivalently...) the same rendered text as:

<ZWNJ,NBSP,NBSP,NBSP>
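
To make the BOM collision concrete, here is a minimal Python sketch. It assumes a reader that strips a leading U+FEFF as a signature, which is what Python's "utf-8-sig" decoder does. The helper name roundtrip_utf8_sig and the variable names are made up for this illustration. It shows only what happens to the code points at the start of a text, not any line-breaking behavior.

```python
import unicodedata

ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, also used as the BOM
WJ = "\u2060"      # WORD JOINER
SP = "\u0020"      # SPACE


def roundtrip_utf8_sig(text: str) -> str:
    """Encode as plain UTF-8, then decode with a BOM-stripping decoder.

    Hypothetical helper: it stands in for any reader that treats a
    leading U+FEFF as a byte order mark and discards it.
    """
    return text.encode("utf-8").decode("utf-8-sig")


legacy = ZWNBSP + SP + ZWNBSP  # <ZWNBSP,SP,ZWNBSP>
modern = WJ + SP + WJ          # <WJ,SP,WJ>

# At the very start of a text, the first ZWNBSP is taken for a BOM and lost.
legacy_survives = roundtrip_utf8_sig(legacy) == legacy
# WJ has no BOM role, so the sequence comes back unchanged.
modern_survives = roundtrip_utf8_sig(modern) == modern

for label, seq in (("legacy", legacy), ("modern", modern)):
    before = [unicodedata.name(ch) for ch in seq]
    after = [unicodedata.name(ch) for ch in roundtrip_utf8_sig(seq)]
    print(label, "before:", before)
    print(label, "after: ", after)
```

In the middle of a text neither sequence is touched by such a decoder; only the first position is affected, which matches the point above that <WJ,SP,WJ> is safe even at the beginning of a text.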