Re: BOM as WJ?

2003-11-21 Thread Asmus Freytag
At 05:52 AM 11/20/2003, Philippe Verdy wrote:
We need a comprehensive new technical report that lists all the exceptions
to the general category system, as these line-breaking or word-breaking or
grapheme cluster breaking properties are orthogonal to the basic GC system
and to the combining class system.
No we don't.

The GC is quite limited. It can at best capture the 'primary' classification
of a character. For many characters, esp. in category Cf, all it knows is
that the character has some behavior that could be interesting, but it is silent
on what that behavior is. The same is largely true for all the P* and Z*
classes, where for line and word breaking, the rules are more fine-grained.
We have two UAXs that deal in detail with these two subjects. Adding a third
UAX on top does not solve a thing.
The expectation that you can derive useful knowledge of text and line boundary
detection from just GC and CC is misguided. You need additional information.
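
As an illustration of the point about Cf, here is a minimal sketch using
Python's standard unicodedata module. The choice of Python and of the sample
characters is an assumption made for illustration and is not part of the
original message. It only shows that the General Category lumps characters
with very different breaking and joining behavior into the same class.

```python
import unicodedata

# Sample characters chosen for illustration only.
SAMPLES = {
    "WORD JOINER": "\u2060",
    "ZERO WIDTH NO-BREAK SPACE": "\ufeff",
    "ZERO WIDTH JOINER": "\u200d",
    "ZERO WIDTH NON-JOINER": "\u200c",
    "SOFT HYPHEN": "\u00ad",
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
}


def general_categories(samples):
    """Map each sample name to its General Category (GC) value."""
    return {name: unicodedata.category(ch) for name, ch in samples.items()}


categories = general_categories(SAMPLES)
for name, gc in categories.items():
    print(f"{name:28} {gc}")
```

The five format characters all report Cf, and SPACE and NO-BREAK SPACE both
report Zs, so the GC alone cannot tell a breaking space from a non-breaking
one.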
A./ 





Re: BOM as WJ?

2003-11-21 Thread Asmus Freytag
At 05:44 AM 11/19/2003, Philippe Verdy wrote:
 However, a couple of paragraphs up, the definition for No-Break
 Space says:

  U+00A0 [No-Break Space] behaves like the following coded
  character sequence: U+FEFF [Zero Width No-Break Space] +
  U+0020 [Space] + U+FEFF [Zero Width No-Break Space].

 Is this something that has slipped by the editors? Or am I missing
 something?
The U+FEFF most certainly should have been replaced by WJ in this
paragraph. The text is still correct, as FEFF must forever retain
its ZWNBSP semantics for backwards compatibility, but it flies
in the face of our attempt to discourage its use in favor of WJ.
A./ 





Re: BOM as WJ?

2003-11-20 Thread Peter Kirk
On 19/11/2003 17:44, Philippe Verdy wrote:

...

This trick doesn't work if any of the CC's are in combining class zero.
   

Of course, but which combining character of combining class 0 needs to
combine with NBSP in a way that affects renderers?
Are you thinking of sequences like NBSP,CGJ?

Or of issues when rendering 07A6;THAANA ABAFILI;Mn;0;NSM;N;
after NBSP,
which of course would be handled only as WJ,SP,WJ,THAANA ABAFILI?
Or of 0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;N; after
NBSP,
rendered as if it was WJ,SP,WJ,CANDRABINDU?
Or of 0903;DEVANAGARI SIGN VISARGA;Mc;0;L;N; after NBSP,
which this time is an Mc character?
Or of all the Indic vowels, which do not seem to really combine with
NBSP but would be rendered as a space followed by a defective isolated form
of the vowel (so without vowel glyphs reordering around the space)?
Just curious...
 

I wasn't thinking of any specific combining character. But I was 
thinking of the general principle that if one wants to display an 
isolated diacritic glyph, which is possible in principle, at least in 
paradigm lists (and code charts!), for any of the characters you list 
above, the recommended way of doing so is to apply them to SP or NBSP. 
Unfortunately there are many problems and undesirable side effects of 
this recommendation.

If we just say that NBSP behaves in all cases in renderers as if it was
WJ,SP,WJ where WJ is reordered with a pseudo-combining class 256, it
solves many problems with the interpretation of NBSP, and it looks as if
NBSP was a space letter; however, NBSP is not a Lo character but really a
Zs whitespace and thus justifiable out of the end margin; also, NBSP does
not prohibit word breaks but only line breaks, so it is more as if it was
in fact LJ,SP,LJ, where LJ is a line joiner, distinct also from ZWJ
(zero-width joiner), which is used to hint ligatures but does not prohibit
any break.
 

Well, WJ itself is actually LJ, because, astonishingly, it does not 
prohibit word breaks (see UAX29). The same holds for ZWNBSP, ZWJ, and ZWNJ. 
As format characters these are ignored when finding word breaks. The 
implication is that A,B,WJ,C,D is a single word, but 
A,B,WJ,SPACE,WJ,C,D and A,B,WJ,$,WJ,C,D are both two words despite 
the obvious attempt to use WJ to force these to be understood as one 
word (and despite the existence of alphabets in which $ is considered 
alphabetic).

As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ 
are not listed, and so as Cf characters are ignored in the line breaking 
algorithm. I note also that the combining mark CGJ is listed as GL and 
so is not CM. The descriptive text of rules LB7a-c implies that CM = 
combining mark whereas this is not in fact true; some combining marks 
are not CM and some CM are not combining marks. In rule LB7b the term 
combining character sequence is used, contrary to its regular defined 
meaning, for a sequence of CM characters and the preceding non-CM character.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: BOM as WJ?

2003-11-20 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
 As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ
 are not listed, and so as Cf characters are ignored in the line breaking
 algorithm. I note also that the combining mark CGJ is listed as GL and
 so is not CM. The descriptive text of rules LB7a-c implies that CM =
 combining mark whereas this is not in fact true; some combining marks
 are not CM and some CM are not combining marks. In rule LB7b the term
 combining character sequence is used, contrary to its regular defined
 meaning, for a sequence of CM characters and the preceding non-CM
 character.

This is further proof that even Unicode's exact terminology must be used with
extreme care, as there are many exceptions, even in _standard_ technical
reports such as the UAXs.

If it were possible, I would suggest performing an audit of the terminology
and classification of all character categories, including in the UTSs. For
now it is just too complicated to comply with each UTR (or even only with the
UAXs and UTSs), as one needs to check simultaneously many sometimes
conflicting properties used by various technical reports.

We need a comprehensive new technical report that lists all the exceptions
to the general category system, as these line-breaking or word-breaking or
grapheme cluster breaking properties are orthogonal to the basic GC system
and to the combining class system.




BOM as WJ?

2003-11-19 Thread Pim Blokland
In the online 4.0 book, chapter 15

http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf

the definition for Word Joiner says:

 Until Unicode 3.1.1, U+FEFF was the only code point with word
 joining semantics, but because it is more commonly used as
 byte order mark, the use of U+2060 [word joiner] to indicate
 word joining is strongly preferred for any new text.

However, a couple of paragraphs up, the definition for No-Break
Space says:

 U+00A0 [No-Break Space] behaves like the following coded
 character sequence: U+FEFF [Zero Width No-Break Space] +
 U+0020 [Space] + U+FEFF [Zero Width No-Break Space].

Is this something that has slipped by the editors? Or am I missing
something?

Pim Blokland




Re: BOM as WJ?

2003-11-19 Thread Philippe Verdy
From: Pim Blokland [EMAIL PROTECTED]
 However, a couple of paragraphs up, the definition for No-Break
 Space says:

  U+00A0 [No-Break Space] behaves like the following coded
  character sequence: U+FEFF [Zero Width No-Break Space] +
  U+0020 [Space] + U+FEFF [Zero Width No-Break Space].

 Is this something that has slipped by the editors? Or am I missing
 something?

The key words of the sentence are "behaves like". That's different from saying
it is equivalent (no, the statement does not say that NBSP is decomposable;
it just illustrates the non-breaking behavior of NBSP on both sides, and
that it is to be represented as if it was a normal space).

But it's true that NBSP is used to join words, and so a better analogy would
be to say:

 U+00A0 [No-Break Space] behaves like the following coded
 character sequence: U+2060 [Word Joiner] +
 U+0020 [Space] + U+2060 [Word Joiner].

I think that the wording of this sentence was not modified as it should have
been. But this does not constitute a breach in the standard, as the
sentence is mostly informative.

Of course, coding a text with ZWNBSP,SP,ZWNBSP instead of just NBSP
would create possible collisions with the current BOM. But it is not invalid
to use the 3-character sequence in the middle of the text. For UTF encoding
schemes that forbid the use of a BOM, ZWNBSP (U+FEFF) must still be
interpreted exactly like the newer WORD JOINER.
There will be no problem with BOM interpretation if a text instead uses
WJ,SP,WJ even at the beginning of the text, which is equally valid (even if a
WJ at the first position of the text looks strange).
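
The BOM collision can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the "utf-8-sig" codec of the Python
standard library stands in for any consumer that drops a leading BOM, and
the helper name read_with_bom_stripping is hypothetical.

```python
FEFF, WJ, SP = "\ufeff", "\u2060", "\u0020"


def read_with_bom_stripping(text):
    """Encode to UTF-8, then decode with a codec that drops one leading BOM."""
    return text.encode("utf-8").decode("utf-8-sig")


old_style = FEFF + SP + FEFF  # ZWNBSP,SP,ZWNBSP at the very start of a text
new_style = WJ + SP + WJ      # WJ,SP,WJ at the very start of a text

old_roundtrip = read_with_bom_stripping(old_style)
new_roundtrip = read_with_bom_stripping(new_style)

print([f"U+{ord(c):04X}" for c in old_roundtrip])
print([f"U+{ord(c):04X}" for c in new_roundtrip])
```

The leading U+FEFF is consumed as a BOM, so the first sequence loses its
first joiner, while the sequence built with U+2060 survives unchanged.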

But there's an opportunity now to use indenting spaces at the beginning of
lines, which may be rendered in paragraphs by keeping the spacing, if the
first WJ is removed from the sequence and successive WJs are merged into a
single one:
SP,WJ,SP,WJ,SP,WJ would then be encoding _roughly_ (not equivalently...)
the same rendered text as:
ZWNJ,NBSP,NBSP,NBSP




Re: BOM as WJ?

2003-11-19 Thread Peter Kirk
On 19/11/2003 01:49, Pim Blokland wrote:

In the online 4.0 book, chapter 15

http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf

the definition for Word Joiner says:

 

Until Unicode 3.1.1, U+FEFF was the only code point with word
joining semantics, but because it is more commonly used as
byte order mark, the use of U+2060 [word joiner] to indicate
word joining is strongly preferred for any new text.
   

 

Perhaps this depends on what is meant by "word joining semantics". I would 
presume this to imply that a word boundary is not permitted at this 
point, but in fact, under the current definitions in UAX29 
(http://www.unicode.org/reports/tr29/tr29-5.html), ZWNBSP, WJ and NBSP are 
all treated as word boundary characters.

However, a couple of paragraphs up, the definition for No-Break
Space says:
 

U+00A0 [No-Break Space] behaves like the following coded
character sequence: U+FEFF [Zero Width No-Break Space] +
U+0020 [Space] + U+FEFF [Zero Width No-Break Space].
   

Is this something that has slipped by the editors? Or am I missing
something?
Pim Blokland
 

Does this equivalence hold when combining characters are applied to the 
NBSP? Is the sequence NBSP, CC (recommended for spacing diacritics, 
where CC is any sequence of combining characters) equivalent to ZWNBSP, 
SP, ZWNBSP, CC? Or should the equivalence be to ZWNBSP, SP, CC, ZWNBSP? 
Is it legal to combine combining characters with ZWNBSP, or WJ, and how 
should this be rendered?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: BOM as WJ?

2003-11-19 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
 Does this equivalence hold when combining characters are applied to the
 NBSP? Is the sequence NBSP, CC (recommended for spacing diacritics,
 where CC is any sequence of combining characters) equivalent to ZWNBSP,
 SP, ZWNBSP, CC? Or should the equivalence be to ZWNBSP, SP, CC, ZWNBSP?
 Is it legal to combine combining characters with ZWNBSP, or WJ, and how
 should this be rendered?

This is not an equivalence: although NBSP should be treated as if it was
WJ,SP,WJ
when it is found in isolation, this does not apply when it is followed by a
combining character (CC). So, NBSP,CC must not be treated as if it was:
WJ,SP,WJ,CC
but rather as:
WJ,SP,CC,WJ
Note here the inversion. Note also that all these sequences are NOT
canonically equivalent, meaning that it is impossible to define a formal
equivalence between NBSP and WJ,SP,WJ.
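
The lack of canonical equivalence can be checked with Python's standard
unicodedata module. This is a minimal sketch added for illustration; the
variable names are arbitrary. NBSP has only a compatibility mapping (tagged
noBreak) to SPACE, and WJ has no decomposition at all, so no canonical
normalization form relates NBSP to WJ,SP,WJ.

```python
import unicodedata

NBSP, SP, WJ = "\u00a0", "\u0020", "\u2060"

# Raw decomposition mappings from the character database.
nbsp_mapping = unicodedata.decomposition(NBSP)
wj_mapping = unicodedata.decomposition(WJ)

# Canonical forms leave NBSP alone; compatibility forms fold it to SPACE.
canonical = {form: unicodedata.normalize(form, NBSP) for form in ("NFC", "NFD")}
compat = {form: unicodedata.normalize(form, NBSP) for form in ("NFKC", "NFKD")}

# The three-character sequence is untouched by canonical normalization.
sequence_nfd = unicodedata.normalize("NFD", WJ + SP + WJ)

print(repr(nbsp_mapping), repr(wj_mapping))
print(canonical, compat)
```

Note that the compatibility forms drop the non-breaking behavior entirely
rather than expressing it with joiners.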




Re: BOM as WJ?

2003-11-19 Thread Philippe Verdy
From: Philippe Verdy [EMAIL PROTECTED]
 So, NBSP,CC must not be treated as if it was:
 WJ,SP,WJ,CC
 but really rather as:
 WJ,SP,CC,WJ
 Note here the inversion.

The inversion here acts as if WJ was a combining character of combining
class 256 (i.e. with a class higher than the combining class of all other
Mn combining characters) and a canonical reordering was applied to the
sequence.

Of course this is not a standard normalization form, but using this pseudo
combining class may help render the last two coded strings (in my quote
above) equivalently in renderers.
This works even in the case where there are multiple diacritics (noted CC1
and CC2 below):
NBSP,CC1,CC2
is then treated as if it was:
WJ,SP,WJ,CC1,CC2
and the pseudo-normalization would then give:
WJ,SP,CC1,CC2,WJ
or:
WJ,SP,CC2,CC1,WJ
(depending on the canonical reordering of CC1 and CC2, i.e. of their
relative combining class)
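
The pseudo-normalization described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not a standard
algorithm: the value 256 is the pseudo combining class proposed in this
message, and the function names expand_nbsp and pseudo_reorder are
hypothetical. The reordering is a stable sort of each run of non-zero
classes, as in canonical reordering.

```python
import unicodedata

NBSP, SP, WJ = "\u00a0", "\u0020", "\u2060"
PSEUDO_WJ_CLASS = 256  # not a real combining class; assumption of this sketch


def pseudo_class(ch):
    """Real canonical combining class, except WJ is treated as class 256."""
    return PSEUDO_WJ_CLASS if ch == WJ else unicodedata.combining(ch)


def expand_nbsp(text):
    """Rewrite every NBSP as WJ,SP,WJ."""
    return text.replace(NBSP, WJ + SP + WJ)


def pseudo_reorder(text):
    """Stable-sort each run of characters with non-zero (pseudo) class."""
    out, i = [], 0
    while i < len(text):
        if pseudo_class(text[i]) == 0:
            out.append(text[i])
            i += 1
            continue
        j = i
        while j < len(text) and pseudo_class(text[j]) != 0:
            j += 1
        out.extend(sorted(text[i:j], key=pseudo_class))  # sorted() is stable
        i = j
    return "".join(out)


ACUTE, DOT_BELOW = "\u0301", "\u0323"  # combining classes 230 and 220
single = pseudo_reorder(expand_nbsp(NBSP + ACUTE))
double = pseudo_reorder(expand_nbsp(NBSP + ACUTE + DOT_BELOW))
print([f"U+{ord(c):04X}" for c in single])
print([f"U+{ord(c):04X}" for c in double])
```

With one diacritic the result is WJ,SP,CC,WJ; with the two marks above, the
mark of class 220 is moved before the mark of class 230, giving
WJ,SP,CC2,CC1,WJ.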




Re: BOM as WJ?

2003-11-19 Thread Peter Kirk
On 19/11/2003 16:26, Philippe Verdy wrote:

From: Philippe Verdy [EMAIL PROTECTED]
 

So, NBSP,CC must not be treated as if it was:
   WJ,SP,WJ,CC
but really rather as:
   WJ,SP,CC,WJ
Note here the inversion.
   

The inversion here acts as if WJ was a combining character of combining
class 256 (i.e. with a class higher than the combining class of all other
Mn combining characters) and a canonical reordering was applied to the
sequence.
Of course this is not a standard normalization form, but using this pseudo
combining class may help render the last two coded strings (in my quote
above) equivalently in renderers.
This works even in the case where there are multiple diacritics (noted CC1
and CC2 below):
   NBSP,CC1,CC2
is then treated as if it was:
   WJ,SP,WJ,CC1,CC2
and the pseudo-normalization would then give:
   WJ,SP,CC1,CC2,WJ
or:
   WJ,SP,CC2,CC1,WJ
(depending on the canonical reordering of CC1 and CC2, i.e. of their
relative combining class)


 

This trick doesn't work if any of the CC's are in combining class zero.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: BOM as WJ?

2003-11-19 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
 Of course this is not a standard normalization form, but using this pseudo
 combining class may help render the last two coded strings (in my quote
 above) equivalently in renderers.
 This works even in the case where there are multiple diacritics (noted CC1
 and CC2 below):
 NBSP,CC1,CC2
 is then treated as if it was:
 WJ,SP,WJ,CC1,CC2
 and the pseudo-normalization would then give:
 WJ,SP,CC1,CC2,WJ
 or:
 WJ,SP,CC2,CC1,WJ
 (depending on the canonical reordering of CC1 and CC2, i.e. of their
 relative combining class)

 This trick doesn't work if any of the CC's are in combining class zero.

Of course, but which combining character of combining class 0 needs to
combine with NBSP in a way that affects renderers?

Are you thinking of sequences like NBSP,CGJ?

Or of issues when rendering 07A6;THAANA ABAFILI;Mn;0;NSM;N;
after NBSP,
which of course would be handled only as WJ,SP,WJ,THAANA ABAFILI?

Or of 0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;N; after
NBSP,
rendered as if it was WJ,SP,WJ,CANDRABINDU?

Or of 0903;DEVANAGARI SIGN VISARGA;Mc;0;L;N; after NBSP,
which this time is an Mc character?

Or of all the Indic vowels, which do not seem to really combine with
NBSP but would be rendered as a space followed by a defective isolated form
of the vowel (so without vowel glyphs reordering around the space)?

Just curious...
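
For reference, the properties of the characters named in the questions above
can be looked up with Python's standard unicodedata module. This is a small
sketch added for illustration; the list and the helper name describe are
arbitrary. All four characters are marks with combining class 0, so a
reordering driven only by combining class leaves them where they are.

```python
import unicodedata

# CGJ, THAANA ABAFILI, DEVANAGARI CANDRABINDU, DEVANAGARI VISARGA
LISTED = ["\u034f", "\u07a6", "\u0901", "\u0903"]


def describe(ch):
    """Return (name, General Category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


rows = [describe(ch) for ch in LISTED]
for name, gc, ccc in rows:
    print(f"{name:30} {gc} {ccc}")
```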

If we just say that NBSP behaves in all cases in renderers as if it was
WJ,SP,WJ where WJ is reordered with a pseudo-combining class 256, it
solves many problems with the interpretation of NBSP, and it looks as if
NBSP was a space letter; however, NBSP is not a Lo character but really a
Zs whitespace and thus justifiable out of the end margin; also, NBSP does
not prohibit word breaks but only line breaks, so it is more as if it was
in fact LJ,SP,LJ, where LJ is a line joiner, distinct also from ZWJ
(zero-width joiner), which is used to hint ligatures but does not prohibit
any break.