On Sunday, August 10, 2003 9:17 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> On 10/08/2003 10:09, Michael Everson wrote:
> 
> > At 01:30 +0200 2003-08-10, Philippe Verdy wrote:
> > 
> > > Whatever you think, the SPACE+diacritic is still a hack, and
> > > certainly not a canonical equivalent (including for its
> > > properties), of the existing spacing diacritics, which also do
> > > not fit all usages because they are symbols.
> > 
> > 
> > It is the formally specified way to represent what you say you want
> > to represent. If an implementation doesn't do that nicely enough,
> > complain to the implementors. (This has already been suggested to
> > you.) 


Examples of problems with SPACE+diacritics in UAX#29:

- Grapheme clusters:
"One or more Unicode characters may make up what the user thinks of as a character or 
basic unit of the language. To avoid ambiguity with the computer use of the term 
character, this is called a grapheme cluster. For example, “G” + acute-accent is a 
grapheme cluster: it is thought of as a single character by users, yet is actually 
represented by two Unicode code points. For more information on the ambiguity in the 
term character, see UTR #17: Character Encoding Model
(...)
Grapheme clusters commonly behave as units in terms of mouse selection, arrow 
key movement, backspacing, and so on. When this is done, for example, and an accented 
character is represented by a combining character sequence, then using the right arrow 
key would skip from the start of the base character to the end of the last combining 
character."

So combining sequences like SPACE+diacritics are grapheme clusters.
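
The clustering claim can be checked with a rough sketch. This is not the full UAX#29 algorithm: it assumes that the Extend property can be approximated by the general categories Mn and Me, and the helper names `is_extending` and `simple_clusters` are made up for this illustration.

```python
import unicodedata

def is_extending(ch):
    # Assumption: approximate the Extend property by nonspacing (Mn)
    # and enclosing (Me) marks. The real property covers more characters.
    return unicodedata.category(ch) in ("Mn", "Me")

def simple_clusters(text):
    """Group each character with the marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and is_extending(ch):
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# a, SPACE, COMBINING ACUTE ACCENT, b
sample = "a \u0301b"
print([[f"U+{ord(c):04X}" for c in cl] for cl in simple_clusters(sample)])
# [['U+0061'], ['U+0020', 'U+0301'], ['U+0062']]
```

Under these assumptions, SPACE and the combining accent end up in one cluster, exactly like "G" + acute-accent in the quoted text.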

- Word boundaries:
"(rule 3) Treat a grapheme cluster as if it were a single character: the first 
character of the cluster.
     GC → FC"

This seems to be the only rule that deals with combining sequences and combining 
characters; the other rules ignore them. So SPACE+diacritics is handled like SPACE.

- Sentence boundaries:
"(rule 4) Treat a grapheme cluster as if it were a single character: the first 
character of the cluster.
     GC → FC"

Same problem.
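
A small sketch of what the "GC → FC" rule implies for both the word and the sentence rules. It assumes a simplified cluster model (a character plus any following Mn/Me marks); `is_mark` and `first_characters` are hypothetical helpers, not part of any library.

```python
import unicodedata

def is_mark(ch):
    # Assumption: only Mn and Me marks extend a cluster.
    return unicodedata.category(ch) in ("Mn", "Me")

def first_characters(text):
    """Replace every simplified cluster by its first character (GC -> FC)."""
    # A mark is dropped unless it is the very first character of the text,
    # in which case it starts a cluster of its own.
    return "".join(ch for i, ch in enumerate(text) if i == 0 or not is_mark(ch))

with_mark = "word \u0301next"   # SPACE followed by COMBINING ACUTE ACCENT
plain = "word next"

# After GC -> FC, the later rules see the same input in both cases.
print(first_characters(with_mark) == plain)   # True
```

In this model, the boundary rules cannot tell an isolated diacritic carried by SPACE from an ordinary word-separating SPACE.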

- "6.1 Normalization
Although boundaries are specified in terms of NFD text, in practice normalization is 
not required. The Grapheme Cluster specification has a number of features to 
ensure that the same results are returned for canonically equivalent text. It will not 
break within a sequence of non-spacing marks, which is the only part that can reorder 
in the formation of NFD. Nor is there ever a break between a base character and 
subsequent non-spacing marks. It also has a special set of characters marked as having 
the Extend property value, such as U+09BE ( ◌া ) BENGALI VOWEL SIGN AA, to deal 
with particular compositions.
The other default boundary specifications never break within grapheme clusters, and 
always use a consistent property value for each grapheme cluster as a whole."

This just specifies that there will be no break between the base character SPACE and 
its diacritics, but says nothing about possible breaks after or before the combining 
sequence.

- "6.2 Grapheme Cluster and Format Rules
The first rule for the default word and sentence specifications is to treat a grapheme 
cluster as a single character: the first character of the cluster. This would be 
equivalent to making the following changes to the subsequent rules.
(...)
Insert Extend* after every boundary property value — except after the final property 
after the break symbol.
Thus  X Y × Z W  becomes  X Extend* Y Extend* × Z Extend* W .
Thus  X Y ×  becomes  X Extend* Y Extend* ×"

So rules like "X SPACE ×  Z" become "X Extend* SPACE Extend* ×  Z", one instance of 
which is "X SPACE diacritics ×  Z"
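
The rewriting described in 6.2 can be sketched as a simple token transformation. This assumes that a rule is a whitespace-separated list of property names containing one break symbol (× or ÷); `insert_extend` is a made-up helper for illustration only.

```python
BREAK_SYMBOLS = {"×", "÷"}

def insert_extend(rule):
    """Insert Extend* after every property value, except after the final
    property that follows the break symbol (as quoted from section 6.2)."""
    tokens = rule.split()
    # Assumption: the rule contains at least one break symbol.
    break_at = max(i for i, tok in enumerate(tokens) if tok in BREAK_SYMBOLS)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok in BREAK_SYMBOLS:
            continue
        is_final_after_break = i > break_at and i == len(tokens) - 1
        if not is_final_after_break:
            out.append("Extend*")
    return " ".join(out)

print(insert_extend("X Y × Z W"))    # X Extend* Y Extend* × Z Extend* W
print(insert_extend("X Y ×"))        # X Extend* Y Extend* ×
print(insert_extend("X SPACE × Z"))  # X Extend* SPACE Extend* × Z
```

The first two calls reproduce the two examples quoted from the annex, and the third gives the SPACE case discussed here.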

This is also confirmed by the fact that normalization is explicitly NOT required to 
process text boundaries, which is exactly the place where the use of SPACE causes the 
most important problems for text processing and rendering.
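
As a side check on the normalization point, the following sketch uses Python's standard `unicodedata` module to compare SPACE + U+0301 with the spacing U+00B4 ACUTE ACCENT. The choice of this particular pair is only an example; the exact data depends on the Unicode version shipped with the interpreter, although these values have been stable.

```python
import unicodedata

sequence = " \u0301"   # SPACE + COMBINING ACUTE ACCENT
spacing = "\u00b4"     # ACUTE ACCENT (spacing)

# Canonical normalization leaves both representations untouched,
# so neither one is ever turned into the other.
for form in ("NFC", "NFD"):
    assert unicodedata.normalize(form, sequence) == sequence
    assert unicodedata.normalize(form, spacing) == spacing

# The link between the two is a compatibility mapping only.
print(unicodedata.decomposition(spacing))                 # <compat> 0020 0301
print(unicodedata.normalize("NFKD", spacing) == sequence)  # True
```

So a process that works on canonically normalized text (or on text that is not normalized at all) always sees the two forms as distinct.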

---

Similar problems occur in UAX#14 (line breaking), which overlooks the case of 
SPACE+diacritics and handles it as if it were just the first character of the 
sequence. What is worse is this description:

"SP - Space (A) - (normative)
 0020 SPACE (SP)
The space characters are explicit break opportunities, but spaces at the end of a line 
are not measured for fit. If there is a sequence of space characters, and breaking 
after any of the space characters would result in the same visible line, the line 
breaking position after the last space character in the sequence is the locally most 
optimal one. In other words, since the last character measured for fit is before the 
space character, any number of space characters are kept together invisibly on the 
previous line and the first non-space character starts the next line. NOTE: SPACE, but 
none of the other breaking spaces, is used in determining an indirect break."

This statement clearly ignores the existence of SPACE+diacritics... Same thing for:

"ZW - Zero Width Space (A) - (normative)
 200B ZERO WIDTH SPACE (ZWSP)
This character does not have width. It is used to enable additional (invisible) break 
opportunities wherever SPACE cannot be used."

This shows that ZWSP+diacritics would not work either for isolated Hebrew diacritics 
(where the implied letter is missing).

Note that both of these character classes, ZW and SP, are *normative*. This is further 
evidence that SPACE+diacritics is really a hack that causes many problems in the main 
Unicode standard and its standard annexes.

Similar problems also exist in UAX#9 (the BiDi algorithm), which describes 
problematic normative properties such as the neutrality of the SPACE character in 
mixed-directionality text: where would SPACE+diacritics be displayed if there is a 
directionality change on either side of this combining sequence? No such problem 
occurs with the existing spacing diacritics, which are handled regularly like symbols:

"3.3.3. Resolving Weak Types
Weak types are now resolved one level run at a time. At level run boundaries where the 
type of the character on the other side of the boundary is required, the type assigned 
to sor or eor is used.
Non-spacing marks are now resolved based on the previous characters.
W1. Examine each non-spacing mark (NSM) in the level run, and change the type of the 
NSM to the type of the previous character. If the NSM is at the start of the level 
run, it will get the type of sor.
Assume in this example that sor is R:
  AL  NSM NSM => AL  AL  AL
  sor NSM     => sor R"

Nothing is said elsewhere about diacritics, but here SPACE does not match the "AL" 
bidirectional type used in the example. So the representation is still undefined 
here...
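
A minimal sketch of rule W1 exactly as quoted above (later revisions of UAX#9 may add further cases), applied to SPACE + U+0301. The function name `resolve_w1` is made up; the bidirectional types come from Python's `unicodedata` module.

```python
import unicodedata

def resolve_w1(types, sor):
    """Apply W1 to one level run: every NSM takes the type of the previous
    character, or the type of sor at the start of the run."""
    resolved = []
    previous = sor
    for t in types:
        if t == "NSM":
            t = previous
        resolved.append(t)
        previous = t
    return resolved

# The two examples from the quoted text, with sor = R:
print(resolve_w1(["AL", "NSM", "NSM"], "R"))   # ['AL', 'AL', 'AL']
print(resolve_w1(["NSM"], "R"))                # ['R']

# SPACE + COMBINING ACUTE ACCENT versus the spacing ACUTE ACCENT:
space_acute = [unicodedata.bidirectional(c) for c in " \u0301"]
print(space_acute, "=>", resolve_w1(space_acute, "R"))   # ['WS', 'NSM'] => ['WS', 'WS']
print(unicodedata.bidirectional("\u00b4"))               # ON
```

Read literally, W1 gives the mark the whitespace type of its SPACE base, while the spacing accent is an ordinary neutral (ON). Where the later rules finally place the sequence is not something this sketch tries to settle.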

"L3. Combining marks applied to a right-to-left base character will at this point 
precede their base character. If the rendering engine expects them to follow the base 
characters in the final display process, then the ordering of the marks and the base 
character must be reversed."

As SPACE is directionally neutral, a diacritic applied to it will also be 
directionally neutral, and will inherit the direction of the previous grapheme 
cluster. Other areas of UAX#9 covering joiners and disjoiners will also cause 
problems: how can we join or disjoin a spacing diacritic if it is encoded as a SPACE 
base character plus combining diacritics?

Do I need to say more about this SPACE+diacritics legacy hack, and about the many 
problems or non-interoperable solutions that various implementations offer to work 
around it?

-- 
Philippe.
Spam is not tolerated: any unsolicited message will be
reported to your Internet service providers.
