Peter Kirk posted:

If I want to do this, should I explicitly encode a dotted circle, or
should I encode nothing and expect the font to generate the dotted
circle, as it often does?

I think that the practice of a font or application automatically inserting a dotted circle under an orphaned combining character is of dubious compliance with the Unicode specifications.


In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs:

<< Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. >>

So the various space characters (class Zs) are also classified as format characters.

From http://www.unicode.org/book/ch04.pdf:

<< _D13 Base character:_ a character that does not graphically combine with preceding characters, and that is neither a control nor a format character. >>

Accordingly, by definition, spaces are not base characters.
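The classification step can be checked roughly with Python's standard unicodedata module. This is only a sketch: the categories reported depend on the Unicode version shipped with the Python build, and the helper name is_base_as_read_here is hypothetical. It encodes the reading argued above (separators counted along with control and format characters), not an official Unicode property.

```python
import unicodedata

def is_base_as_read_here(ch: str) -> bool:
    """Hypothetical helper: D13 as it is read in this post.

    A character is treated as non-base if it is a combining mark
    (Mn, Mc, Me), a control or format character (Cc, Cf), or, on the
    reading argued here, a separator (Zs, Zl, Zp).
    """
    cat = unicodedata.category(ch)
    return cat not in {"Mn", "Mc", "Me", "Cc", "Cf", "Zs", "Zl", "Zp"}

# SPACE, NO-BREAK SPACE, EM SPACE, a letter, COMBINING ACUTE ACCENT
for ch in ("\u0020", "\u00A0", "\u2003", "a", "\u0301"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_base_as_read_here(ch))
```

On this reading the three spaces and the combining accent all come out as non-base, and only the letter counts as a base character.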

Also from http://www.unicode.org/book/ch04.pdf:

<< _D14 Combining character:_ a character that graphically combines with a preceding base character. The combining character is said to _apply_ to the base character. >>

So we know what happens when a combining character follows a base character. It combines with it.

What happens when a combining character follows a character that is not a base character or appears initially? The same source explains:

<< o Even though a combining character is intended to be presented in graphical combination with a base character, circumstances may arise where either (1) no base character precedes the combining character or (2) a process is unable to perform graphical combination. In both cases it may present a combining character without graphical combination; that is, it may present it as if it were a base character.

o The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in a graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle. >>

So a display device *may* present an orphaned combining character as suggested.

But the word "may" is weak. Are there other things it may do that would still be compliant with Unicode? May it ignore the character altogether? May it display the character as U+FFFD REPLACEMENT CHARACTER? May it display the combining character over some other character altogether, perhaps even U+25CC DOTTED CIRCLE? This is the only way I can find to justify the display of U+25CC DOTTED CIRCLE in such cases by the Unicode specifications.

But then is there any display that is not acceptable according to these specifications?

Note that even if an application takes the suggestion made here, the combination of the non-base character SPACE followed by a combining character would be rendered as the non-base character SPACE followed by the combining character rendered as a base character. They would not be combined.

From the same source:

<< _D17a Defective combining character sequence:_ a combining character sequence that does not start with a base character.

o Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of handling of combining marks, but are not _ill-formed_. (See D30.) >>

Accordingly any space character followed by a combining character is a defective combining character sequence.
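To make that reading concrete, here is a minimal Python sketch that reports where a defective combining character sequence would start. The function name defective_positions and the NON_BASE set are my own assumptions for illustration. In particular, grouping the separator categories with control and format characters follows the argument in this post, not a published Unicode algorithm.

```python
import unicodedata

MARKS = {"Mn", "Mc", "Me"}
# Assumption: separators are grouped with control/format characters,
# following the reading argued in this post.
NON_BASE = MARKS | {"Cc", "Cf", "Zs", "Zl", "Zp"}

def defective_positions(text: str) -> list:
    """Return the indexes where a run of combining marks starts without a base."""
    hits = []
    prev_cat = None  # None stands for "start of string"
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Only the first mark of a run is examined.
        if cat in MARKS and prev_cat not in MARKS:
            if prev_cat is None or prev_cat in NON_BASE:
                hits.append(i)
        prev_cat = cat
    return hits

print(defective_positions("a\u0301"))   # letter + acute
print(defective_positions(" \u0301"))   # SPACE + acute
print(defective_positions("\u0301a"))   # acute at start of string
```

Under these assumptions a letter followed by an accent yields no hits, while the accent after SPACE and the accent at the start of the string are both reported.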

From http://unicode.org/book/ch07.pdf:

<< *Marks as Spacing Characters.* By convention, combining marks may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NO-BREAK SPACE. This approach might be taken, for example, when referring to the diacritical mark itself as a mark, rather than by using it in its normal way in text. The use of U+0020 SPACE versus U+00A0 NO-BREAK SPACE affects line-break behavior.>>

The words "by convention" are odd. They perhaps acknowledge that this shouldn't work according to the other general Unicode rules and definitions.

This passage, however, does not even hint that "by convention" a dotted circle should appear under the diacritic.

Presumably, if someone wanted a combining character applied to a dotted circle, that person would code U+25CC DOTTED CIRCLE followed by the combining character.
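The alternative encodings discussed here can be written out explicitly. The Python sketch below only builds the strings and prints their code points. How any of them is displayed is up to the font and renderer and is not something this code can show. COMBINING ACUTE ACCENT is my own arbitrary choice of example mark. The dotted circle is looked up by its character name, which Python's unicodedata resolves to U+25CC.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, chosen only as an example mark
DOTTED_CIRCLE = unicodedata.lookup("DOTTED CIRCLE")

candidates = {
    "space + mark": "\u0020" + ACUTE,
    "no-break space + mark": "\u00A0" + ACUTE,
    "dotted circle + mark": DOTTED_CIRCLE + ACUTE,
}

for label, s in candidates.items():
    code_points = [f"U+{ord(c):04X}" for c in s]
    names = [unicodedata.name(c) for c in s]
    print(label, code_points, names)
```

Only the third string says "dotted circle" in the encoded text itself. With the first two, any dotted circle that appears on screen was supplied by the font or application.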

One could fix this messiness by changing the definition of base character to specifically include U+0020 SPACE and U+00A0 NO-BREAK SPACE. That in effect is exactly what the above passage does. So do it in a structured manner, by making it part of the rule instead of burying it in the text as an odd exception to the rule.

But it does seem philosophically odd that U+0020 and U+00A0 alone of the category Zs characters should be especially singled out.

It would be more intuitive if all Zs characters could be included in the category of base characters. Is there any philosophical reason why combining characters should not be applied to the other spaces?
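For reference, the "other spaces" can be listed with a short Python sketch. The exact set depends on the Unicode version bundled with the Python build, so the output on a given machine is not necessarily the set defined by the version of the standard quoted in this post.

```python
import sys
import unicodedata

# Collect every code point whose General Category is Zs (space separator).
zs = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Zs"]

for cp in zs:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

U+0020 SPACE and U+00A0 NO-BREAK SPACE appear in this list alongside the fixed-width spaces such as U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE.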

The combining character might of course increase the width of the space:

Again from http://www.unicode.org/book/ch04.pdf:

<< o Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an "i" may affect spacing ("î"), as might the character U+20DD COMBINING ENCLOSING CIRCLE. >>

In any case, I see nothing in the Unicode specifications that suggests replacing either U+0020 or U+00A0 by U+25CC when followed by a combining character, or applying the combining character to an inserted U+25CC when it is part of a defective combining character sequence.

Jim Allan



_D15 Nonspacing mark:_ a combining character whose positioning in presentation is dependent on the base character. It generally does not
consume space along the visual baseline in and of itself.






