On Wednesday, August 06, 2003 1:59 AM, Curtis Clark <[EMAIL PROTECTED]> wrote:

> on 2003-08-05 15:31 Peter Kirk wrote:
> > Thank you, Mark. This helps to clarify things, but still doesn't
> > explicitly answer my question of how to encode "a sentence like "In
> > this language the diacritic ^ may appear above the letters ...",
> > but instead of ^ I want to use a combining character"  and want to
> > display exactly one space before the combining character - do I
> > encode two spaces or one? 
> 
> In this language the diacritic  ̊ may appear above the letters...
> 
> Two spaces, at least in Thunderbird Mail.

The NFD decompositions of spacing marks are already defined as a SPACE
plus a non-spacing combining character. This officially documents the
usage of SPACE as a base character, and its use in combining sequences.
In the context of XML processing, where strings should (must?) be
presented in NFC form, this extra SPACE will be invisible, hidden within the
precomposed sequence, so this space does not have the line-breaking
property.
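
The decomposition mentioned above can be inspected directly. The sketch
below uses Python's standard unicodedata module and takes U+02DA RING
ABOVE as an arbitrary example of a spacing mark; the variable names are
only illustrative. One caveat worth checking for yourself: in the data
Python ships, this mapping is tagged <compat>, so it appears to be the
compatibility forms (NFKD) rather than NFD that expose the SPACE, and
NFC leaves a SPACE plus combining ring sequence as it is.

```python
import unicodedata

RING_ABOVE = "\u02DA"            # spacing RING ABOVE, chosen only as an example
COMBINING_RING_ABOVE = "\u030A"  # its non-spacing counterpart

def code_points(s):
    """Render a string as ASCII code point labels, e.g. 'U+0020 U+030A'."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Raw decomposition mapping from the Unicode Character Database.
mapping = unicodedata.decomposition(RING_ABOVE)

nfd = unicodedata.normalize("NFD", RING_ABOVE)
nfkd = unicodedata.normalize("NFKD", RING_ABOVE)
# What NFC does with an explicitly encoded SPACE + combining mark.
nfc_of_sequence = unicodedata.normalize("NFC", " " + COMBINING_RING_ABOVE)

print("mapping:", mapping)
print("NFD:    ", code_points(nfd))
print("NFKD:   ", code_points(nfkd))
print("NFC of SPACE + U+030A:", code_points(nfc_of_sequence))
```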

Breaking properties apply only to combining sequences, not to isolated
encoded characters. It's illegal to break in the middle of a combining
sequence. So as soon as a SPACE is followed by a combining character,
it loses its breaking properties, as those properties are only defined for
the combining sequence containing only a SPACE. So I don't think there's
any ambiguity: parsers and renderers must correctly identify combining
sequences before applying any algorithm.
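
A minimal sketch of that order of operations, assuming a deliberately
simplified model: a combining sequence is one base character plus any
following characters whose general category starts with "M", and a break
is allowed only after a sequence that consists of a lone SPACE. The
function names are hypothetical, and this illustrates the rule as stated
here; it is not an implementation of the Unicode line breaking algorithm
(UAX #14), which has its own rules for combining marks after spaces.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def combining_sequences(text):
    """Group each base character with the combining marks that follow it."""
    sequences = []
    for ch in text:
        if sequences and is_mark(ch):
            sequences[-1] += ch   # attach the mark to the preceding base
        else:
            sequences.append(ch)  # start a new sequence
    return sequences

def break_opportunities(text):
    """Offsets into text after which a break is allowed (simplified rule).

    Only a sequence made of a single SPACE offers a break; a SPACE that
    carries a combining mark is treated as an unbreakable unit.
    """
    positions = []
    offset = 0
    for seq in combining_sequences(text):
        offset += len(seq)
        if seq == " ":
            positions.append(offset)
    return positions

sample = "ab  \u030Acd ef"  # second SPACE is the base of U+030A
print(break_opportunities(sample))
```

In the sample, the first SPACE offers a break and the second does not,
because it has been absorbed into the sequence with the combining ring.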

This means that an algorithm like normalization of whitespace sequences
in XML or HTML should not include SPACEs that are used as base
characters in a combining sequence, and so it should keep two spaces
if the intent is to encode a logical space followed by a logical spacing
diacritic. (This is not a problem for XML which processes strings in their
NFC form).
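
As a rough illustration of such a whitespace normalization, the sketch
below collapses runs of XML whitespace to one SPACE but leaves untouched
any SPACE that is immediately followed by a combining mark. The helper
name collapse_whitespace is hypothetical and does not correspond to any
XML or HTML library function; treating "general category M" as the test
for a combining mark is likewise an assumption of this sketch.

```python
import unicodedata

XML_WHITESPACE = " \t\r\n"

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def collapse_whitespace(text):
    """Collapse whitespace runs, preserving SPACEs used as base characters."""
    out = []
    pending_space = False  # a collapsed run is waiting to be emitted
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == " " and nxt and is_mark(nxt):
            # This SPACE is the base of a combining sequence: emit any
            # pending collapsed space first, then keep the base verbatim.
            if pending_space:
                out.append(" ")
                pending_space = False
            out.append(ch)
        elif ch in XML_WHITESPACE:
            pending_space = True
        else:
            if pending_space:
                out.append(" ")
                pending_space = False
            out.append(ch)
    if pending_space:
        out.append(" ")
    return "".join(out)

result = collapse_whitespace("the diacritic   \u030A may\n  appear")
print([f"U+{ord(c):04X}" for c in result[13:17]])
```

With this rule, three spaces before the combining character come out as
two (one logical space plus the base SPACE), while ordinary runs of
whitespace still collapse to a single space.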

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

