On Wednesday, August 06, 2003 1:59 AM, Curtis Clark <[EMAIL PROTECTED]> wrote:
> on 2003-08-05 15:31 Peter Kirk wrote:
>
> > Thank you, Mark. This helps to clarify things, but still doesn't
> > explicitly answer my question of how to encode "a sentence like "In
> > this language the diacritic ^ may appear above the letters ...",
> > but instead of ^ I want to use a combining character" and want to
> > display exactly one space before the combining character - do I
> > encode two spaces or one?
>
> In this language the diacritic ̊ may appear above the letters...
>
> Two spaces, at least in Thunderbird Mail.

The compatibility decompositions of spacing marks are already defined as a SPACE plus a non-spacing combining character. This officially documents the use of SPACE as a base character in combining sequences.

In the context of XML processing, where strings should (must?) be presented in NFC form, this extra SPACE will be invisible, hidden within the precomposed sequence, so this space does not have the line-breaking property. Breaking properties apply only to combining sequences, not to isolated encoded characters. It's illegal to break in the middle of a combining sequence. So as soon as a SPACE is followed by a combining character, it loses its breaking properties, as those properties are defined only for a combining sequence that contains nothing but a SPACE.

So I don't think there's any ambiguity: parsers and renderers must correctly identify combining sequences before applying any algorithm. This means that an algorithm like whitespace normalization in XML or HTML should not touch SPACEs that are used as base characters in a combining sequence, and so it should keep two spaces if the intent is to encode a logical space followed by a logical spacing diacritic. (This is not a problem for XML, which processes strings in their NFC form.)

-- 
Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
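
A minimal sketch, in Python, of the whitespace-normalization rule described above: collapse runs of SPACE, but never swallow a SPACE that serves as the base of a combining sequence. The names `is_mark` and `collapse_spaces` are hypothetical, and this is only an illustration under stated assumptions, not how any particular XML or HTML processor is implemented. It treats any character in a general category starting with "M" as a combining mark and looks only at U+0020. Note that in Python's `unicodedata` the SPACE-plus-mark mapping for a spacing mark such as U+02DA RING ABOVE appears under NFKD (the compatibility decomposition).

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # Assumption: general categories Mn, Mc and Me all count as combining marks.
    return unicodedata.category(ch).startswith("M")


def collapse_spaces(text: str) -> str:
    """Collapse runs of U+0020, keeping a SPACE that carries a combining mark."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != " ":
            out.append(text[i])
            i += 1
            continue
        # Find the end of this run of spaces.
        j = i
        while j < len(text) and text[j] == " ":
            j += 1
        followed_by_mark = j < len(text) and is_mark(text[j])
        if followed_by_mark:
            # The last space of the run is the base of the combining sequence.
            # Any spaces before it are ordinary whitespace: collapse them to one.
            if j - i > 1:
                out.append(" ")
            out.append(" ")
        else:
            out.append(" ")
        i = j
    return "".join(out)


# A spacing mark and its compatibility decomposition: SPACE + COMBINING RING ABOVE.
ring_decomposed = unicodedata.normalize("NFKD", "\u02da")

# Logical space + (SPACE, U+030A) sequence: both spaces survive normalization.
sample = "the diacritic  \u030a may appear"
normalized = collapse_spaces(sample)
```

With this rule, a sentence encoded with two spaces before the combining character keeps both (one word space, one base character), while an ordinary run such as `"a    b"` still collapses to `"a b"`.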