From: "Peter Kirk" <[EMAIL PROTECTED]>
> There is some potential for real trouble here, if one process outputs an
> NMTOKEN starting with a combining character preceded by a separating
> space, or something else which is changed into a space, and another
> process takes the new space plus combining character as a unit and so
> doesn't recognise the separation. Any hackers and virus programmers
> reading this will soon start flooding the Internet with tokens beginning
> with combining characters in the hope of crashing implementations or
> finding back doors. Of course this wouldn't have been a problem if
> Unicode had never defined space plus combining character as legal and
> meaningful. But this is not my problem!

I do agree: an XML document could require the use of a given attribute or element at some place. Suppose the attribute name follows the element name after a line break, which gets changed into a space during parsing. If XML parsers were forced to treat SPACE+combining as an unbreakable grapheme cluster acting like a letter, the effect would be to create a new element name, which may violate the identity of the element name. Now suppose that the attribute name contains a colon: you have created a custom namespace name, under which you can add any element you like, even if this was forbidden by the content model of the reference schema.

So this would invalidate existing documents, or create holes allowing the insertion of arbitrary XML content, unless the XML application validates element names (the pair namespace + name) extremely strictly and completely excludes from processing any unrecognized element (including all its content and attributes). This would be a breach in the content model, which may have been validated and tested for security in another layer of the document encoding process (notably when XML documents are created from templates, such as by XSL processors, or by custom C source using simple template substitution).

So for me the sequence SPACE+combining should not be acceptable as a valid grapheme cluster within element names or attribute names, and thus would need to be excluded from NMTOKEN. The correct way to do it is to consider it NOT A LETTER, but a symbol (Sk), exactly like other spacing diacritics, which are already invalid in NMTOKEN.

There still remains the unresolved question of grapheme clusters that could span the starting "<" or the ending ">" or "/>" of tags, or the leading "&" of an entity reference. For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML).

So there remains an unresolved conflict here: defective combining sequences cause security or validity problems in XML documents, and a non-defective SPACE+combining sequence also causes security problems. There is no secure choice for representing spacing diacritics that are not already encoded in a precomposed form...
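
To make the problem cases above concrete, here is a minimal Python sketch of a pre-parse check that flags combining marks sitting directly after whitespace, after a markup delimiter, or at the very start of the text. It is only an illustration under stated assumptions: the function name find_suspect_combining, the three labels, and the delimiter set are hypothetical choices of mine, not part of any XML parser or specification, and "combining mark" is approximated here as any character whose general category starts with "M".

```python
import unicodedata

# Assumption: only the delimiters named in the discussion are checked.
MARKUP_DELIMITERS = {"<", ">", "&"}


def is_combining(ch):
    """Approximate 'combining mark' as general category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def find_suspect_combining(text):
    """Return (index, label) pairs for combining marks in risky positions.

    Labels (hypothetical, for illustration only):
      "start-of-text"   - no preceding character at all
      "after-space"     - preceded by whitespace (space, tab, line break)
      "after-delimiter" - preceded by "<", ">" or "&"
    """
    findings = []
    for i, ch in enumerate(text):
        if not is_combining(ch):
            continue
        if i == 0:
            findings.append((i, "start-of-text"))
        elif text[i - 1].isspace():
            findings.append((i, "after-space"))
        elif text[i - 1] in MARKUP_DELIMITERS:
            findings.append((i, "after-delimiter"))
    return findings


if __name__ == "__main__":
    # An attribute name that starts with U+0301 after a line break.
    sample = "<a\n\u0301b:c='1'/>"
    for index, label in find_suspect_combining(sample):
        print(index, label, "U+%04X" % ord(sample[index]))
```

A process that generates XML from templates could run such a check on its output and reject the document on any finding. This only detects the sequences; it does not settle how a spacing diacritic ought to be represented instead.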