From: "Peter Kirk" <[EMAIL PROTECTED]>

> There is some potential for real trouble here, if one process outputs
an
> NMTOKEN starting with a combining character preceded by a separating
> space, or something else which is changed into a space, and another
> process takes the new space plus combining character as a unit and so
> doesn't recognise the separation. Any hackers and virus programmers
> reading this will soon start flooding the Internet with tokens
beginning
> with combining characters in the hope of crashing implementations or
> finding back doors. Of course this wouldn't have been a problem if
> Unicode had never  defined space plus combining character as legal and
> meaningful. But this is not my problem!

I do agree: a XML document could require the use at some place of a
given attribute or element. If this attribute name follows the element
name
after a line break, which gets changed into a space during parsing,
forcing
XML parsers to treat SPACE+combining as a unbreakable grapheme
cluster acting like a letter would have the effect of creating a new
element
name which may violate the lement name identity. Now suppose that the
attribute name contains a colon, you have created a custom namespace
name, under which you can add any element you like, even if this was
forbidden by the content-model of the reference schema.

So this would invalidate existing documents, or create holes allowing
insertion of arbitrary XML content, if the XML application is not
validating extremely strictly the element names (the pair namespace+
name) and exclude completely from processing any unrecognized
element (including all its content and attributes). This would be a
breach in the content model which may have been validated and tested
for security in another layer of the document encoding process (notably
when XML documents are created from templates, such as XSL
processors, or custom C source using simple template substitution).

So for me the sequence SPACE+combining should not be acceptable
as a valid grapheme cluster within element names or attribute names,
and thus would need to be excluded from NMTOKEN. The correct
way to do it is to consider it NOT A LETTER, but a symbol (Sk),
exactly like other spacing diacritics, which are already invalid in
NMTOKEN.

There still remains the unresolved question of grapheme clusters
that could span the starting "<" or ending ">" or "/>" of tags, or
the leading "&" of a entitity reference. For this reason, defective
combining sequences (combining characters without a leading base
character) should be forbidden (invalid for XML).

So there remains a unsolved conflict here: defective combining
sequences cause security or validity problems in XML documents,
and a non-defective SPACE+combining sequence cause also
security problems. There's no secure choice to represent
spacing diacritics which are not already encoded in a precomposed
form...


Reply via email to