> From: "Jon Hanna" <[EMAIL PROTECTED]>
> > Some of these only apply to elements that do not allow any
> > character data apart from whitespace to appear directly within them, and
> > hence are not an issue here. Some happen at relatively high level of
> > processing, e.g. rendering (not parsing) of HTML, and as such should
> > correctly process spaces combined with combining characters.
>
> Here I have to disagree: in XML, the normalization of whitespaces occurs
> during parsing before the DOM tree is built, and so the initial whitespaces
> are made inaccessible; rendering occurs only later based on the parsed
> DOM tree. This is to ensure the equivalence of the encoding under very
> strict conditions defined in the XML standard (and retrofitted now in the
> HTML standard to mimic the standard practices of HTML 4.01 in
> XHTML 1.0 (and now 1.1 with the XHTML modularization).

Lots of different things happen that affect the whitespace of an XML document (whether a DOM tree is constructed or not, since that isn't the only legal way to process an XML document). Of course rendering can do something further with whitespace after parsing. Rendering can do whatever the rendering engine wants to do; it isn't defined by XML.

When an application receives U+0020, U+0020, U+0302, U+0020 then it should probably (unless there are good application-specific reasons why not) treat that more or less the same as if it had received U+0020, U+005E, U+0020 (if there are minor glyph differences, fair enough). This isn't a matter of XML's whitespace rules, but it is a matter of how what we are discussing affects XML-based technology as a whole.

Further, it is completely true that some of the rules only affect elements that only allow element content.

> Strict conformance for the behavior of these whitespaces is mandatory and
> cannot be bypassed or negotiated,

Well, if a non-validating parser hasn't seen a declaration for an attribute of type NMTOKENS, it would treat it as being of type CDATA, which would alter how whitespace was treated. However, that is mostly correct; it just isn't a problem unless someone attempts to use the sequence {space, combining char} in a name or nmtoken, which as I said would be a pretty bizarre design decision anyway.

> notably when XML data needs to be certified against alteration, i.e.
> cryptographically signed. (XML signature is now standardized), or when
> the DOM tree is used and altered in a predictable way with technologies
> like XPath which needs to refer to exact encoding position in the encoded
> Unicode NFC form of text elements, attribute values, or CDATA sections.

Yep, Yep, Yep. Still doesn't mean there is any problem.
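
To make the NMTOKENS versus CDATA point above a little more concrete, here is a rough Python sketch of the last step of attribute-value normalization as I read XML 1.0 section 3.3.3. The function name normalize_attribute and the sample value are made up for illustration, and the sketch assumes references have already been expanded and literal tabs and line breaks already turned into U+0020. It does not check that the resulting tokens are legal nmtokens.

```python
import unicodedata


def normalize_attribute(value: str, declared_type: str = "CDATA") -> str:
    """Sketch of the final attribute-value normalization step.

    Hypothetical helper, not a parser API. Assumes references are
    already expanded and tab/CR/LF already replaced by U+0020.
    """
    if declared_type == "CDATA":
        # CDATA (also the fallback when no declaration was read):
        # the value is left exactly as it is.
        return value
    # Any other declared type: drop leading and trailing U+0020 and
    # collapse each run of U+0020 to a single U+0020.
    return " ".join(token for token in value.split(" ") if token)


# "a", two spaces, COMBINING CIRCUMFLEX ACCENT, one space, "b"
raw = "a  \u0302 b"

as_cdata = normalize_attribute(raw)                 # no declaration seen
as_nmtokens = normalize_attribute(raw, "NMTOKENS")  # declaration seen

# U+0302 is a combining mark with no canonical composition onto a
# space, so NFC leaves the raw sequence untouched.
is_combining = unicodedata.combining("\u0302") != 0
nfc_unchanged = unicodedata.normalize("NFC", raw) == raw
```

With the NMTOKENS declaration the two spaces before U+0302 collapse to one, so the value differs from what a parser that never saw the declaration would report. The combining character itself is never touched in either case, which is why the only awkward spot is a name or nmtoken that contains such a sequence.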