Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially.
What about '<' with a combining acute, which doesn't have a precomposed form? Is it a broken opening tag or a valid text character?
Also a broken opening tag. HTML/XML documents are NOT plain-text documents: they must first be parsed as HTML/XML, and only then are the many text sections they contain (in text elements, element names, attribute names, attribute values, etc.) parsed as plain text, under the restrictions specified in the HTML or XML specifications (which restrict, for example, which characters are allowed in names).
The XML/HTML core syntax is defined by the fixed behavior of some individual characters like '&', '<', and quotation marks, and by special behavior for spaces. This core structure is not plain text and cannot be overridden, even by Unicode grapheme clusters.
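To make this concrete, here is a small sketch (Python chosen arbitrarily for illustration) showing that a standard escaping utility rewrites the '<' code point on its own, regardless of the combining mark that follows; the combining acute simply ends up as ordinary text data right after the '&lt;' reference:

    from xml.sax.saxutils import escape

    # '<' immediately followed by U+0301 COMBINING ACUTE ACCENT
    s = "a<\u0301b"

    # escape() works code point by code point: '<' becomes '&lt;' and the
    # combining mark stays behind as plain character data.
    print(escape(s))  # prints 'a&lt;' + U+0301 + 'b'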
Note that HTML/XML do NOT mandate the use, or even the support, of Unicode: they only require a character repertoire that contains some required characters, plus the acceptance of at least the ISO/IEC 10646 repertoire under some conditions. The encoding to code points itself is only required for numeric character references, and even those are more symbolic (in a way similar to other named character entities in SGML) than absolute: they do not imply required support of the whole repertoire under a single coded character set.
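As a small illustration of this (using Python's stock expat-based DOM parser, my arbitrary choice of tool): the document below is declared as ISO-8859-1, a charset with no code for U+20AC, yet the numeric character reference still resolves to that ISO/IEC 10646 code point, because the reference is symbolic and independent of the document's own encoding:

    from xml.dom import minidom

    # Declared encoding is ISO-8859-1, which cannot encode U+20AC directly,
    # yet the numeric character reference resolves to that 10646 code point.
    raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><r>&#x20AC;</r>'
    doc = minidom.parseString(raw)
    print(repr(doc.documentElement.firstChild.data))  # '\u20ac'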
So you can just as well create fully conforming HTML or XML documents using a character set that includes characters not even defined in Unicode/ISO/IEC 10646, or characters defined only symbolically, with just a name. Whether or not that name maps to one or more Unicode characters does not change the validity of the document itself.
And XML/HTML behavior ignores almost all Unicode properties, including the normalization properties: XML and HTML treat distinct strings as completely distinct even when they are canonically equivalent. This is an important feature for cases like XML Signatures, where normalization of documents must not be applied blindly, as it would break the data signature.
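For instance (a plain Python sketch): two strings that are canonically equivalent under Unicode still compare as distinct code point sequences, and that code-point-level comparison is exactly what an XML processor uses for names and values:

    import unicodedata

    decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
    precomposed = "\u00e9"   # U+00E9, the precomposed 'é'

    # XML/HTML compare code point sequences, so these are distinct...
    print(decomposed == precomposed)                                # False
    # ...even though Unicode normalization makes them equal.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True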
If you want to normalize XML documents, you should not do it with a normalizer working on the whole document as if it were plain text. Instead you must normalize the individual strings in the XML Infoset, as accessible when browsing the nodes of its DOM tree, and then serialize the normalized tree to create a new document (using CDATA sections and/or character references if needed to escape syntactic characters reserved by XML that appear in the string data of DOM tree nodes).
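A minimal sketch of that approach, assuming Python's xml.dom.minidom as the DOM implementation and NFC as the chosen normalization form (both are my assumptions, nothing mandated by XML):

    import unicodedata
    from xml.dom import minidom

    def normalize_node(node):
        # Normalize only the string data held by the Infoset: text/CDATA
        # content and attribute values, never the serialized document.
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            node.data = unicodedata.normalize("NFC", node.data)
        elif node.nodeType == node.ELEMENT_NODE and node.attributes is not None:
            for i in range(node.attributes.length):
                attr = node.attributes.item(i)
                attr.value = unicodedata.normalize("NFC", attr.value)
        for child in node.childNodes:
            normalize_node(child)

    doc = minidom.parseString('<r a="e\u0301">caf\u00e9 &amp; e\u0301</r>')
    normalize_node(doc.documentElement)
    # The serializer re-escapes the reserved characters ('&', '<', quotes).
    print(doc.toxml())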
Note also that an XML document containing references to Unicode noncharacters would still be well-formed (except for U+FFFE and U+FFFF, which the XML Char production excludes outright), because these characters may be part of a non-Unicode charset.
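For example (again with Python's expat-based parser as an arbitrary test bed), a reference to U+FDD0, a Unicode noncharacter, is accepted because it falls inside XML's Char production:

    from xml.dom import minidom

    # U+FDD0 is a Unicode noncharacter, but it is a legal XML Char,
    # so this document is well-formed.
    doc = minidom.parseString("<r>&#xFDD0;</r>")
    print(hex(ord(doc.documentElement.firstChild.data)))  # 0xfdd0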
XML document validation is a separate and optional problem from XML parsing, which only checks well-formedness and builds a DOM tree: validation is performed by matching the DOM tree against a schema definition, a DTD or an XSD, in which additional restrictions on allowed characters may be checked, or in which additional symbolic-only "characters" may be defined and used in the XML document with parsable named entities similar to "&gt;".
(An example: the schema may contain a definition for a "character" representing a private company logo, mapped to a symbolic name, and the XML document can then contain references to it; the DTD may also define an encoding for it in a private charset, so that the XML document uses that code directly. The Apple logo in Macintosh charsets is such a case: an internal mapping to Unicode PUAs is not sufficient to allow correct processing of multiple XML documents, because the PUAs used in each XML document have no cross-document equivalence, so converting such documents to Unicode with these PUAs is a lossy conversion, not suitable for XML data processing.)
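A sketch of that idea, with a hypothetical 'logo' entity defined in the internal DTD subset and mapped, for illustration only, to the PUA code point U+F8FF (the Apple logo's slot in Mac charsets); as said above, such a mapping is only meaningful within this one document:

    from xml.dom import minidom

    # Hypothetical symbolic "character": a named entity in the internal
    # DTD subset, mapped here to a PUA code point that carries no
    # cross-document meaning.
    raw = ('<?xml version="1.0"?>'
           '<!DOCTYPE r [ <!ENTITY logo "&#xF8FF;"> ]>'
           '<r>&logo;</r>')
    doc = minidom.parseString(raw)
    print(repr(doc.documentElement.firstChild.data))  # '\uf8ff'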

