Re: Nicest UTF

Philippe Verdy Fri, 10 Dec 2004 18:31:55 -0800

From: "John Cowan" <[EMAIL PROTECTED]>

Marcin 'Qrczak' Kowalczyk scripsit:

http://www.w3.org/TR/2000/REC-xml-20001006#charsets
implies that the appropriate level for parsing XML is code points.


You are reading the XML Recommendation incorrectly.  It is not defined
in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
characters.  XML processors are required to process UTF-8 and UTF-16,
and may process other character encodings or not.  But the internal
model is that of characters.  Thus surrogate code points are not
allowed.

I have a different reading, because the "character" in XML is not the same as the "character" in Unicode. For XML, U+10FFFF is a valid character (its use is explicitly not recommended, but it is perfectly valid); for Unicode it's a non-character... For XML, U+0001 is *sometimes* a valid character, sometimes not.
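To make the mismatch concrete, here is a minimal sketch in Python that puts the Char productions of XML 1.0 and XML 1.1 next to the Unicode list of noncharacters. The function names are made up for this illustration. The ranges are the ones given by the Char productions and by the Unicode definition of noncharacters (U+FDD0..U+FDEF, plus the last two code points of every plane). XML 1.1 widening the range down to U+0001 is one way in which that code point is only "sometimes" acceptable.

```python
# Hypothetical helper names; the ranges follow the published Char productions.

def is_xml10_char(cp: int) -> bool:
    """XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_unicode_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0x0, 0x1, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFF):
        print(
            f"U+{cp:04X}",
            "xml1.0" if is_xml10_char(cp) else "-",
            "xml1.1" if is_xml11_char(cp) else "-",
            "nonchar" if is_unicode_noncharacter(cp) else "-",
        )
```

The two notions overlap only partly: U+10FFFF and U+FDD0 pass both XML productions while being Unicode noncharacters, whereas U+FFFE is excluded by both.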

And I disagree with you that U+0000 can't be used in XML documents. It can be used in a URI through the URI escaping mechanism, as explicitly indicated in the XML specification...

And the fact that the various character productions, which are normally normative, have been changed so often, sometimes through errata that were forgotten in the text of the next edition of the standard and then reintroduced in a later erratum, shows that these productions are less reliable than the descriptive *definitions*, which ARE normative in XML...

The only thing on which I can agree is that XML forbids surrogates, U+FFFE and U+FFFF, but I wouldn't say that an XML parser that does not reject NULs, other non-characters or "disallowed" C0 controls is all that buggy. I do think that these restrictions are a defect of XML...

But all this is also proof that XML documents are definitely NOT plain-text documents, so you can't apply Unicode encoding rules at the level of the encoded XML document, only at the level of the finest plain-text nodes (these are the levels that the productions in the XML standard are trying, with more or less success, to standardize).

As a consequence, any process that blindly applies a plain-text normalization to a complete XML document is bogus, because it breaks the most basic XML conformance, i.e. the core document structure...
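A small Python sketch of this failure, using only the standard library. The sample document and the helper names are invented for the illustration. It relies on the fact that U+0338 (COMBINING LONG SOLIDUS OVERLAY) placed right after ">" composes with it under NFC into U+226F (NOT GREATER-THAN), so normalizing the serialized document swallows the tag delimiter, while normalizing each text node separately leaves the markup alone.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def normalize_text_nodes(root: ET.Element, form: str = "NFC") -> None:
    """Normalize each text node on its own, never touching the markup."""
    for elem in root.iter():
        if elem.text:
            elem.text = unicodedata.normalize(form, elem.text)
        if elem.tail:
            elem.tail = unicodedata.normalize(form, elem.tail)


# Illustrative document: the text node starts with a combining mark.
doc = "<a>\u0338</a>"

# Blind normalization of the whole serialized document:
# '>' + U+0338 composes into U+226F, so the start tag is never closed.
blind = unicodedata.normalize("NFC", doc)

# Node-level normalization: parse first, then normalize text nodes only.
root = ET.fromstring(doc)
normalize_text_nodes(root)

if __name__ == "__main__":
    print(is_well_formed(doc), is_well_formed(blind))
    print(ascii(blind), ascii(root.text))
```

This only shows the structural damage for one code point pair; a real process would also have to decide what to do with attribute values, comments and the like, which the sketch does not address.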
