-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
>
> > > Documents not in UTF-* are normalized by definition, unless it is
> > > *impossible* to convert them to normalized Unicode (typically
> > > because they contain characters not yet present in Unicode).
> [...]
> Simply saying that a document is "normalized by definition" if it is
> *possible* to convert it to Unicode would ignore reality, since
> converters may not *actually* convert it to normalized Unicode. One
> would have to have the additional requirement in the Character Model,
> that any XML parser that converts an XML document from a legacy
> character set into Unicode is not conformant unless it is (actually)
> normalizing.

That requirement is already in the Character Model:

<http://www.w3.org/TR/2002/WD-charmod-20020220/>
# 4.2.2 Include-normalized Text
[...]
# Text data is include-normalized if:
#
#  1. the data is Unicode-normalized and does not contain any character
#     escapes or includes whose expansion would cause the data to become
#     no longer Unicode-normalized; or
#
#  2. the data is in a legacy encoding and, if it were transcoded to a
#     Unicode encoding form by a normalizing transcoder, the resulting
#     data would satisfy clause 1 above.
#
# NOTE: A consequence of this definition is that legacy text (i.e. text
# in a legacy encoding) is always include-normalized unless i) a
# normalizing transcoder cannot exist for that encoding (e.g. because
# the repertoire contains characters not in Unicode) or ii) the text
# contains escapes or includes which, once expanded, result in
# un-normalized text.
[...]
# 4.2.3 Fully Normalized Text
[...]
# Text data is fully normalized if it is include-normalized and none of
# the spans composing the text begin with a non-starter character.
#
# In the remainder of this specification, normalized is used to mean
# 'fully normalized', unless otherwise indicated.
[...]
# 4.3 Responsibility for Normalization
[...]
# [C] All text content on the Web MUST be in include-normalized form and
# SHOULD be in fully normalized form.
#
# [S] Specifications of text-based formats and protocols MUST, as part of
# their syntax definition, require that the text be in normalized form.
[...]
# [I] Implementations which transcode text data from a legacy encoding
# to a Unicode encoding form MUST use a normalizing transcoder.

I don't think that implicitly redefining 'normalized' as 'fully normalized'
in most of the document is a good idea - it should be spelt out explicitly.

Also, 'fully normalized' doesn't appear to be defined correctly for legacy
charsets; it should be defined like this:

  1. the data is Unicode-normalized, does not contain any character
     escapes or includes whose expansion would cause the data to become
     no longer Unicode-normalized, and none of the spans composing the
     text begin with a non-starter character; or

  2. the data is in a legacy encoding and, if it were transcoded to a
     Unicode encoding form by a normalizing transcoder, the resulting
     data would satisfy clause 1 above.

I'll have to submit some comments about this.

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPHMdxzkCAxeYt5gVAQFl6Af+KJDkFbihALZ5KI9AXTVxxJvI5kwZjaT3
M3iiWQoo1eLoRSbjkLJdC0odr3XIxS4FRlrqL842ZwyRM6iRizUyoRqa0LWLzcjv
SOCVywFxuHRR723IPgePjrgNIKSbLRTjVt3m20mHTjncN9MdOV28EiBi1IVcr92h
TKzp/UkEkS7lyzUYV+dIV6X8WflE2ej/Wwpkshyu8pFOtP5mTPqYg2aZw5JX4oSK
Rx0CMmtRek3mxNZ/vVHOM3VZVGhxS5LjH8okwtInFcQ6MJBPXKbt7Zw/sKVnbbMc
2BNxI+cmIikti6sUgy34MJscygLRXYSxNb/t0Q7NuAbMRNwsG5QkWw==
=56c2
-----END PGP SIGNATURE-----
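
The two-clause definition of 'fully normalized' proposed in the message above can be sketched in Python roughly as follows. This is only an illustration under several assumptions that are not in the message: 'Unicode-normalized' is taken to mean NFC, a 'non-starter' is taken to be any character with a non-zero canonical combining class, a text is modelled as a plain list of spans, and character escapes and includes are ignored entirely. The function names are hypothetical, and windows-1258 (Python codec "cp1258") is used only because it is a legacy encoding that contains combining marks.

```python
import unicodedata


def normalizing_transcode(data: bytes, encoding: str) -> str:
    """Decode legacy-encoded bytes and normalize the result to NFC."""
    return unicodedata.normalize("NFC", data.decode(encoding))


def is_unicode_normalized(text: str) -> bool:
    """True if the text is already in NFC (assumed meaning of 'Unicode-normalized')."""
    return unicodedata.normalize("NFC", text) == text


def begins_with_non_starter(text: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def is_fully_normalized(spans, encoding=None) -> bool:
    """Apply the proposed two-clause definition to a list of spans.

    With encoding=None the spans are str and clause 1 is checked directly.
    Otherwise the spans are bytes in that legacy encoding and are first
    passed through a normalizing transcoder (clause 2).
    Escapes and includes are not modelled in this sketch.
    """
    for span in spans:
        text = span if encoding is None else normalizing_transcode(span, encoding)
        if not is_unicode_normalized(text) or begins_with_non_starter(text):
            return False
    return True


# "e" followed by COMBINING ACUTE ACCENT, encoded in windows-1258.
legacy = "e\u0301".encode("cp1258")

# The normalizing transcoder composes the pair into U+00E9.
print(normalizing_transcode(legacy, "cp1258") == "\u00e9")

# Legacy span that transcodes cleanly and starts with a starter.
print(is_fully_normalized([legacy], "cp1258"))

# Legacy span whose first character is a combining mark.
print(is_fully_normalized(["\u0301x".encode("cp1258")], "cp1258"))
```

In this sketch the last call returns False even though the transcoder is normalizing, because the span still begins with a combining mark after transcoding. That is the case the non-starter condition in clause 1 is meant to catch for legacy data.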