W3C Character Model (was Re: Unicode Search Engines)

David Hopwood Thu, 21 Feb 2002 16:53:58 -0800

-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
> 
> > > Documents not in UTF-* are normalized by definition, unless it is
> > > *impossible* to convert them to normalized Unicode (typically
> > > because they contain characters not yet present in Unicode).
> 
[...]
> Simply saying that a document is "normalized by definition" if it is
> *possible* to convert it to Unicode would ignore reality, since
> converters may not *actually* convert it to normalized Unicode. One
> would have to have the additional requirement in the Character Model,
> that any XML parser that converts an XML document from a legacy
> character set into Unicode is not conformant unless it is (actually)
> normalizing.


That requirement is already in the Character Model:

<http://www.w3.org/TR/2002/WD-charmod-20020220/>

# 4.2.2 Include-normalized Text
[...]
# Text data is include-normalized if:
#
# 1. the data is Unicode-normalized and does not contain any character
#    escapes or includes whose expansion would cause the data to become
#    no longer Unicode-normalized; or
#
# 2. the data is in a legacy encoding and, if it were transcoded to a
#    Unicode encoding form by a normalizing transcoder, the resulting
#    data would satisfy clause 1 above.
#
# NOTE: A consequence of this definition is that legacy text (i.e. text
# in a legacy encoding) is always include-normalized unless i) a
# normalizing transcoder cannot exist for that encoding (e.g. because
# the repertoire contains characters not in Unicode) or ii) the text
# contains escapes or includes which, once expanded, result in
# un-normalized text.
[...]
# 4.2.3 Fully Normalized Text
[...]
# Text data is fully normalized if it is include-normalized and none of
# the spans composing the text begin with a non-starter character.
#
# In the remainder of this specification, normalized is used to mean
# 'fully normalized', unless otherwise indicated.
[...]
# 4.3 Responsibility for Normalization
[...]
# [C] All text content on the Web MUST be in include-normalized form and
# SHOULD be in fully normalized form.
#
# [S] Specifications of text-based formats and protocols MUST, as part of
# their syntax definition, require that the text be in normalized form.
[...]
# [I] Implementations which transcode text data from a legacy encoding
# to a Unicode encoding form MUST use a normalizing transcoder.

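As a rough illustration of the difference between a plain converter and the
"normalizing transcoder" that requirement [I] asks for, here is a minimal
Python sketch. It assumes that "Unicode-normalized" means Normalization
Form C, and the function names are hypothetical (they are not taken from the
Character Model or from any XML parser). windows-1258 is used only because
it is a legacy encoding that contains combining marks.

```python
import unicodedata


def plain_transcode(data: bytes, encoding: str) -> str:
    """Convert legacy bytes to Unicode with no normalization step."""
    return data.decode(encoding)


def normalizing_transcode(data: bytes, encoding: str) -> str:
    """Convert legacy bytes to Unicode, then normalize the result.

    Assumption: the target form is NFC.
    """
    return unicodedata.normalize("NFC", data.decode(encoding))


# In windows-1258, byte 0xCC is U+0300 COMBINING GRAVE ACCENT.
legacy = b"a\xcc"
plain = plain_transcode(legacy, "cp1258")             # 'a' + U+0300
normalized = normalizing_transcode(legacy, "cp1258")  # U+00E0
```

The plain converter yields two code points that are not in NFC, which is the
"reality" Mark describes; the normalizing one yields the single precomposed
character.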

I don't think that implicitly redefining 'normalized' to mean 'fully
normalized' in most of the document is a good idea - it should be spelt out
explicitly. Also, 'fully normalized' doesn't appear to be defined correctly
for legacy charsets; it should be defined in the same two-clause form as
include-normalized text, like this:

  1. the data is Unicode-normalized, does not contain any character
     escapes or includes whose expansion would cause the data to become
     no longer Unicode-normalized, and none of the spans composing the
     text begin with a non-starter character; or

  2. the data is in a legacy encoding and, if it were transcoded to a
     Unicode encoding form by a normalizing transcoder, the resulting
     data would satisfy clause 1 above.
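
The two clauses above can be sketched as a check in Python. This is only an
approximation under stated assumptions: "Unicode-normalized" is taken to
mean NFC, a "non-starter" is taken to be a character with a non-zero
canonical combining class, the whole text is treated as a single span, and
escapes and includes are ignored. All function names are hypothetical.

```python
import unicodedata


def begins_with_non_starter(text: str) -> bool:
    # Assumption: non-starter == non-zero canonical combining class.
    return bool(text) and unicodedata.combining(text[0]) != 0


def fully_normalized_unicode(text: str) -> bool:
    """Clause 1, with escapes and includes left out of scope."""
    is_nfc = unicodedata.normalize("NFC", text) == text
    return is_nfc and not begins_with_non_starter(text)


def fully_normalized_legacy(data: bytes, encoding: str) -> bool:
    """Clause 2: transcode with a normalizing transcoder, then apply clause 1."""
    transcoded = unicodedata.normalize("NFC", data.decode(encoding))
    return fully_normalized_unicode(transcoded)


# windows-1258 byte 0xCC is U+0300 COMBINING GRAVE ACCENT.
ok_latin1 = fully_normalized_legacy(b"caf\xe9", "latin-1")
ok_cp1258 = fully_normalized_legacy(b"a\xcc", "cp1258")
bad_cp1258 = fully_normalized_legacy(b"\xcca", "cp1258")
```

Because the transcoder in clause 2 is normalizing, legacy text can only fail
this check through the non-starter condition (or through escapes and
includes, which the sketch does not model). That is the case the last
example exercises: the text begins with a combining mark.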

I'll have to submit some comments about this.

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPHMdxzkCAxeYt5gVAQFl6Af+KJDkFbihALZ5KI9AXTVxxJvI5kwZjaT3
M3iiWQoo1eLoRSbjkLJdC0odr3XIxS4FRlrqL842ZwyRM6iRizUyoRqa0LWLzcjv
SOCVywFxuHRR723IPgePjrgNIKSbLRTjVt3m20mHTjncN9MdOV28EiBi1IVcr92h
TKzp/UkEkS7lyzUYV+dIV6X8WflE2ej/Wwpkshyu8pFOtP5mTPqYg2aZw5JX4oSK
Rx0CMmtRek3mxNZ/vVHOM3VZVGhxS5LjH8okwtInFcQ6MJBPXKbt7Zw/sKVnbbMc
2BNxI+cmIikti6sUgy34MJscygLRXYSxNb/t0Q7NuAbMRNwsG5QkWw==
=56c2
-----END PGP SIGNATURE-----
