----- Original Message ----- 
From: "Alexandre Arcouteil" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, November 14, 2003 10:41 AM
Subject: compatibility characters (in XML context)


> This is a beginner question :
>
> In the XML 1.1 Proposed Recommendation 05 November 2003
> (http://www.w3.org/TR/xml11), it is said that "Document authors are
> encouraged to avoid "compatibility characters", as defined in section
> 6.8 of [Unicode]" so relating to Unicode 2.0.
>
> I don't see any online documentation about explicit definition of
> "compatibility characters" according to 2.0.

Compatibility characters can be defined as the characters whose canonical
decomposition mapping is either:

    (1) a singleton (for example the Angström sign, canonically mapped to A
with ring above, or the CJK compatibility ideographs, only included for
compatibility with legacy charsets or because of assignment errors in
Unicode 1.0), which implicitly prevents them from being recomposed in all
NF* forms, or

    (2) a two-code _canonical_ decomposition mapping that is excluded from
canonical composition (for example the Hebrew letter shin with shin dot).

These characters will never be part of any string in a normalized form (NFC,
NFD, NFKC, NFKD).
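
As a rough illustration of the definition above, here is a minimal Python
sketch using the standard unicodedata module. The helper names are mine, not
taken from any Unicode document, and the answers depend on the Unicode version
shipped with the Python interpreter:

```python
import unicodedata


def has_canonical_decomposition(ch):
    """True if ch has a decomposition mapping without a <tag> prefix."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>";
    # canonical mappings are bare hex code points.
    return bool(mapping) and not mapping.startswith("<")


def is_compatibility_character(ch):
    """Approximate the definition above (hypothetical helper).

    A character qualifies if it has a canonical decomposition and
    NFC normalization does not give the character back, i.e. it is
    a singleton or is excluded from canonical composition.
    """
    if not has_canonical_decomposition(ch):
        return False
    return unicodedata.normalize("NFC", ch) != ch


if __name__ == "__main__":
    for ch in ("\u212b", "\ufb2a", "\u00c9", "\u0153"):
        print("U+%04X" % ord(ch), is_compatibility_character(ch))
```

With this check, U+212B (Angstrom sign) and U+FB2A (Hebrew letter shin with
shin dot) qualify, while a precomposed letter such as U+00C9 does not, because
NFC keeps it unchanged.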

> At least I'd like to know if characters like "Ã" "Ã" or "œ" are
> concerned.

No: "Ã" and "Ã" have canonical decompositions, but are not excluded from
recomposition.
And the "oe ligature" has no canonical decomposition, and so is not
a compatibility character.

> Is somewhere a complete chart of "compatibility characters" ?


Look at the Unicode data file that lists composition exclusions
(CompositionExclusions.txt in the Unicode Character Database)...
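
If you want to turn that file into a list, something like the following
Python sketch may help. It assumes the usual Unicode data file layout (one
hexadecimal code point or a "XXXX..YYYY" range per line, with "#" starting a
comment); the function name and the sample text are only illustrative, so
check them against the real file before relying on this:

```python
def parse_exclusions(text):
    """Collect the code points listed in composition-exclusion style text.

    Assumed layout: optional hex code point or range, then an optional
    "#" comment. Blank and comment-only lines are skipped.
    """
    excluded = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        first, _, last = data.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        excluded.update(range(start, end + 1))
    return excluded


# Illustrative sample in the assumed layout, not a copy of the real file.
SAMPLE = """\
# comment-only lines are ignored
0958    #  DEVANAGARI LETTER QA
FB2A    #  HEBREW LETTER SHIN WITH SHIN DOT
"""

if __name__ == "__main__":
    for cp in sorted(parse_exclusions(SAMPLE)):
        print("U+%04X" % cp)
```

To use it on the real data, read your local copy of the file into a string
and pass that string to parse_exclusions.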

