Stefan Persson asked:

> Alexandre Arcouteil wrote:
> > Is that a clear indication that \u212B is actually a compatibility
> > character and then should be, according to the XML 1.1 recommendation,
> > replaced by the \u00C5 character?
>
> Isn't U+00C5 a compatibility character for U+0041 U+030A,
> so that both should be replaced by that?
O.k., everybody, turn to p. 24 of The Unicode Standard, Version 4.0, Figure 2-8 Codespace and Encoded Characters. It is time to go to Unicode School(tm).

There are 3 *abstract characters*:

  an uppercase A of the Latin script
  an uppercase Å of the Latin script
  a diacritic ring placed above letters in the Latin script

These are potentially encodable units of textual information, derived from the orthographic universe associated with Latin script usage. They can be "found" in the world as abstractions on the basis of graphological analysis, and they exist, from the point of view of character encoding committees, a priori. They are concepts of character identity, and they don't have numbers associated with them.

Next, character encoding committees get involved, because they want numbers associated with abstract characters, so that computers can process them as text.

The Unicode architects noticed (they weren't the first) a generality in the Latin script regarding the productive placement of diacritics to create new letters. They determined that a sufficient encoding for these 3 abstract characters would be:

  U+0041 LATIN CAPITAL LETTER A
  U+030A COMBINING RING ABOVE

with the abstract character {an uppercase Å of the Latin script} representable as a sequence of encoded characters, i.e. as <U+0041, U+030A>.

But, oh ho!, they also noticed the preexistence of important character encoding standards created by other character encoding committees that represented the first two of these abstract characters as:

  0x41 LATIN CAPITAL LETTER A
  0xC5 LATIN CAPITAL LETTER A WITH RING ABOVE

and which declined to encode the third abstract character, i.e. the diacritic ring itself.

Enter Unicode Design Principles #9 Equivalent Sequences and #10 Convertibility. To get off the ground at all, the Unicode Standard simply *had* to have 1-to-1 convertibility with ISO 8859-1, as well as a large number of other standards. As a result, the UTC added the following encoded character:

  U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

and decreed that U+00C5 was *canonically equivalent* to the sequence <U+0041, U+030A>, thus asserting no difference in the interpretation of U+00C5 and of <U+0041, U+030A>.

Now how does this relate to *compatibility* characters? Well, yes, in a sense, U+00C5 is a compatibility character. It was encoded for compatibility with ISO/IEC 8859-1 (and Code Page 850, and a large number of other preexisting encoding standards and code pages). It is generally recognized as a "good" compatibility character, since it is highly useful in practice and in a sense fits within the general Unicode model for how things should be done. (This differs, for example, from the "bad" compatibility characters like U+FDC1 ARABIC LIGATURE FEH WITH MEEM WITH YEH FINAL FORM.)

However, U+00C5 is not a compatibility decomposable character (or "compatibility composite" -- see definitions on p. 23 of TUS 4.0). It is, instead, a *canonical* decomposable character. (See pp. 71-72 of TUS 4.0.)

Well, what about the Ångstrom sign, you may ask, since I haven't mentioned it yet? The Ångstrom sign is simply a use of the abstract character {an uppercase Å of the Latin script}, much like "g" is a gram sign and "s" is a seconds sign, and "m" is a meter sign (as well as being a sign for the prefix milli-). However, there were character encoding standards committees, predating the UTC, which did not understand this principle, and which encoded a character for the Ångstrom sign as a separate symbol.
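If you want to check this against the data files rather than against the book, the decomposition mappings from UnicodeData.txt are exposed by, for example, Python's unicodedata module. The following is just an illustrative sketch, assuming a Python 3 interpreter; a canonical decomposition shows up as an untagged field, whereas a compatibility decomposition would carry a tag such as <compat>:

    import unicodedata

    # The decomposition field for U+00C5 has no <tag>, i.e. it is
    # a canonical decomposition, not a compatibility one.
    unicodedata.decomposition('\u00C5')       # '0041 030A'

    # NFD replaces the precomposed character by its canonical
    # decomposition; NFKD gives exactly the same result, because
    # U+00C5 is not a compatibility decomposable character.
    unicodedata.normalize('NFD', '\u00C5')    # 'A' + U+030A
    unicodedata.normalize('NFKD', '\u00C5')   # 'A' + U+030A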
In most cases this would not be a problem, but in at least one East Asian encoding, an Ångstrom sign was encoded separately from {an uppercase Å of the Latin script}, resulting in two encodings for what really is the same thing, from a character encoding perspective.

Once again, the Unicode principles of Equivalent Sequences and Convertibility came into play. The UTC encoded U+212B ANGSTROM SIGN and decreed that U+212B was *canonically equivalent* to the sequence <U+0041, U+030A>, thus asserting no difference in the interpretation of U+212B (and incidentally, also, U+00C5) and of <U+0041, U+030A>.

Unlike U+00C5, however, U+212B is a "bad" compatibility character -- one that the UTC would have wished away if it could have. The sign of that badness is that its decomposition mapping in the UnicodeData.txt file is a *singleton* mapping, i.e. a mapping of a single code point to another single code point, instead of to a sequence, i.e. U+212B --> U+00C5. Such singleton mappings are effectively an admission of duplication of character encoding. They are present *only* because of a roundtrip convertibility issue.

To sum up so far:

U+00C5
  is a "good" compatibility character
  is a canonical decomposable character
  is *not* a compatibility decomposable character
  is canonically equivalent to <U+0041, U+030A>
  does not have a singleton decomposition mapping

U+212B
  is a "bad" compatibility character
  is a canonical decomposable character
  is *not* a compatibility decomposable character
  is canonically equivalent to <U+0041, U+030A>
  does have a singleton decomposition mapping

Now back to the second clause of Stefan's question:

> Isn't U+00C5 a compatibility character for U+0041 U+030A,
> so that both should be replaced by that?

What gets replaced by what depends on the specification of normalization. (See UAX #15.)

For NFD:

  U+00C5 and U+212B are replaced by <U+0041, U+030A>.
  <U+0041, U+030A> stays unchanged.

For NFC:

  U+212B and <U+0041, U+030A> are replaced by U+00C5.
  U+00C5 stays unchanged.

Normalization is basically completely agnostic about what is a "compatibility character", and whether precomposed forms should be used or not. One form (NFC) normalizes towards precomposed forms; one form (NFD) normalizes away from precomposed forms, essentially.

Note that there are also piles of "compatibility characters" in Unicode which have no decomposition mapping whatsoever, and which thus are completely unimpacted by normalization. Some examples:

  U+2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
    (for compatibility with GBK)
  U+FE73 ARABIC TAIL FRAGMENT
    (for compatibility with some old IBM Arabic code pages)
  the whole block of box drawing characters, U+2500..U+257F
    (for compatibility with numerous old code pages)

and so on.
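For completeness, here is the same kind of check for U+212B and for the normalization behavior tabulated above (again just a sketch, assuming Python 3 and its unicodedata module):

    import unicodedata

    # Singleton canonical mapping from UnicodeData.txt:
    # U+212B decomposes to the single code point U+00C5.
    unicodedata.decomposition('\u212B')        # '00C5'

    # NFD: U+00C5 and U+212B both go to <U+0041, U+030A>.
    unicodedata.normalize('NFD', '\u00C5')     # 'A' + U+030A
    unicodedata.normalize('NFD', '\u212B')     # 'A' + U+030A

    # NFC: U+212B and <U+0041, U+030A> both go to U+00C5.
    # A singleton decomposition is never recomposed, so NFC
    # never produces U+212B.
    unicodedata.normalize('NFC', '\u212B')     # U+00C5
    unicodedata.normalize('NFC', 'A\u030A')    # U+00C5

--Ken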