Stefan Persson asked:

> Alexandre Arcouteil wrote:
> > Is that a clear indication that \u212B is actually a compatibility
> > character and then should be, according to the XML 1.1 recommendation,
> > replaced by the \u00C5 character?
>
> Isn't U+00C5 a compatibility character for U+0041 U+030A,
> so that both should be replaced by that?
O.k., everybody, turn to p. 24 of The Unicode Standard, Version 4.0, Figure 2-8 Codespace and Encoded Characters. It is time to go to Unicode School(tm).

There are 3 *abstract characters*:

  an uppercase A of the Latin script
  an uppercase Å of the Latin script
  a diacritic ring placed above letters in the Latin script

These are potentially encodable units of textual information, derived from the orthographic universe associated with Latin script usage. They can be "found" in the world as abstractions on the basis of graphological analysis, and they exist, from the point of view of character encoding committees, a priori. They are concepts of character identity, and they don't have numbers associated with them.

Next, character encoding committees get involved, because they want numbers associated with abstract characters, so that computers can process them as text.

The Unicode architects noticed (they weren't the first) a generality in the Latin script regarding the productive placement of diacritics to create new letters. They determined that a sufficient encoding for these 3 abstract characters would be:

  U+0041 LATIN CAPITAL LETTER A
  U+030A COMBINING RING ABOVE

with the abstract character {an uppercase Å of the Latin script} representable as a sequence of encoded characters, i.e. as <U+0041, U+030A>.

But, oh ho!, they also noticed the preexistence of important character encoding standards created by other character encoding committees that represented the first two of these abstract characters as:

  0x41 LATIN CAPITAL LETTER A
  0xC5 LATIN CAPITAL LETTER A WITH RING ABOVE

and which declined to encode the third abstract character, i.e. the diacritic ring itself.

Enter Unicode Design Principles #9 Equivalent Sequences and #10 Convertibility. To get off the ground at all, the Unicode Standard simply *had* to have 1-to-1 convertibility with ISO 8859-1, as well as a large number of other standards. As a result, the UTC added the following encoded character:

  U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

and decreed that U+00C5 was *canonically equivalent* to the sequence <U+0041, U+030A>, thus asserting no difference in the interpretation of U+00C5 and of <U+0041, U+030A>.

Now how does this relate to *compatibility* characters? Well, yes, in a sense, U+00C5 is a compatibility character. It was encoded for compatibility with ISO/IEC 8859-1 (and Code Page 850, and a large number of other preexisting encoding standards and code pages). It is generally recognized as a "good" compatibility character, since it is highly useful in practice and in a sense fits within the general Unicode model for how things should be done. (This differs, for example, from the "bad" compatibility characters like U+FDC1 ARABIC LIGATURE FEH WITH MEEM WITH YEH FINAL FORM.)

However, U+00C5 is not a compatibility decomposable character (or "compatibility composite" -- see definitions on p. 23 of TUS 4.0). It is, instead, a *canonical* decomposable character. (See pp. 71-72 of TUS 4.0.)

Well, what about the Ångstrom sign, you may ask, since I haven't mentioned it yet? The Ångstrom sign is simply a use of the abstract character {an uppercase Å of the Latin script}, much like "g" is a gram sign and "s" is a seconds sign, and "m" is a meter sign (as well as being a sign for the prefix milli-). However, there were character encoding standards committees, predating the UTC, which did not understand this principle, and which encoded a character for the Ångstrom sign as a separate symbol.
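If you want to check this against the data files rather than against the book, the decomposition mappings from UnicodeData.txt are exposed by, for example, Python's unicodedata module. The following is just an illustrative sketch, assuming a Python 3 interpreter; a canonical decomposition shows up as an untagged field, whereas a compatibility decomposition would carry a tag such as <compat>:

    import unicodedata

    # The decomposition field for U+00C5 has no <tag>, i.e. it is
    # a canonical decomposition, not a compatibility one.
    unicodedata.decomposition('\u00C5')       # '0041 030A'

    # NFD replaces the precomposed character by its canonical
    # decomposition; NFKD gives exactly the same result, because
    # U+00C5 is not a compatibility decomposable character.
    unicodedata.normalize('NFD', '\u00C5')    # 'A' + U+030A
    unicodedata.normalize('NFKD', '\u00C5')   # 'A' + U+030A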
In most cases this would not be a problem, but in at least one East Asian encoding, an Ångstrom sign was encoded separately from {an uppercase Å of the Latin script}, resulting in two encodings for what really is the same thing, from a character encoding perspective.

Once again, the Unicode principles of Equivalent Sequences and Convertibility came into play. The UTC encoded U+212B ANGSTROM SIGN and decreed that U+212B was *canonically equivalent* to the sequence <U+0041, U+030A>, thus asserting no difference in the interpretation of U+212B (and incidentally, also, U+00C5) and of <U+0041, U+030A>.

Unlike U+00C5, however, U+212B is a "bad" compatibility character -- one that the UTC would have wished away if it could have. The sign of that badness is that its decomposition mapping in the UnicodeData.txt file is a *singleton* mapping, i.e. a mapping of a single code point to another single code point, instead of to a sequence, i.e. U+212B --> U+00C5. Such singleton mappings are effectively an admission of duplication of character encoding. They are present *only* because of a roundtrip convertibility issue.

To sum up so far:

U+00C5
  is a "good" compatibility character
  is a canonical decomposable character
  is *not* a compatibility decomposable character
  is canonically equivalent to <U+0041, U+030A>
  does not have a singleton decomposition mapping

U+212B
  is a "bad" compatibility character
  is a canonical decomposable character
  is *not* a compatibility decomposable character
  is canonically equivalent to <U+0041, U+030A>
  does have a singleton decomposition mapping

Now back to the second clause of Stefan's question:

> Isn't U+00C5 a compatibility character for U+0041 U+030A,
> so that both should be replaced by that?

What gets replaced by what depends on the specification of normalization. (See UAX #15.)

For NFD:

  U+00C5 and U+212B are replaced by <U+0041, U+030A>.
  <U+0041, U+030A> stays unchanged.

For NFC:

  U+212B and <U+0041, U+030A> are replaced by U+00C5.
  U+00C5 stays unchanged.

Normalization is basically completely agnostic about what is a "compatibility character", and whether precomposed forms should be used or not. One form (NFC) normalizes towards precomposed forms; one form (NFD) normalizes away from precomposed forms, essentially.

Note that there are also piles of "compatibility characters" in Unicode which have no decomposition mapping whatsoever, and which thus are completely unimpacted by normalization. Some examples:

  U+2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
    (for compatibility with GBK)
  U+FE73 ARABIC TAIL FRAGMENT
    (for compatibility with some old IBM Arabic code pages)
  the whole block of box drawing characters, U+2500..U+257F
    (for compatibility with numerous old code pages)

and so on.
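For completeness, here is the same kind of check for U+212B and for the normalization behavior tabulated above (again just a sketch, assuming Python 3 and its unicodedata module):

    import unicodedata

    # Singleton canonical mapping from UnicodeData.txt:
    # U+212B decomposes to the single code point U+00C5.
    unicodedata.decomposition('\u212B')        # '00C5'

    # NFD: U+00C5 and U+212B both go to <U+0041, U+030A>.
    unicodedata.normalize('NFD', '\u00C5')     # 'A' + U+030A
    unicodedata.normalize('NFD', '\u212B')     # 'A' + U+030A

    # NFC: U+212B and <U+0041, U+030A> both go to U+00C5.
    # A singleton decomposition is never recomposed, so NFC
    # never produces U+212B.
    unicodedata.normalize('NFC', '\u212B')     # U+00C5
    unicodedata.normalize('NFC', 'A\u030A')    # U+00C5

--Ken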