Juliusz wrote:

> Spacing diacritical marks (e.g. U+00A8) have compatibility
> decompositions of the form 0020 xxxx. Why are these not canonical
> decompositions?

Unicode legacy, and the involvement of the SPACE.

This stuff goes back to Unicode 1.1, which attempted to codify the
Unicode recommendation for how to display accents in isolation (by
applying them to SPACE characters) by putting "[0020]&[0301]", etc.
decompositions in the character list.

In Unicode 2.0, the UTC had to decide whether these were canonical or
compatibility decompositions, and decided on compatibility, basically
for two reasons:

  1. the spacing diacritics were considered compatibility characters
     (even the ASCII ones), and

  2. application of a non-spacing accent onto a SPACE wasn't
     guaranteed to produce exactly the same rendition as use of an
     atomic spacing form, particularly for the ASCII spacing accents,
     so it seemed inadvisable to assert that these were canonical
     equivalences.

> Under what circumstances would you expect the spacing
> marks to behave differently from their decompositions?

Well, for example, when used as, or confused with, modifier letters.

> The two that are in ASCII don't decompose. Is that because they're
> overloaded?

John Cowan provided the answer here: normalization considerations.

> A number of combining characters (e.g. U+0340, U+0341, U+0343) have
> canonical equivalents, i.e. canonical decompositions that are a single
> character. In other words, we have pairs of codepoints that are bound
> to behave in exactly the same manner under all circumstances. What's
> the deal?

Assertions by the UTC that these are effectively duplicate characters.

U+0340 and U+0341 were mistakes made when Vietnamese was being
considered for encoding.

U+0343 is a result of the Greek committee asserting that Koronis and
Psili are distinct (which is how they got into 10646), but the UTC
assertion is that they are the same *character*, although with
slightly different functions (as for different functions of other
accent marks and diacritics in Latin).

"Over the sign of the blending a small hook, korooni's, is placed. It
looks like a spiritus lenis, but is not one, because the latter can
stand only at the beginning of a word. The koronis merely indicates
that a crasis has taken place." [translated from the German]

"62. Crasis (kraasis _mingling_) is the contraction of a vowel or
diphthong at the end of a word with a vowel or diphthong beginning the
following word. Over the syllable resulting from contraction is placed
a {{apostrophe glyph}} called coroonis (hook), as ... [examples
follow]" -- Smyth.

So you can get a psili (smooth breathing, spiritus lenis) at the
beginning of a vowel-initial Greek word in polytoniko, but a koronis
in the middle of a word, over the vowel where crasis has occurred. In
any case, the hook itself appears identical, and there is no character
encoding reason to distinguish them. Hence the canonical decomposition
of U+0343.

> Unicode contains a number of precomposed spacing diacritical marks for
> Greek (e.g. U+1FC1). However, and unless I've missed something, with
> the exception of U+0385, they do not have combining (non-spacing)
> versions. What's the rationale here?

For the composite ones, combining versions exist for each of the
pieces. There is no good reason to invent composite combining marks
involving two accents together. (In fact, there are good reasons *not*
to do so.) The few that exist, e.g. U+0344, cause implementation
problems and are discouraged from use.

> (Similar precomposed diacritical marks do not seem to exist for
> Vietnamese, which makes me think they've been included for
> compatibility with legacy encodings rather than for a good reason.

The Greek precomposed spacing accents came in as a lump from the Greek
national body ELOT for polytoniko Greek, and had to be accepted into
Unicode as part of the merger compromise with 10646.

> Still, because their decompositions are not canonical, they need to be
> taken into account, which in my case complicates what would otherwise
> be somewhat cleaner code.)

True enough.
The UTC disliked them from the beginning, but had no choice in the
matter.

> When rendering stacked combining characters (i.e. sequences of
> combining characters with the same non-zero combining class), which
> sequences need to be treated specially (as opposed to being stacked on
> top of each other)? I already know about the pairs needed for Greek
> (both Mono- and Polytonic) and Vietnamese.

I don't know of any other regular, language-specific exceptions. But
you can expect to run into typographically based exceptional behavior
occasionally, whenever an orthography requires diacritics to be
stacked above or below.

> As far as I can tell, there is nothing in the Unicode database that
> relates a ``modifier letter'' to the associated punctuation mark. Is
> that right?

Correct. They are viewed as distinct classes.

> Does anyone have such data that I could steal?
> (Hopefully with no legal strings attached.)
>
> On a related note, does anyone have a map from mathematical characters
> to the Geometric Shapes, Misc. symbols and Dingbats that would be
> useful for rendering?

As opposed to the characters themselves? I'm not sure what you are
getting at here. An example would perhaps help.

--Ken
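
P.S. For anyone who wants to check the decomposition types discussed
above, here is a minimal sketch. It assumes Python and its standard
unicodedata module, which is my choice for illustration and not
something from this thread. classify() is a hypothetical helper name.
The sketch reports whether a character's decomposition mapping is
canonical, compatibility, or absent; the results reflect whatever
version of the Unicode database your Python ships with.

```python
import unicodedata


def classify(ch):
    """Return 'canonical', 'compatibility', or 'none' for one character.

    classify() is a hypothetical helper, not part of unicodedata.
    """
    decomp = unicodedata.decomposition(ch)  # e.g. '<compat> 0020 0308'
    if not decomp:
        return "none"
    # Compatibility mappings carry a leading <tag>; canonical ones do not.
    return "compatibility" if decomp.startswith("<") else "canonical"


# Spacing diaeresis, the two ASCII spacing accents, and the combining
# marks mentioned in the thread.
for cp in (0x00A8, 0x005E, 0x0060, 0x0340, 0x0341, 0x0343, 0x0344):
    ch = chr(cp)
    print(f"U+{cp:04X} {classify(ch):13} {unicodedata.decomposition(ch)}")
```

The practical consequence shows up in normalization: NFD applies only
canonical mappings, so it leaves U+00A8 alone but replaces U+0343 with
U+0313, whereas NFKD also expands U+00A8 to SPACE followed by U+0308.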