Re: A few questions about decomposition, equivalence and rendering

Kenneth Whistler Tue, 05 Feb 2002 12:42:12 -0800

Juliusz wrote:

> Spacing diacritical marks (e.g. U+00A8) have compatibility
> decompositions of the form 0020 xxxx.  Why are these not canonical
> decompositions?


Unicode legacy, and the involvement of the SPACE. This stuff goes
back to Unicode 1.1, which attempted to codify the Unicode
recommendation for how to display accents in isolation (by applying
them to SPACE characters) by putting "[0020]&[0301]", etc. decompositions
in the character list. In Unicode 2.0, the UTC had to make a decision
as to whether these were canonical or compatibility decompositions,
and decided on compatibility basically for two reasons: 1. the
spacing diacritics were considered compatibility characters (even
the ASCII ones), and 2. application of a non-spacing accent onto
a SPACE wasn't guaranteed to produce exactly the same rendition
as use of an atomic non-spacing form, particularly for the ASCII
spacing accents, so it seemed inadvisable to assert that these were
canonical equivalences.
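
As an illustration of the distinction (a minimal sketch using Python's
standard unicodedata module, which is an assumption of this example and
not something discussed in the thread), the compatibility tag on U+00A8
means that NFD leaves the character alone while NFKD replaces it with
SPACE plus the combining diaeresis:

```python
import unicodedata

diaeresis = "\u00A8"  # DIAERESIS, a spacing diacritic

# A leading <compat> tag marks a compatibility (not canonical) decomposition.
decomp = unicodedata.decomposition(diaeresis)

# Canonical normalization (NFD) ignores compatibility decompositions;
# compatibility normalization (NFKD) applies them.
nfd = unicodedata.normalize("NFD", diaeresis)
nfkd = unicodedata.normalize("NFKD", diaeresis)

print(decomp)
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
```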

> Under what circumstances would you expect the spacing
> marks to behave differently from their decompositions?

Well, for example, when used as, or confused with, modifier
letters.

> 
> The two that are in ASCII don't decompose.  Is that because they're
> overloaded?

John Cowan provided the answer here: normalization considerations.

> 
> A number of combining characters (e.g. U+0340, U+0341, U+0343) have
> canonical equivalents, i.e. canonical decompositions that are a single
> character.  In other words, we have pairs of codepoints that are bound
> to behave in exactly the same manner under all circumstances.  What's
> the deal?

Assertions by the UTC that these are effectively duplicate characters.
U+0340 and U+0341 were mistakes made when Vietnamese was being
considered for encoding. U+0343 is a result of the Greek committee
asserting that Koronis and Psili are distinct (which is how they got
into 10646), but the UTC assertion is that they are the same *character*,
although with slightly different functions (as for different functions
of other accent marks and diacritics in Latin).

"Over the sign of the mixing a small hook, korooni's, is placed.
It looks like a spiritus lenis, but is not one, since the latter
can stand only at the beginning of a word. The koronis merely indicates
that a crasis has taken place."

"62. Crasis (kraasis _mingling_) is the contraction of a vowel or
diphthong at the end of a word with a vowel or diphthong beginning the
following word. Over the syllable resulting from contraction is
placed a {{apostrophe glyph}} called coroonis (hook), as ... [examples
follow]" -- Smyth.

So you can get a psili (smooth breathing, spiritus lenis) at the
beginning of a vowel-initial Greek word in polytoniko, but a
koronis in the middle of a word, over the vowel where crasis has
occurred. In any case, the hook itself appears identical, and there
is no character encoding reason to distinguish them. Hence
the canonical decomposition of U+0343.
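
A small sketch of what these singleton canonical decompositions mean in
practice, again assuming Python's unicodedata module purely for
illustration: both NFD and NFC replace each duplicate with its
equivalent, so the duplicates never survive normalization.

```python
import unicodedata

# Duplicate combining marks and the characters they are canonically
# equivalent to.
singletons = {
    "\u0340": "\u0300",  # COMBINING GRAVE TONE MARK -> COMBINING GRAVE ACCENT
    "\u0341": "\u0301",  # COMBINING ACUTE TONE MARK -> COMBINING ACUTE ACCENT
    "\u0343": "\u0313",  # COMBINING GREEK KORONIS   -> COMBINING COMMA ABOVE
}

results = {}
for source, target in singletons.items():
    nfd = unicodedata.normalize("NFD", source)
    nfc = unicodedata.normalize("NFC", source)
    # Singletons are never recomposed, so NFC yields the target as well.
    results[source] = (nfd == target, nfc == target)
    print(f"U+{ord(source):04X} -> U+{ord(nfd):04X}")
```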

> Unicode contains a number of precomposed spacing diacritical marks for
> Greek (e.g. U+1FC1).  However, and unless I've missed something, with
> the exception of U+0385, they do not have combining (non-spacing)
> versions.  What's the rationale here?

They have combining versions of each of the pieces, for the
composite ones. There is no good reason to invent composite
combining marks involving two accents together. (In fact, there
are good reasons *not* to do so.) The few that exist, e.g. U+0344,
cause implementation problems and are discouraged from use.
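
To make the point about "pieces" concrete, here is a hedged sketch
(Python's unicodedata module is assumed only for the example). It shows
that U+0344 falls apart into two ordinary combining marks and is not
recomposed by NFC, and that the spacing U+1FC1 reduces under NFKD to a
SPACE followed by separately encoded combining marks:

```python
import unicodedata

def codepoints(text):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in text]

# The composite combining mark decomposes into diaeresis + acute ...
dialytika_tonos = "\u0344"
nfd_0344 = unicodedata.normalize("NFD", dialytika_tonos)
# ... and NFC leaves the two-mark sequence as it is.
nfc_0344 = unicodedata.normalize("NFC", dialytika_tonos)

# The spacing composite U+1FC1 (GREEK DIALYTIKA AND PERISPOMENI).
nfkd_1fc1 = unicodedata.normalize("NFKD", "\u1FC1")

print(codepoints(nfd_0344))
print(codepoints(nfc_0344))
print(codepoints(nfkd_1fc1))
```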

> 
> (Similar precomposed diacritical marks do not seem to exist for
> Vietnamese, which makes me think they've been included for
> compatibility with legacy encodings rather than for a good reason.

The Greek precomposed spacing accents came in as a lump from
the Greek national body ELOT for polytoniko Greek, and had to
be accepted into Unicode as part of the merger compromise with
10646.

> Still, because their decompositions are not canonical, they need to be
> taken into account, which in my case complicates what would otherwise
> be somewhat cleaner code.)

True enough. The UTC disliked them from the beginning, but had no
choice in the matter.

> When rendering stacked combining characters (i.e. sequences of
> combining characters with the same non-zero combining class), which
> sequences need to be treated specially (as opposed to being stacked on
> top of each other)?  I already know about the pairs needed for Greek
> (both Mono- and Polytonic) and Vietnamese.

I don't know of any other regular, language-specific exceptions.
But you can expect to occasionally run into typographically-based
exceptional behavior whenever an orthography results in a requirement
to stack diacritics top or bottom.
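
For locating the sequences in question, a rough sketch follows. The
helper name stacked_runs is hypothetical, and Python's unicodedata
module is assumed only for illustration. It collects runs of adjacent
combining marks that share one non-zero combining class; deciding which
runs need special placement is still up to the renderer.

```python
import unicodedata

def stacked_runs(text):
    """Return runs of 2+ adjacent marks sharing a non-zero combining class."""
    runs, current, current_ccc = [], [], 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and ccc == current_ccc:
            current.append(ch)          # same class: extends the stack
        else:
            if len(current) > 1:
                runs.append(current)    # close the previous run
            current = [ch] if ccc != 0 else []
            current_ccc = ccc
    if len(current) > 1:
        runs.append(current)
    return runs

# circumflex + acute (both class 230) form a run;
# dot below (220) + circumflex (230) do not.
sample = "a\u0302\u0301 e\u0323\u0302"
runs = stacked_runs(sample)
print([[f"U+{ord(c):04X}" for c in run] for run in runs])
```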

> 
> As far as I can tell, there is nothing in the Unicode database that
> relates a ``modifier letter'' to the associated punctuation mark.  Is
> that right? 

Correct. They are viewed as distinct classes.

> Does anyone have such data that I could steal?
> (Hopefully with no legal strings attached.)
> 

> On a related note, does anyone have a map from mathematical characters
> to the Geometric Shapes, Misc. symbols and Dingbats that would be
> useful for rendering?

As opposed to the characters themselves? I'm not sure what you
are getting at here. An example would perhaps help.

--Ken
