Re: Character identities

Jim Allan Thu, 31 Oct 2002 20:36:09 -0800

In Unicode, code point U+0308 is assigned to COMBINING DIAERESIS. There are also a number of precomposed forms with diaeresis.
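As a small illustration of the relation between a precomposed form and the combining character, here is a sketch using Python's standard unicodedata module. The variable names are only for illustration; the point is that ü (U+00FC) and u followed by U+0308 are canonically equivalent.

```python
import unicodedata

precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS

# Character name as recorded in the Unicode Character Database.
diaeresis_name = unicodedata.name("\u0308")

# Canonical decomposition (NFD) splits the precomposed letter;
# canonical composition (NFC) puts it back together.
splits_to_pair = unicodedata.normalize("NFD", precomposed) == decomposed
joins_to_single = unicodedata.normalize("NFC", decomposed) == precomposed

print(diaeresis_name, splits_to_pair, joins_to_single)
```

Nothing in either form records which of the uses listed below is intended.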

Let's take one of these, ü:

The diaeresis may mean separate pronunciation of the u, indicating that it is not merged with a preceding or following letter but is pronounced distinctly, as in the classical Greek name Peirithoüs or Spanish antigüedad. Similarly in Catalan. It is identified with the Greek dialytika of the same meaning, which is indeed the ultimate known origin of the symbol.
The diaeresis indicates umlaut modification of u, as in German über, a use also found in Finnish, Turkish, Pinyin Chinese Romanization and in many other languages.
In Magyar it indicates a sound like French u.
In IPA it indicates u with a centralized pronunciation.

There may be other phonetic interpretations.

Of these uses, only for the second (and possibly the third) might combining superscript e be used instead of the diaeresis. The second certainly represents the most common use of ü today, but not the only one.

Unicode encodes the character COMBINING DIAERESIS, not a generic UMLAUT MARKER which might take various forms. In itself it provides no way of distinguishing between uses of the diaeresis.

All the above uses might occur in German text, or Swedish text, or Finnish text or any text which might introduce personal names or geographical names or particular words or phrases from various languages outside the main language of the text. The same applies for ä and ö.

Indeed, individual words with a vowel and umlaut marker, whether represented as a COMBINING DIAERESIS, a COMBINING LATIN SMALL LETTER E, or a following e, may appear in text in any language because of the use of technical vocabulary, e.g. Senhnsücht, or in personal or place names.

Now any use of diaeresis meaning umlaut in any language might, it seems to me, be reasonably replaced by superscript e meaning umlaut. But it is incorrect to replace diaeresis used for any other purpose by superscript e.

In straight, plain Unicode, if you want to use diaeresis for umlaut, use diaeresis. If you want to use combining superscript e to indicate umlaut, use COMBINING LATIN SMALL LETTER E. Leave any other occurrences of diaeresis alone. This is the only possibility at the plain text level, and the most robust way of choosing between diaeresis and superscript e at any level.
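A minimal sketch of such a plain-text interchange, in Python. The function name umlaut_to_superscript_e and the restriction to the base letters a, o, and u are my own assumptions for illustration. The function cannot tell umlaut from any other use of the diaeresis, so the user must apply it only to words where the diaeresis really does mean umlaut.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
COMBINING_SMALL_E = "\u0364"   # COMBINING LATIN SMALL LETTER E
UMLAUT_BASES = set("aouAOU")   # assumption: only these bases take umlaut

def umlaut_to_superscript_e(word):
    """Replace diaeresis directly after a, o, or u with combining small e.

    Hypothetical helper: call it only on words the user has already
    identified as using the diaeresis for umlaut.
    """
    decomposed = unicodedata.normalize("NFD", word)
    out = []
    for i, ch in enumerate(decomposed):
        if ch == COMBINING_DIAERESIS and i > 0 and decomposed[i - 1] in UMLAUT_BASES:
            out.append(COMBINING_SMALL_E)
        else:
            out.append(ch)
    # Recompose whatever else was decomposed along the way.
    return unicodedata.normalize("NFC", "".join(out))

german = umlaut_to_superscript_e("\u00fcber")   # über, umlaut: convert
spanish = "antig\u00fcedad"                     # antigüedad, not umlaut: leave alone
print(german, spanish)
```

The decision about which words to convert stays with the user; the code only carries it out.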

Given a higher protocol, we can do more. We might, as suggested, have a font which uses superscript e instead of diaeresis, at least for the combination characters with the base characters a, o, or u and in place of the diaeresis symbol itself. If we have another generally identical font with a true diaeresis instead, we can switch between fonts as necessary depending on whether diaeresis is used for umlaut or not, or whether in particular cases we wish to use one or the other symbol for umlaut.

Switching between such alternate fonts has long been a standby when fancy typography is required.

Yet I don't see that there is any advantage to switching between fonts over switching between the Unicode characters COMBINING DIAERESIS and COMBINING LATIN SMALL LETTER E. And it makes us dependent on a particular set of fonts. That is probably not good. :-(

A better solution might be an intelligent font that recognizes some kinds of tagging and which allows us to turn on different glyphs for diaeresis according to the tagging, one of these glyphs being a superscript e. So we tag words and phrases. And, magically, if that particular font works properly, we see diaeresis where we want diaeresis and superscript e where we want superscript e.

But it is not evident that tagging for this purpose is any easier than entering the different Unicode characters from the beginning. And we are again dependent on the intelligence of a particular font. Of course, we might expect that there will soon be many such intelligent fonts. It is less likely that they will all work exactly the same, and understand exactly the same tags in the same way. And we are restricted to such intelligent fonts as understand a particular system of tagging rather than using almost any font. :-(

We might propose introducing a tag or indicator of some kind at some level to indicate a diaeresis has umlaut function, but such a tag or indicator would probably only be used when a user wanted to use a superscript e, in which case it is not clear that using it would have any advantage over actually entering COMBINING LATIN SMALL LETTER E. :-(

We might go to a still higher level of protocol, to a routine or plugin in an application or a new style feature added to HTML or XML which allows diaeresis replacement. Just as Microsoft Word and some other programs now allow capitalization and small capitalization as an effect, though the underlying text is still actually in upper and lower case, so we might show a diaeresis as a superscript e, though in fact at the plain text level the text has a diaeresis. Presumably for viewing and printing the application would substitute Unicode COMBINING LATIN SMALL LETTER E without actually changing the underlying text.
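To make the idea concrete, here is a rough Python sketch of such a display-time effect. The StyledRun class and its umlaut_as_e flag are hypothetical and stand in for whatever style attribute an application or markup language might provide. The stored text keeps its diaeresis; only the rendered string uses COMBINING LATIN SMALL LETTER E.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class StyledRun:
    """Hypothetical run of text carrying a display-only style flag."""
    text: str
    umlaut_as_e: bool = False

def render(run):
    """Return the string to display; the stored run.text is never modified."""
    if not run.umlaut_as_e:
        return run.text
    decomposed = unicodedata.normalize("NFD", run.text)
    # Every diaeresis in a flagged run is shown as combining small e.
    shown = decomposed.replace("\u0308", "\u0364")
    return unicodedata.normalize("NFC", shown)

run = StyledRun("M\u00fcller", umlaut_as_e=True)   # Müller
displayed = render(run)
print(displayed, run.text)
```

As with small capitals in a word processor, the flag belongs to the run, so the user still has to decide which runs contain umlauts.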

We might eventually be able to translate between applications globally.

Yet ....

Is it not simpler and easier and far more robust that search engines begin to recognize a weak equivalence between COMBINING LATIN SMALL LETTER E and diaeresis, and that text processing applications, particularly ones intended for use with German, allow easy user-controlled interchange of diaeresis and superscript e at the Unicode plain text level without particular font dependencies? :-)

The user might not even know the characters are represented by different code points.
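One way a search tool might implement the weak equivalence is to fold both marks to the same character when building a comparison key. This is only a sketch in Python; the name search_key and the added case folding are my own assumptions, not a description of any existing search engine.

```python
import unicodedata

def search_key(text):
    """Build a comparison key treating combining small e as diaeresis.

    Hypothetical folding, applied only to the key used for matching;
    the text itself is left as entered.
    """
    decomposed = unicodedata.normalize("NFD", text)
    folded = decomposed.replace("\u0364", "\u0308")
    return unicodedata.normalize("NFC", folded).casefold()

# u + COMBINING LATIN SMALL LETTER E matches precomposed ü ...
same = search_key("u\u0364ber") == search_key("\u00dcber")
# ... but a plain u without any mark does not.
different = search_key("uber") != search_key("\u00fcber")
print(same, different)
```

The equivalence is weak in the sense that it applies only to matching; the text keeps whichever code point the writer chose.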

The diaeresis is less universally a version of a superscript letter e than the cedilla is a version of the letter z, but one would probably not want any normal font to replace ç with z topped by a superscript c. The cedilla has long lost its unity with z.

Similarly one would not normally want a font to replace å by aa or th by þ, or a font for French that replaces the circumflex accent with COMBINING LATIN SMALL LETTER S, though such substitutions might also be considered as stylistic from some points of view. A font is the wrong level to make such substitutions robustly.

Again, should IPA symbols be replaced by the corresponding characters in Americanist phonetic usage by a font? This could quite reasonably be argued to be only a stylistic change. The characters mean the same, after all.

But Unicode generally encodes characters not glyphs; and encodes characters, not their meanings.

Jim Allan
