Re: Character identities

2002-11-01 Thread Doug Ewell

William Overington WOverington at ngo dot globalnet dot co dot uk
wrote:

 Would it be possible to define the U+FE00 variant sequence for a with
 two dots above it to be a with an e above it, and similarly U+FE00
 variant sequences for o with two dots above it and for u with two dots
 above it, and possibly for e with two dots above it as well?

It would be possible for the Unicode Technical Committee to define such
a standardized variant, though they have not elected to do so.  It would
*not* be possible for end users such as you or me to do so.

-Doug Ewell
 Fullerton, California

RE: Character identities

2002-10-31 Thread Kent Karlsson


Let me take a few comparable examples:

1. Some (I think font makers) a few years ago argued
   that the Lithuanian i-dot-circumflex was just a
   glyph variant (Lithuanian specific) of i-circumflex,
   and a few other similar characters.

   Still, the Unicode standard now does not regard those as
   glyph variants (anymore, if it ever did), and embodies
   that the Lithuanian i-dot-circumflex is a different
   character in its casing rules (see SpecialCasing.txt).
   There are special rules for inserting (when lowercasing)
   or removing (when uppercasing) dot-aboves on i-s and I-s
   for Lithuanian.  I can only conclude that it would be
   wrong even for a Lithuanian specific font to display an
   i-circumflex character as an i-dot-circumflex glyph,
   even though an i-circumflex glyph is never used for
   Lithuanian.

2. The Khmer script got allocated a KHMER SIGN BEYYAL.
   It stands (stood...) for any abbreviation of the
   Khmer equivalent of "etc."; there are at least four
   different abbreviations, much like etc, etc., &c,
   et c., ... It would be up to the font maker to decide
   exactly which abbreviation, and it would vary by font.

   However, it is now targeted for deprecation for precisely
   that reason: it is *not* the font (maker) that should
   decide which abbreviation convention to use in a document,
   it is the *author* of the document who should decide.
   Just as for the Latin script, the author decides how to
   abbreviate et cetera. The way of abbreviating should stay
   the same *regardless of font*. Note that the font may be
   chosen at a much later time, and not for wanting to
   change abbreviation convention. That convention one
   may want to have the same throughout a document also
   when using several different fonts in it, not having to
   carefully consider abbreviation conventions when choosing
   fonts.

3. Marco would even allow (by default; I cannot get away
   from that caveat since some (not all) font technologies
   do what they do) displaying the ROMAN NUMERAL ONE THOUSAND
   C D (U+2180) as an M, and it would be up to the font
   designer. While the glyphs are informative, this glyphic
   substitution definitely goes too far.  If the author
   chose to use U+2180, a glyph having at least some
   similarity to the sample glyph should be shown, unless
   and until someone makes a (permanent or transient)
   explicit character change.

4. Some people write è instead of é (I claim they cannot
   spell...).  So is it up to a font designer to display
   é as è if the font is made for a context where many
   people do not make a distinction?  Can a correctly
   spelled name (say) be turned into an apparent misspelling
   by just choosing such a font?  And that would be a Unicode
   font?

5. I can't leave the ö vs. ø; these are just different
   ways of writing the same letter; and it is not
   the case that ø is used instead of ö for any 
   7-bit reasons. It is conventional to use ø for ö
   in Norway and Denmark for any Swedish name (or
   word) containing it.  The same goes for ä vs. æ.
   Why shouldn't this one be up to the font makers too?
   If the font is made purely for Norwegian, why not
   display ö as ø, as is the convention?  This is
   *exactly* the same situation as with ä vs. a^e.

I say, let the *author* decide in all these cases, and
let that decision stand, *regardless of font changes*.
[There is an implicit qualification there, but I'm
tired of writing it.]
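
To make the Lithuanian case in example 1 concrete, here is a minimal
sketch in Python. The helper lower_lt and its table are hypothetical
names of mine; the three mappings are the precomposed capitals that
SpecialCasing.txt lists for "lt", and the contextual rules for plain
I and J are left out. Python's own str.lower() is not language
sensitive, so it serves as the default case.

```python
# Hypothetical table: Lithuanian lowercase forms keep an explicit dot above.
LT_LOWER = {
    "\u00CC": "i\u0307\u0300",  # I WITH GRAVE -> i, dot above, grave
    "\u00CD": "i\u0307\u0301",  # I WITH ACUTE -> i, dot above, acute
    "\u0128": "i\u0307\u0303",  # I WITH TILDE -> i, dot above, tilde
}

def lower_lt(text):
    """Lowercase text, keeping the dot on accented i as Lithuanian does."""
    return "".join(LT_LOWER.get(ch, ch.lower()) for ch in text)

default = "\u0128".lower()       # language-independent lowercasing
lithuanian = lower_lt("\u0128")  # Lithuanian-style lowercasing

# The two results are different character sequences, not glyph variants.
print([f"U+{ord(c):04X}" for c in default])
print([f"U+{ord(c):04X}" for c in lithuanian])
```

The difference lives in the characters themselves, which is why I say
a font should not paper over it.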


 Kent Karlsson wrote:
   I insist that you can talk about character-to-character 
   mappings only when
   the so-called backing store is affected in some way.
  
  No, why?  It is perfectly permissible to do the equivalent
  of print(to_upper(mystring)) without changing the backing
  store (mystring in the pseudocode); to_upper here would
  return a NEW string without changing the argument.
 
 And that, conceptually, is a character-to-glyph mapping.

Now I have lost you.  How can it be that?  The print
part, yes. But not the to_upper part; that is a
character-to-character mapping, inserted between the
backing store and mapping characters to glyphs.
It is still an (apparent) character-to-character
mapping even if it is not stored in the backing store.
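
The pseudocode above can be written out as a small Python sketch.
The names to_upper and mystring come from the pseudocode; the sample
string is only an illustration.

```python
def to_upper(s):
    """Character-to-character mapping: returns a NEW string."""
    return s.upper()

mystring = "G\u00F6the"      # the backing store
shown = to_upper(mystring)   # mapped copy, handed on to rendering

print(shown)     # what gets mapped to glyphs
print(mystring)  # the backing store, unchanged
```

The mapping happens before any glyph is chosen, and the stored text
is still the original afterwards.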

 In my mind, you are so much into the OpenType architecture, 
 and so much used
 to the concept that glyphization is what a font does, that 
 you can't view the big picture.

Now I have lost you again.  Some fonts (in some font
technologies) do more than pure glyphization. This
is why I have been putting in caveats, since many people
seem to think that all fonts *only* do glyphisation,
which is not the case.

But to be general I was referring to such mappings regardless
of if that is built into some font (using character code points
or, as in OT/AAT, using glyph indices) or (better) were external
to the font.

I was trying to use general formulations, but I cannot
avoid having caveats for certain mappings that certain
technologies do.

[OT] Göthe (was: Re: RE: Character identities)

2002-10-31 Thread Doug Ewell

Adam Twardoch list dot adam at twardoch dot com wrote:

 Should an English language font render ö as oe,  so that Göthe
 appears automatically in the more normal English form Goethe?

 If you refer to Johann Wolfgang von Goethe, his name is *not* spelled
 with an ö anyway.

Somebody thinks so:

http://www.transkription.de/gb_seiten/beispiele/goethe.htm

-Doug Ewell
 Fullerton, California

Re: [OT] Göthe (was: Re: RE: Character identities)

2002-10-31 Thread Marc Wilhelm Küster

At 08:32 31.10.2002 -0800, Doug Ewell wrote:

Adam Twardoch list dot adam at twardoch dot com wrote:

 Should an English language font render ö as oe,  so that Göthe
 appears automatically in the more normal English form Goethe?

 If you refer to Johann Wolfgang von Goethe, his name is *not* spelled
 with an ö anyway.

Somebody thinks so:

http://www.transkription.de/gb_seiten/beispiele/goethe.htm


Both forms are permissible and used, even though Goethe is today by far the 
more frequent version -- remember that there was no standardized German 
orthography before the late 19th century and that the idea that a person's 
name has exactly one spelling is a fairly young idea in Europe.

Taking such facts into account for matching purposes is a good idea, but
changing the version for rendering is not.
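
A rough sketch of what I mean by matching, in Python. The function
match_key is a hypothetical name and the folding table is only an
assumption for German names; a real search system would need more
care than this.

```python
import unicodedata

def match_key(name):
    """Hypothetical search key: treat ae/oe/ue spellings like the umlauts."""
    folded = unicodedata.normalize("NFC", name).casefold()
    for umlaut, digraph in (("\u00E4", "ae"), ("\u00F6", "oe"), ("\u00FC", "ue")):
        folded = folded.replace(umlaut, digraph)
    return folded

stored = "G\u00F6the"  # the text as the author wrote it

# Both spellings find each other when searching...
print(match_key(stored) == match_key("Goethe"))
# ...but the stored text, which is what gets rendered, is untouched.
print(stored)
```

The key is used for comparison only and is never shown to the reader.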

Best regards,

Marc



*
Marc Wilhelm Küster
Saphor GmbH

Fronländer 22
D-72072 Tübingen

Tel.: (+49) / (0)7472 / 949 100
Fax: (+49) / (0)7472 / 949 114

Re: Character identities

2002-10-31 Thread Anto'nio Martins-Tuva'lkin

(After sending this inadvertently to Dominikus only, here's
for the list also...) On 2002.10.30, 16:26, Dominikus Scherkl
[EMAIL PROTECTED] wrote:

 A font representing my mothers handwriting (german only :-) would
 render u as u with breve above to distinguish it from the
 representation of n. I don't know how my mother would write a text
 containing an u with breve above,

FWIW, I've seen the handwriting of an elder German esperantist, and he
does exactly that: he puts breves above each and every u, both on
those which have it and on those which don't -- slightly confusing...

On the brink of off-topic-ness, something of that sort is done in
handwritten Cyrillic (at least in Russian tradition): the triple wave
of a lower case t is distinguished from the triple wave of a lower
case shch (*) by means of a stroke above the former and a stroke below
the latter.

(*) Not that I'm an enthusiast of this transliteration...

--   .
António MARTINS-Tuválkin,   |  ()|
[EMAIL PROTECTED]   ||
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS)  Não me invejo de quem tem   |
+351 917 511 549 carros, parelhas e montes   |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe   |
http://pagina.de/bandeiras/  a água em todas as fontes   |

Re: Character identities

2002-10-31 Thread Jim Allan

In Unicode, code point U+0308 is assigned to COMBINING DIAERESIS.
There are a number of precomposed forms with diaeresis.
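
As a quick illustration, a short Python sketch using the standard
unicodedata module shows how the precomposed forms decompose. The
helper name decompose is mine, chosen only for this example.

```python
import unicodedata

def decompose(ch):
    """Names of the code points in the canonical decomposition of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

# Precomposed a, o, u and e with diaeresis.
for ch in "\u00E4\u00F6\u00FC\u00EB":
    print(f"U+{ord(ch):04X}", decompose(ch))
```

Each of them decomposes to a base letter followed by the same
COMBINING DIAERESIS, whatever the mark is meant to signify.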

Let's take one of these, ü:

The diaeresis may mean separate pronunciation of the u, indicating it is not merged with a preceding
or following letter but is pronounced distinctly, as in the classical Greek
name Peirithoüs or Spanish antigüedad. Similarly in Catalan. It
is identified with the Greek dialytika
of the same meaning, which is indeed the ultimate known origin of the symbol.

The diaeresis may indicate umlaut modification of u, as in German über, a use also found in Finnish, Turkish,
Pinyin Chinese Romanization and in many other languages.

In Magyar it indicates a sound like French eu.

In IPA it indicates u with
a centralized pronunciation.

There may be other phonetic interpretations.

Of these uses, only for the second (and possibly the third) might combining
superscript e be used instead of
the diaeresis. The second certainly represents the most common use today, but not the only one.

Unicode encodes the character COMBINING DIAERESIS, not a generic UMLAUT MARKER
which might take various forms. It provides itself no way of distinguishing
between uses of diaeresis.

All the above uses might occur in German text, or Swedish text, or Finnish
text or any text which might introduce personal names or geographical names
or particular words or phrases from various languages outside the main language
of the text. The same applies for ä
and ö.

Indeed individual words with vowels and umlaut marker, whether represented
as a COMBINING DIAERESIS or COMBINING LATIN SMALL LETTER
E or following e may appear
in text in any language because
of use of technical vocabulary, eg. Senhnscht,
or in personal or place names.

Now any use of diaeresis meaning umlaut in any language might, it seems to
me, be reasonably replaced by superscript e meaning umlaut. But it is incorrect
to replace diaeresis used for any other purpose by superscript e.

In straight, plain Unicode, if you want to use diaeresis for umlaut, use diaeresis.
If you want to use combining superscript e to indicate umlaut, use COMBINING
LATIN SMALL LETTER E. Leave
any other occurrences of umlaut alone. This is the only possibility at the plain text level,
and the most robust way of choosing between diaeresis and superscript e at any level.
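
At the plain text level this amounts to something like the following
Python sketch. The helper umlaut_as_e is a hypothetical name; it works
on decomposed text and returns decomposed text, and it assumes the
caller has already decided that the word in hand uses the diaeresis
as an umlaut. Nothing in the text itself can tell it that.

```python
import unicodedata

DIAERESIS = "\u0308"      # COMBINING DIAERESIS
SUPERSCRIPT_E = "\u0364"  # COMBINING LATIN SMALL LETTER E

def umlaut_as_e(word):
    """Rewrite a/o/u + diaeresis as a/o/u + combining small e."""
    out = []
    for ch in unicodedata.normalize("NFD", word):
        # Only swap the mark when it sits on a, o or u.
        if ch == DIAERESIS and out and out[-1] in "aouAOU":
            out.append(SUPERSCRIPT_E)
        else:
            out.append(ch)
    return "".join(out)

converted = umlaut_as_e("\u00FCber")
print([f"U+{ord(c):04X}" for c in converted])
```

A diaeresis on any other base letter passes through unchanged, and the
choice of which words to convert stays with the author.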

Given a higher protocol, we can do more. We might, as suggested, have a
font which uses superscript e instead
of diaeresis, at least for the combination characters with the base characters
a, o, or u and in place of the diaeresis symbol
itself. If we have another generally identical
font with a true diaeresis instead, we can switch between fonts as necessary
depending on whether diaeresis is used for umlaut or not, or whether in particular
cases we wish to use one or the other symbol for umlaut.

Switching between such alternate fonts has long been a standby when fancy
typography is required.

Yet I don't see that there is any advantage to switching between fonts
over switching between the Unicode character COMBINING DIAERESIS
and COMBINING LATIN SMALL LETTER E. And it makes us dependent on a particular
set of fonts. That is probably not good. :-(

A better solution might be an intelligent font that recognizes some kinds
of tagging and which allows us to turn on different glyphs for diaeresis according
to the tagging, one of these glyphs being a superscript e. So we tag words and phrases. And,
magically, if that particular font works properly, we see diaeresis where
we want diaeresis and superscript e where
we want superscript e.

But it is not evident that tagging for this purpose is any easier than
entering the different Unicode characters from the beginning. And we are
again dependent on the intelligence of a particular font. Of course, we might
expect there will soon be many such intelligent fonts. It is less likely
that they will all work exactly the same, and understand exactly the same
tags in the same way. And we are restricted to such intelligent fonts as
understand a particular system of tagging rather than using almost any font.
:-(

We might propose introducing a tag or indicator of some kind at some level
to indicate a diaeresis has umlaut function, but such a tag or indicator would
probably only be used when a user wanted to use a superscript e, in
which case it is not clear that using it would have any advantage over actually
entering COMBINING LATIN SMALL LETTER E. :-(

We might go to a still higher level of protocol, to a routine or plugin in
an application or a new style feature added to HTML or XML which allows diaeresis
replacement. Just as Microsoft Word and some other programs now allow capitalization
and small capitalization as an effect, though the underlying text is still
actually in upper and lower case, so we might show a diaeresis as a superscript
e, though in fact at the plain text
level the text has a diaeresis. Presumably for viewing
