Peter Constable wrote:
> >> In TrueType/OpenType, glyphs don't have to be mapped (assigned to
> >> code points).
> >This is a myth that I hope to see eradicated as soon as possible.
> Marco, you are generating a myth that I hope not to see catch 
> on. James is absolutely right.

Sorry, I have been quite clumsy once again.

I was using the term "mapping" with a more low-level connotation: given a
number from a certain set (in this case, Unicode code points), doing a
look-up in a dictionary (in this case, the cmap and other tables in a "smart
font") in order to find the corresponding number in a different set (in
this case, the arbitrary glyph id inside a particular font).

When displaying Unicode text, this "mapping" process has to be done in one
way or another, and it requires a certain amount of time and memory.
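The "mapping" described above can be sketched as a plain dictionary
look-up. This is only an illustration: the glyph ids below are made up,
since real fonts assign them arbitrarily, and `lookup_glyph` is a
hypothetical helper, not an API of any actual renderer.

```python
# Hypothetical cmap: Unicode code points -> font-internal glyph ids.
# The glyph id values are invented for illustration only.
cmap = {
    0x0041: 36,   # 'A' -> glyph 36 (hypothetical)
    0x00C1: 105,  # 'A with acute' -> glyph 105 (hypothetical)
}

def lookup_glyph(code_point, notdef=0):
    """Return the glyph id for a code point, or the .notdef glyph (id 0)."""
    return cmap.get(code_point, notdef)
```

Whatever the font technology, a renderer performs some equivalent of this
look-up for every character it draws, which is where the time and memory
cost comes from.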

With *this* meaning in mind, I meant: every Unicode renderer needs the time
and memory to do this "mapping", so I don't see why an OpenType-based
renderer on Windows NT should perform better than a BDF renderer on X Windows.

It was only after your and James' replies that I realized that, in font
technology, the term "mapped", applied to a glyph, means that the glyph "is
in the cmap".

James Kass wrote:
> MC> 
> > Sorry, I miss the implication of this.
> > If the user can't access them it is probably because she 
> > doesn't need [...]
> Exactly, and I'm sorry for failing to be clear.  In a Latin font, the
> character LATIN CAPITAL LETTER A WITH ACUTE may be a compound
> glyph formed internally (by the font, not an engine) by combining 
> the "A" glyph, which is mapped, with a "capital letter 
> combining acute" glyph, which need not be mapped.

OK: "mapped" in the sense "being in the cmap".

I would have improperly said that the character "LATIN CAPITAL LETTER A WITH
ACUTE" maps to the glyphs "A" + "capital letter combining acute".

I should find an alternative term, such as "look-up", that does not cause
confusion with mainstream TT/OT terminology.

> You had wondered if the proposed PUA use wasn't just to have
> a place to store the glyphs in the font which might be needed
> by the rendering engine, I was pointing out that since such glyphs
> don't need to be mapped, there would be no need of the PUA for
> that reason.  I misunderstood your point at first, the PUA encodings
> would be needed in order to display the glyphs on the older OS,

I was reasoning about using the PUA for storing these presentation glyphs
only because you mentioned it (quoting Andrew Cunningham), and I assumed
that you were talking about the implementation of rendering engines on
systems that have no "smart fonts".

Now that I have re-read your message, I realize that you were actually
talking about interim encodings.

However, no, I don't think that such interim encodings are a very good idea.
The reason is that they encourage users to produce corpora of e-texts that
will become unreadable as soon as those private conventions are forgotten.

Of course, when entire scripts are unencoded, there is no other choice for
*encoding* them.

But the matter is different when the text can already be *encoded* but not
yet *rendered*.
In this case, I would justify whatever dirty trick provides interim ways
of *displaying* the text -- but the files themselves should be encoded in a
standard way, because they might have a life much longer than the software
used to view or edit them.

> Displaying dot matrices on VGA screens also isn't rocket science.

OK. But it would bring Unicode to a whole class of devices that will never
be able to support TrueType (not to mention "smart fonts").

> The reason I mentioned converting a page to PUA rather than 
> using an internal display buffer is that the source page would 
> only need to be processed once, and then it could be operated upon 
> by any application on the system.  Another reason for using PUA 
> at all is that apps already exist which can handle PUA and new apps 
> wouldn't have to be built.  With some kind of ad hoc PUA registry for
> Latin w/diacritics, only one conversion program would be needed
> to cover the hundreds or thousands of languages involved.

If I had the time and resources to maintain such a registry, to develop
all the necessary conversion software, and to write all the documentation
needed to explain to people why, how and when to use such tools... Well,
I would rather devote all that time and those resources to implementing
stacking diacritics on more and more systems.

> The 'cmap', as you say, converts code points to glyph IDs,
> nothing more.
> But, I'm not sure what you mean by using pseudo-Unicode scalars as
> glyph indexes.  The first glyph in the font is glyph index zero, the
> second glyph in the font is glyph index one, and so forth.  

OK. Once again, I attributed a wider meaning to a term which has a very
well-defined meaning in TrueType.

By "glyph id" I meant whatever number (or key) enables a program to look up
a given glyph in a font.

The kind of font that I had in mind was a very naive bitmapped font such as
Unifont. In this kind of technology, there is no "cmap" to mediate between
the internal ids generated by the font designer and the agreed-upon ids
known outside the font (such as the code points of a standard character set).

Consequently, a program using such a font must know and use the glyph ids
directly to query the font. For this reason, the glyph ids often correspond
to the code points of a character set, rather than being arbitrary
machine-generated numbers.
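A toy model of such a cmap-less bitmapped font might look like the sketch
below. The tiny 6x5 "bitmap" for 'A' is a made-up stand-in for a real
16x16 Unifont glyph, and `render` is a hypothetical helper; the point is
only that the glyph id *is* the code point, so the program indexes the
font directly.

```python
# Cmap-less bitmapped font: keyed directly by code point.
# The bitmap below is a toy placeholder, not real Unifont data.
bitmap_font = {
    0x0041: [
        " ###  ",
        "#   # ",
        "##### ",
        "#   # ",
        "#   # ",
    ],
}

def render(text):
    """Print each character's bitmap; unknown characters get a filler box."""
    for ch in text:
        rows = bitmap_font.get(ord(ch), ["??????"] * 5)
        print("\n".join(rows))
```

No look-up table mediates between character and glyph here: `ord(ch)` is
already the key into the font.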

However, supporting Unicode (or any other "complex encoding" such as ISCII)
with such a technology requires the program to also allocate ids somewhere
for glyphs that don't have a code point in the character set.

In this case too, I should invent an alternative term to avoid confusion.
"Glyph codes"?

Now, most characters correspond to a single glyph, and the "glyph code" for
those characters would be identical to the character's code point. E.g.:
character U+00A2 -> glyph 0x00A2.

In some cases a single character corresponds to several glyph codes, but
Unicode is merciful enough to provide poor renderers with appropriate
"presentation glyphs". E.g.: character U+0628 -> glyphs 0xFE91, 0xFE92,
0xFE90 or 0xFE8F.

In other cases, these extra glyphs do not exist in Unicode and, therefore,
they must be allocated to some value that does not conflict with any
interesting code point. One popular area for such codes is the PUA. E.g.:
character U+0712 -> glyphs 0xE002, 0xE003, 0xE004 or 0xE005.

Another possible choice is finding values that are outside the Unicode
space. E.g.: character U+0712 -> glyphs 0xFF0002, 0xFF0003, 0xFF0004 or
0xFF0005.
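The three cases above can be sketched in one table. The Arabic
presentation-form values come from Unicode's Arabic Presentation Forms-B
block; the PUA values for U+0712 are, as in the text, purely a private
convention, and both the joining-form numbering and the `glyph_code`
helper are hypothetical.

```python
# Joining forms (hypothetical numbering): 0 = isolated, 1 = final,
# 2 = medial, 3 = initial.
GLYPH_CODES = {
    # Arabic BEH: Unicode itself encodes the presentation forms (FB50..FEFF).
    0x0628: [0xFE8F, 0xFE90, 0xFE92, 0xFE91],
    # Syriac BETH: forms allocated in the PUA by private convention.
    0x0712: [0xE002, 0xE003, 0xE004, 0xE005],
}

def glyph_code(code_point, form=0):
    """Return the glyph code for a character in a given joining form.
    Characters with a single glyph map to their own code point."""
    forms = GLYPH_CODES.get(code_point)
    if forms is None:
        return code_point  # e.g. U+00A2 -> glyph 0x00A2
    return forms[form]
```

The third case, glyph codes above the Unicode space (e.g. 0xFF0002..),
would work the same way: only the values stored in the table change, since
nothing requires a glyph code to be a valid code point.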

I know that all this probably sounds very ridiculous to people used to
OpenType or ATSUI.

But it is also very ridiculous that a retail application built entirely on
Windows 2000, running the nicest Unicode-enabled database engine, should
ban Unicode (and 4/5 of its potential market) only because a bloody
(mandatory) green-LED price display cannot use OpenType.

A nutshell implementation of Unicode rendering with bitmapped fonts could
bring the multilingual revolution of Unicode to this kind of small
"embedded" device as well.

If you think that OpenType is the only way, please look around when you go
back home (driving your car, sitting on your train, walking through the
downtown roads) and ask yourself:

- "would that traffic display be able to use OpenType fonts?";

- "would that display on the front of the train work if the name of the last
station were in Arabic?";

- "would that liquid crystal strip on the front of my car be able to display
'out of fuel' in Devanagari?";

- Etc. I am sure that everybody living in a big city can find plenty more
examples.

(Of course, this exercise is only for people in rich countries. People
elsewhere are still waiting for their 486's...)

_ Marco
