As we all know, Unicode is a "logical" encoding, in the sense that it
assigns codes to "abstract characters", rather than to the actual signs
("glyphs") which are visible on a printed page. This design principle has
been chosen because it makes all non-visual text processing much easier.

Recently, this principle has been criticized by Arjun Aggarwal for Devanagari,
on the ground that the elements of a Unicode Devanagari string do not
correspond to the graphic elements of Devanagari text.

Several people have explained in detail how this is not an acceptable
criticism, because Unicode code points are NOT supposed to be displayed with
a direct one-to-one mapping to glyphs. 

I think that this criticism was addressed adequately, as far as the
ENCODING part is concerned, and that it is now Mr. Aggarwal's turn to make
an effort to understand better what he is criticizing.


However, I think that considering only the encoding point of view does not
capture the real reasons behind the discontent that is periodically
expressed by Indian users and engineers.

It has always been my impression that, for a native user of Indic scripts,
it is much more natural to work with visual glyphs.

Why shouldn't it be so? When you write "Arjun" with a pencil, you trace:
<a>, <j->, <danda>, <-u>, <repha>, <n->, <danda>, exactly in this order.

Who cares if, from the lexicographic point of view, <j-> plus <danda>
constitutes a unit? Who cares if, from the phonological point of view,
<repha> is pronounced before <j->? Who cares if, from a logical point of
view, <repha> is a <ra> plus a virtual "virama"?

From the graphical point of view, that name is still spelled using that
sequence of *glyphs*.

Similarly, what the users see on a computer screen are *glyphs*, not
abstract characters. Consequently, they should be enabled to interact with
(enter, modify, delete) the *glyphs*.

How can users be asked to enter, modify or delete objects (such as "virama",
"ZW(N)J") which are not visible and tangible on the screen? Or how can they
be asked to interact with an entity which is in a certain position,
pretending that it was somewhere else (repha, short i matra)? And why
should it be forbidden to edit visible and tangible objects (such as the
"danda" at the right side of many letters) on the basis that "logically they
do not exist"?


See the difference between the name "Arjun" as coded (©) in terms of Unicode
characters, and as rendered (®) in terms of glyphs (for a visual
representation of this example, see the attached file ARJUN.GIF):

    ©  a  ra  virama  ja  -u  na
    ®  a  j-  danda  -u  repha  n-  danda

Unicode requires that the © form be converted to the ® form before being
displayed. This process is called "rendering" and, for Devanagari, it could
be summarized in four logical steps:

    1:  Convert character codes into "glyph codes";
    2:  Join some glyphs (e.g.: turn ra + virama into repha);
    3:  Reorder some glyphs (e.g.: move repha to its visual position);
    4:  Split some glyphs (e.g.: turn full C's into half C's + danda)

(Notice that this is a very schematic algorithm, and that actual
implementations can vary considerably; in particular, points 1 and 4 may be
dropped.)

In the case of "Arjun", the four steps perform the following changes (see
again ARJUN.GIF):

    1:  a  ra     virama  ja         -u         na
    2:  a  repha          ja         -u         na
    3:  a                 ja         -u  repha  na 
    4:  a                 j-  danda  -u  repha  n-  danda
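To make the four steps concrete, here is a minimal sketch in Python that works on lists of token names rather than on real code points or font glyphs. The names (CONSONANTS, HALF_FORMS, is_matra, render) and the tiny tables are my own assumptions for illustration only; this is not how any actual rendering engine or smart font is written.

```python
# Toy token names follow the notation used in this message. The tables are
# illustrative assumptions, not data from a real font or shaping engine.
CONSONANTS = {"ra", "ja", "na", "la"}
HALF_FORMS = {"ja": "j-", "na": "n-", "la": "l-"}  # consonants drawn with a danda


def is_matra(token):
    # Dependent vowel signs are written "-u", "-i", ... in this notation.
    return token.startswith("-")


def render(chars):
    # Step 1: character codes -> glyph codes (a plain copy in this toy model).
    glyphs = list(chars)

    # Step 2: join ra + virama before a consonant into a single repha.
    joined = []
    i = 0
    while i < len(glyphs):
        if (glyphs[i:i + 2] == ["ra", "virama"]
                and i + 2 < len(glyphs) and glyphs[i + 2] in CONSONANTS):
            joined.append("repha")
            i += 2
        else:
            joined.append(glyphs[i])
            i += 1

    # Step 3: move each repha after the following consonant and its matras.
    reordered = []
    i = 0
    while i < len(joined):
        if joined[i] == "repha":
            j = i + 2  # skip the consonant that carries the repha
            while j < len(joined) and is_matra(joined[j]):
                j += 1
            reordered.extend(joined[i + 1:j])
            reordered.append("repha")
            i = j
        else:
            reordered.append(joined[i])
            i += 1

    # Step 4: split full consonants into half form + danda.
    out = []
    for g in reordered:
        if g in HALF_FORMS:
            out.extend([HALF_FORMS[g], "danda"])
        else:
            out.append(g)
    return out


print(render(["a", "ra", "virama", "ja", "-u", "na"]))
# ['a', 'j-', 'danda', '-u', 'repha', 'n-', 'danda']
```

The sketch ignores conjuncts, short i matra reordering and everything else a real implementation must handle; it only reproduces the "Arjun" table.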


So far so good: I see "Arjun" on the screen.

But what if I now want to change "Arjun" into, say, "Aljun"? From the
"logical" point of view, I should simply delete the ra and enter a la in the
same position.

But, on my screen, there is no ra at all! Moreover, there is no consonant at
all before the ja, because the group ra+virama is displayed as a combining
repha AFTER the j+danda+u group.

Looking at the screen, the natural thing to do is to move to the repha and
delete it, then move between the a and the ja and insert a half la.

In order to accomplish WYSIWYG editing of this kind, Unicode text should
first be converted to a TEMPORARY INTERMEDIATE FORM, less "logical" and
more "visual".

In the case of Devanagari, a glyphic representation quite similar to the old
"font encodings" should be used. With such an intermediate code, the user
should be enabled to select and delete the danda of a letter to form a half
letter, to enter or delete a matra i or a repha by placing the cursor in
their visual position, and so on.
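As a rough illustration of what editing on such an intermediate code could look like, the following Python sketch treats the glyphic form as a plain list of glyph names and performs the "Arjun" to "Aljun" change exactly as one would do it looking at the screen. The helper names (delete_glyph, insert_glyph) and the glyph name "l-" for half la are hypothetical, chosen only for this example.

```python
def delete_glyph(glyphs, index):
    # Return a new glyph list with the glyph at `index` removed.
    return glyphs[:index] + glyphs[index + 1:]


def insert_glyph(glyphs, index, glyph):
    # Return a new glyph list with `glyph` inserted before position `index`.
    return glyphs[:index] + [glyph] + glyphs[index:]


# "Arjun" in the intermediate glyphic form (visual order).
arjun = ["a", "j-", "danda", "-u", "repha", "n-", "danda"]

# Move to the repha and delete it...
without_repha = delete_glyph(arjun, arjun.index("repha"))

# ...then move between the a and the ja and insert a half la.
aljun = insert_glyph(without_repha, 1, "l-")

print(aljun)
# ['a', 'l-', 'j-', 'danda', '-u', 'n-', 'danda']
```

The point is only that every cursor position and every editable unit corresponds to something visible; nothing here knows about virama or logical order.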

The algorithm to convert Unicode to this intermediate glyphic representation
already exists, and it is the four steps that I described above, which are
now part of rendering engines and smart fonts.

One difference is that this algorithm should be run BEFORE going into the
visualization phase.

The bigger difference is that editing actions should be executed on this
intermediate code and, therefore, there is a need for a "DErendering"
algorithm, which converts a portion of visual text back to real Unicode.


A very similar thing was discussed on this list months ago in connection
with bidirectional editing. I find that the process of reversing Indic
rendering is even easier than a "reverse bidi" algorithm.

It is possible to DErender Devanagari text by running the same rendering
algorithm listed before backwards and with reversed meanings:

    4:  Join some glyphs (e.g.: turn half C's + danda into full C's);
    3:  Reorder some glyphs (e.g.: move repha to its logical position);
    2:  Split some glyphs (e.g.: turn repha into ra + virama);
    1:  Convert glyph codes into character codes.

In the case of "Arjun", the four steps perform the following changes (see
again ARJUN.GIF, reading the four points from bottom to top):

    4:  a                 j-  danda  -u  repha  n-  danda
    3:  a                 ja         -u  repha  na 
    2:  a  repha          ja         -u         na
    1:  a  ra     virama  ja         -u         na
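The reversed steps can be sketched in Python in the same toy notation, again on lists of token names. The names (FULL_FORMS, CONSONANTS, is_matra, derender) are assumptions of mine, and one rule goes beyond the four steps as listed: a half form that is NOT followed by a danda is taken to mean consonant + virama, which is my guess at how a hand-entered half letter would map back to Unicode.

```python
# Illustrative tables only; not taken from a real font or rendering engine.
FULL_FORMS = {"j-": "ja", "n-": "na", "l-": "la"}
CONSONANTS = {"ra", "ja", "na", "la"}


def is_matra(token):
    # Dependent vowel signs are written "-u", "-i", ... in this notation.
    return token.startswith("-")


def derender(glyphs):
    # Step 4: join half form + danda into a full consonant.
    # Assumption: a half form with no danda becomes consonant + virama.
    joined = []
    i = 0
    while i < len(glyphs):
        g = glyphs[i]
        if g in FULL_FORMS:
            joined.append(FULL_FORMS[g])
            if i + 1 < len(glyphs) and glyphs[i + 1] == "danda":
                i += 1  # consume the danda
            else:
                joined.append("virama")
        else:
            joined.append(g)
        i += 1

    # Step 3: move each repha back before the consonant it sits on
    # (simplified: skips matras and a single consonant, not conjuncts).
    reordered = []
    for g in joined:
        if g != "repha":
            reordered.append(g)
            continue
        k = len(reordered)
        while k > 0 and is_matra(reordered[k - 1]):
            k -= 1
        if k > 0 and reordered[k - 1] in CONSONANTS:
            k -= 1
        reordered.insert(k, "repha")

    # Step 2: split repha into ra + virama.
    chars = []
    for g in reordered:
        chars.extend(["ra", "virama"] if g == "repha" else [g])

    # Step 1: glyph codes -> character codes (nothing to do in this toy model).
    return chars


print(derender(["a", "j-", "danda", "-u", "repha", "n-", "danda"]))
# ['a', 'ra', 'virama', 'ja', '-u', 'na']
print(derender(["a", "l-", "j-", "danda", "-u", "n-", "danda"]))
# ['a', 'la', 'virama', 'ja', '-u', 'na']
```

The second call shows a glyph sequence of the kind a user might produce by replacing the repha with a half la: it comes back as la + virama + ja, without the user ever touching a virama.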

Notice that such an intermediate code can however be slightly MORE abstract
than a mere list of glyph variants: tiny and insignificant variations (such
as the different heights or sizes of combining glyphs, or the choice of
ligatures that are not strictly mandatory) may still be left to smart fonts
to handle.


Just my 0.02 euros.

_ Marco

Attachment: arjun.gif