Indic editing (was: RE: The real solution)

2001-11-26 Thread Marco Cimarosti

As we all know, Unicode is a "logical" encoding, in the sense that it
assigns codes to "abstract characters", rather than to the actual signs
("glyphs") which are visible on a printed page. This design principle has
been chosen because it makes all non-visual text processing much easier.

Recently, this principle has been criticized by Arjun Aggarwal for Devanagari,
on the ground that the elements of a Unicode Devanagari string do not
correspond to the graphic elements of Devanagari text.

Several people have explained in detail how this is not an acceptable
criticism, because Unicode code points are NOT supposed to be displayed with
a direct one-to-one mapping to glyphs. 

I think that this criticism was addressed adequately, as far as the
ENCODING part is concerned, and that it is now Mr. Aggarwal's turn to make an
effort to understand better what he is criticizing.


However, I think that considering only the encoding point of view does not
capture the real reasons behind the discontent that is periodically expressed
by Indian users and engineers.

It has always been my impression that, for a native user of Indic scripts,
it is much more natural to work with visual glyphs.

Why shouldn't it be so? When you write "Arjun" with a pencil, you trace:
<a>, <j->, <danda>, <-u>, <repha>, <n->, <danda>, exactly in this order.

Who cares if, by the lexicographic point of view,  plus 
constitutes a unit? Who cares if, by the phonological point of view, 
is pronounced before ? Who cares if, by a logical point of view, 
is a  plus a virtual "virama"?

Yet, by the graphical point of view, that name is spelled using that
sequence of *glyphs*.

Similarly, what the users see on a computer screen are *glyphs*, not
abstract characters. Consequently, they should be enabled to interact with
(enter, modify, delete) the *glyphs*.

How can users be asked to enter, modify or delete objects (such as "virama",
"ZW(N)J") which are not visible and tangible on the screen? Or how can they
be asked to interact with an entity which is in a certain position,
pretending that it was somewhere else (repha, short i matra)? And why
should it be forbidden to edit visible and tangible objects (such as the
"danda" at the right side of many letters) on the basis that "logically they
do not exist"?


See the difference between the name "Arjun" as coded (©) in terms of Unicode
characters, and as rendered (®) in terms of glyphs (for a visual
representation of this example, see the attached file ARJUN.GIF.):

©  a  ra  virama  ja  -u  na
®  a  j-  danda  -u  repha  n-  danda

Unicode requires that the © form be converted to the ® form before being
displayed. This process is called "rendering" and, for Devanagari, it could be
summarized in four logical steps:

1:  Convert character codes into "glyph codes";
2:  Join some glyphs (e.g.: turn ra + virama into repha);
3:  Reorder some glyphs (e.g.: move repha to its visual position);
4:  Split some glyphs (e.g.: turn full C's into half C's + danda)

(Notice that this is a very schematic algorithm, and that actual
implementations can vary considerably; steps 1 and 4 in particular may be
dropped.)

In the case of "Arjun", the four steps perform the following changes (see
again ARJUN.GIF):

1:  a  ra virama  ja -u na
2:  a  repha  ja -u na
3:  a ja -u  repha  na 
4:  a j-  danda  -u  repha  n-  danda
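
The four steps above can be sketched in a few lines of Python. This is only a toy model under loud assumptions: the token names ("ra", "virama", "j-", "danda", ...) are the informal labels used in this post, not real code points or glyph IDs, and the function names and the small CONSONANTS / MATRAS / HALF tables are hypothetical. Real rendering engines handle far more cases.

```python
# Toy model of the four schematic rendering steps for Devanagari.
# All names and tables below are illustrative assumptions, not a real engine.
CONSONANTS = {"ra", "ja", "na", "la"}
MATRAS = {"-u"}                                 # vowel signs that stay with their consonant
HALF = {"ja": "j-", "na": "n-", "la": "l-"}     # full consonant -> half form (plus danda)

def to_glyph_codes(chars):
    # Step 1: one glyph code per character (identity mapping in this toy model).
    return list(chars)

def join(glyphs):
    # Step 2: ra + virama followed by a consonant becomes a single repha.
    out, i = [], 0
    while i < len(glyphs):
        if (glyphs[i] == "ra" and i + 2 < len(glyphs)
                and glyphs[i + 1] == "virama" and glyphs[i + 2] in CONSONANTS):
            out.append("repha")
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out

def reorder(glyphs):
    # Step 3: move each repha after the following consonant and its matras.
    out, i = [], 0
    while i < len(glyphs):
        if glyphs[i] == "repha" and i + 1 < len(glyphs):
            j = i + 1
            cluster = [glyphs[j]]
            j += 1
            while j < len(glyphs) and glyphs[j] in MATRAS:
                cluster.append(glyphs[j])
                j += 1
            out.extend(cluster)
            out.append("repha")
            i = j
        else:
            out.append(glyphs[i])
            i += 1
    return out

def split(glyphs):
    # Step 4: turn full consonants into half consonant + danda.
    out = []
    for g in glyphs:
        if g in HALF:
            out.extend([HALF[g], "danda"])
        else:
            out.append(g)
    return out

def render(chars):
    return split(reorder(join(to_glyph_codes(chars))))

arjun = ["a", "ra", "virama", "ja", "-u", "na"]
print(render(arjun))
```

Calling join, reorder and split one after the other on the coded form of "Arjun" reproduces lines 2, 3 and 4 of the listing above.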


So far so good: I see "Arjun" on the screen.

But what if now I want to change "Arjun" into, say, "Aljun"? By the
"logical" point of view, I should simply delete the ra and enter a la in the
same position.

But, on my screen, there is no ra at all! Moreover, there is no consonant at
all before the ja, because the group ra+virama is displayed as a combining
repha AFTER the j+danda+u group.

Looking at the screen, the natural thing to do is to move to the repha and
delete it, then move between the a and the ja and insert a half la.

In order to accomplish a WYSIWYG editing of this kind, Unicode text should
first be converted to a TEMPORARY INTERMEDIATE FORM, less "logical" and
more "visual".

In the case of Devanagari, a glyphic representation quite similar to the old
"font encodings" should be used. With such an intermediate code, the user
should be enabled to select and delete the danda of a letter to form a half
letter, to enter or delete a matra i or a repha by placing the cursor in
their visual position, and so on.

The algorithm to convert Unicode to this intermediate glyphic representation
already exists, and it is the four steps that I described above, which are
now part of rendering engines and smart fonts.

The difference is that this algorithm should be run BEFORE going into the
visualization phase.

The bigger difference is that editing actions should be executed on this
intermediate code and, therefore, there is a need for a "DErendering"
algorithm, which converts a portion of visual text back to real Unicode.
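
As a rough illustration of what such a "DErendering" step might look like, here is a Python sketch that undoes the split, reorder and join steps on the same informal token names used in this post. Everything in it is an assumption made for illustration (the function name, the FULL and MATRAS tables, and the rule that a half consonant without its danda stands for consonant + virama); the post does not specify the algorithm.

```python
# Toy "derendering": visual glyph sequence -> logical character sequence.
# Names and tables are hypothetical; only the cases in this thread are covered.
FULL = {"j-": "ja", "n-": "na", "l-": "la"}   # half form -> full consonant
MATRAS = {"-u"}

def derender(glyphs):
    # Undo the split: half + danda -> full consonant,
    # a half form with no danda -> consonant + virama (assumed rule).
    merged, i = [], 0
    while i < len(glyphs):
        g = glyphs[i]
        if g in FULL:
            if i + 1 < len(glyphs) and glyphs[i + 1] == "danda":
                merged.append(FULL[g])
                i += 2
            else:
                merged.extend([FULL[g], "virama"])
                i += 1
        else:
            merged.append(g)
            i += 1

    # Undo the reorder and join: put each repha back in front of the
    # consonant (and matras) it sits on, expanded to ra + virama.
    out = []
    for g in merged:
        if g == "repha":
            k = len(out)
            while k > 0 and out[k - 1] in MATRAS:
                k -= 1
            k = max(k - 1, 0)          # step over the base consonant
            out[k:k] = ["ra", "virama"]
        else:
            out.append(g)
    return out

# The rendered form of "Arjun" goes back to its coded form.
print(derender(["a", "j-", "danda", "-u", "repha", "n-", "danda"]))

# A visual edit: repha deleted, half la inserted between a and ja.
print(derender(["a", "l-", "j-", "danda", "-u", "n-", "danda"]))
```

In this toy model the visually edited sequence comes back as a, la, virama, ja, -u, na, which is the logical spelling one would expect for "Aljun".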


A very similar thing 

Re: Indic editing (was: RE: The real solution)

2001-11-26 Thread Kenneth Whistler

Marco wrote:

> 
> In the case of "Arjun", the four steps perform the following changes (see
> again ARJUN.GIF):
> 
> 1:  a  ra virama  ja -u na
> 2:  a  repha  ja -u na
> 3:  a ja -u  repha  na 
> 4:  a j-  danda  -u  repha  n-  danda
> 
> 
> So far so good: I see "Arjun" on the screen.
> 
> But what if now I want to change "Arjun" into, say, "Aljun"? By the
> "logical" point of view, I should simply delete the ra and enter a la in the
> same position.
> 
> But, on my screen, there is no ra at all! Moreover, there is no consonant at
> all before the ja, because the group ra+virama is displayed as a combining
> repha AFTER the j+danda+u group.
> 
> Looking at the screen, the natural thing to do is to move to the repha and
> delete it, then move between the a and the ja and insert a half la.

Actually, I would disagree with this. Trying to select and edit a
repha, or any other mark above or below another letter is a pain,
both to implement and from the point of view of a user trying to
work with selection.

My answer to this is that the natural thing to do is to cursor down
before the na to get an insertion point. Then:

   backspace backspace backspace backspace la virama ja u

Or, in terms of backing store:

   a  ra  virama  ja  -u  |  na
   a  ra  virama  ja  |  na
   a  ra  virama  |  na
   a  ra |  na
   a  |  na
   a  la |  na
   a  la  virama |  na
   a  la  virama  ja |  na
   a  la  virama  ja  -u  |  na

And I'm done. 8 keystrokes after the cursor down, but more efficient
than trying to mess with selecting the repha.
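
The keystroke sequence above can be replayed on a toy backing store. This is just a sketch: the `replay` function and the token names are hypothetical, and "backspace" is assumed to erase exactly one code point before the insertion point.

```python
# Replay keystrokes at an insertion point in a list of logical characters.
# Token names follow the informal labels used in this thread.
def replay(before, after, keys):
    """Return every buffer state, with "|" marking the insertion point."""
    before = list(before)
    after = list(after)
    states = [before + ["|"] + after]
    for key in keys:
        if key == "backspace":
            if before:
                before.pop()          # erase one code point
        else:
            before.append(key)        # type one character
        states.append(before + ["|"] + after)
    return states

keys = ["backspace"] * 4 + ["la", "virama", "ja", "-u"]
states = replay(["a", "ra", "virama", "ja", "-u"], ["na"], keys)
for state in states:
    print("  ".join(state))
```

The printed states are the nine lines of the backing-store listing: the starting state plus one line per keystroke.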

Consider how often people will correct spelling errors, for example,
by backspacing and retyping, rather than trying to select to a
specific spot to correct and then having to reselect back to the
original spot to continue. It is simply more efficient to do it
this way.

And the above example could be even more efficient if the editing
system implemented the backspace/erase function to clobber syllable
parts (or grapheme clusters) instead of character at a time. But
you also have to consider ergonomic issues there. It may introduce
inefficiencies and mistakes if a backspace/erase deletes more
characters than one keystroke's worth of entry. One principle of
low-level editing (without IME input/select/commit operations)
ought generally to be: key key key erase erase erase
should leave you with no change to text.
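
The "key key key erase erase erase" principle can be checked mechanically. The sketch below compares a code-point erase with a cluster erase on the same toy tokens; the function names and the very rough notion of a "cluster" (consonant + virama pairs, a base letter, then matras) are assumptions made for illustration, not a real segmentation algorithm.

```python
# Compare two erase behaviours against the principle that typing N keys
# and then erasing N times should leave the text unchanged.
CONSONANTS = {"ra", "ja", "na", "la"}
MATRAS = {"-u"}

def erase_code_point(buf):
    # Remove exactly one stored character.
    return buf[:-1]

def erase_cluster(buf):
    # Remove the whole trailing syllable (toy definition of a cluster).
    i = len(buf)
    while i > 0 and buf[i - 1] in MATRAS:
        i -= 1
    if i > 0 and buf[i - 1] == "virama":      # dangling virama at the end
        i -= 1
    if i > 0:                                  # the base letter
        i -= 1
    while i >= 2 and buf[i - 1] == "virama" and buf[i - 2] in CONSONANTS:
        i -= 2                                 # preceding consonant + virama
    return buf[:i]

def round_trip(start, keys, erase):
    """Type `keys` after `start`, erase once per key, compare with `start`."""
    buf = list(start) + list(keys)
    for _ in keys:
        buf = erase(buf)
    return buf == list(start)

typed = ["la", "virama", "ja", "-u"]
print(round_trip(["a"], typed, erase_code_point))
print(round_trip(["a"], typed, erase_cluster))
```

With the code-point erase the round trip holds; with the cluster erase a single keystroke already removes all four typed characters, so the remaining erases eat into the text that was there before.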

> 
> In order to accomplish a WYSIWYG editing of this kind, Unicode text should
> first be converted to a TEMPORARY INTERMEDIATE FORM, less "logical" and
> more "visual".

I'm not suggesting that this isn't also a possible approach to implementing
Devanagari editing -- just that the issue of what a user does to
deal with editing existing text, under the current Unicode model,
isn't that big a deal for repha and its ilk. On the other hand, the
reordrant vowels might well lend themselves to editor extensions that
work in a visual mode as well as a logical mode.

--Ken





Re: Indic editing (was: RE: The real solution)

2001-11-26 Thread James E. Agenbroad

  Monday, November 26, 2001
It seems to me that we have three separate domains to deal with:
 1. What should be keyed as input of Indic scripts, mainly Devanagari?
 2. How shall Indic scripts data be stored and exchanged?
 3. How should Indic scripts be displayed on screens and in print?

ISCII and Unicode are not concerned with the first.  They are very
concerned with the second. There may be general agreement on the third,
but a variety of output devices are involved.  Unless ISCII changed from a
phonetic-based approach to a graphic-based one, I doubt that Unicode and ISO
10646 would even consider doing so. Having attended a meeting in 1982 of
those who drafted ISCII, I doubt that this will happen. Might it be
possible to key data in a user-oriented glyph/graphic fashion and then
convert it to a phonetic encoding for storage, processing and sharing?  And
then, for rendering, convert it from the phonetic encoding to whatever the
local display needs (OS, fonts, etc.) for human consumption?  I do not
know if agreement could be achieved on keyboard layouts for Indian
scripts; though desirable to facilitate mobility of those with keying
skills, such standardization may not be necessary: both qwerty and Dvorak
keyboards can result in ASCII data.
 Regards,
  Jim Agenbroad (disclaimer and addresses at bottom)

Re: Indic editing (was: RE: The real solution)

2001-11-26 Thread Charlie Jolly

The debate about Indic does highlight important issues for the end user.

(I am not a Hindi reader/writer.)

Hindi text entered in Notepad under Windows XP exhibits the following behaviour.
Backspacing deletes characters as they appear to be typed.
Pressing delete before characters deletes blocks of characters.

Charlie Jolly







What is meant by '\u0000'?

2001-11-26 Thread juuichiketajin



> As for cut & paste, it might work among Microsoft Apps
> but if one wants to interface an app with a disclosed
> clipboard format he will realize that he can not paste
> unicode text that contains '\u0000' characters. Impossible.

Does he mean specifically the character U+0000, or rather any character referenced by
hex codepoint?

Hex codepoints (or something similar) are sometimes needed to keep ASCII-only systems 
from trashing your data. I have used them so much I took my copy of the Unicode 
standard and on the hiragana page, wrote some decimal equivalents of hex numbers there.
Hex codepoints are an excellent idea when making a display that is to be shown on 
browsers set to who knows what codepage.

Tex, how did you do the name page, anyway?

It would be useful to have a utility where you type text and out come the '\u' 
type strings (or else HTML hash codes) for use in a Java program or Web page.
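
Something along these lines would do it. This is a minimal Python sketch, and the two function names are made up for the example. It leaves ASCII alone, writes everything else as Java-style '\u' escapes (using UTF-16 surrogate pairs above U+FFFF, since Java escapes are 16-bit units) or as HTML numeric character references, and makes no attempt to escape '&', '<' or backslashes already in the text.

```python
# Turn typed text into ASCII-only escape strings.
# Function names are hypothetical; only the standard library is used.
def to_java_escapes(text):
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)                      # plain ASCII passes through
        else:
            units = ch.encode("utf-16-be")      # 2 bytes, or 4 for a surrogate pair
            for k in range(0, len(units), 2):
                out.append("\\u%04X" % int.from_bytes(units[k:k + 2], "big"))
    return "".join(out)

def to_html_ncr(text):
    # Hexadecimal numeric character references, e.g. &#x3042;
    return "".join(ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch) for ch in text)

sample = "a \u0905 \u3042"      # Latin a, Devanagari letter A, hiragana A
print(to_java_escapes(sample))
print(to_html_ncr(sample))
```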