RE: Indic editing (was: RE: The real solution)

2001-12-20 Thread Roozbeh Pournader


A dead thread, but worth noting that:

On Tue, 18 Dec 2001, Marco Cimarosti wrote:

  Would you kindly tell me how i can construct such input methods and
  ultimately create fonts.
 
 Er... It is not so easy to do this kind of thing yourself. You should buy
 (or otherwise get) software that properly supports Devanagari.

You can also get Pango (http://www.pango.org/). It's a Free library that 
supports Unicode's Devanagari and other Indic scripts on both Linux and 
Windows.

roozbeh





Re: Indic editing (was: RE: The real solution)

2001-12-03 Thread Arjun Aggarwal

Hi Everybody

The statement by Mr. John Hudson that phonetic keyboarding, while the norm
for the Indian publishing and typesetting industries, was not the norm for
typewriters is not entirely correct. It was not the norm earlier, but it
has been the norm for many years now.

Moreover, the concept of la = half la + danda may be natural for people
who are used to typewriters and typography, that is, some of the people
who are more likely to switch to computers.

I fully agree with Mr. Marco Cimarosti in this regard.

This is the point on which I really wanted everybody to focus, i.e. the
problem of encoding as well as display.

Yes, there are many easy solutions. The fact is that these are worth nothing
until Unicode officially adopts one of them.

This is the ultimate truth, and this was the main point with which I
initiated this discussion.

With Regards
Arjun Aggarwal
[EMAIL PROTECTED]







Re: Indic editing (was: RE: The real solution)

2001-12-03 Thread Michael \(michka\) Kaplan

From: Arjun Aggarwal [EMAIL PROTECTED]

 Moreover, the concept of la = half la + danda may be natural for people
 who are used to typewriters and typography, that is, some of the people
 who are more likely to switch to computers.

 I fully agree with Mr. Marco Cimarosti in this regard.

 This is the point on which I really wanted everybody to focus, i.e. the
 problem of encoding as well as display.

Well, you do need to understand that you could actually create input methods
that would allow people who wish to type this way to do so -- and the
underlying data could still be stored using the current encoding.

The needs of those who wish to keep their keyboards can be met without
trying to undo all the implementations that have been done.
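
A minimal sketch of that idea, assuming a hypothetical typewriter-style
keymap in which the user types a half form and then a danda (the vertical
stem, in this thread's terminology). The key names and the `type_keys`
function are invented for illustration; the point is only that the buffer
ends up holding ordinary Unicode.

```python
# Hypothetical typewriter-style input method: the user types visual pieces
# (half consonant, then danda), but the buffer holds ordinary Unicode.
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# Assumed key names, for illustration only; a real layout would differ.
HALF_KEYS = {"l-": "\u0932", "j-": "\u091C", "n-": "\u0928"}  # LA, JA, NA
PLAIN_KEYS = {"a": "\u0905", "-u": "\u0941"}  # LETTER A, VOWEL SIGN U


def type_keys(keys):
    """Turn a sequence of visual keystrokes into a Unicode string."""
    buf = []
    for key in keys:
        if key in HALF_KEYS:
            # A half form is stored as consonant + virama.
            buf += [HALF_KEYS[key], VIRAMA]
        elif key == "danda":
            # The danda completes the preceding half form: drop the virama.
            if buf and buf[-1] == VIRAMA:
                buf.pop()
        else:
            buf.append(PLAIN_KEYS[key])
    return "".join(buf)
```

Typing "half la, danda" this way stores a plain LA, so the visual typing
habit and the existing encoding can coexist.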

--
MichKa

Michael Kaplan
Trigeminal Software, Inc.  -- http://www.trigeminal.com/






RE: Indic editing (was: RE: The real solution)

2001-12-03 Thread Marco Cimarosti

Arjun Aggarwal wrote:
 Moreover, the concept of la = half la + danda may be natural for people
 who are used to typewriters and typography, that is, some of the people
 who are more likely to switch to computers.
 
 I fully agree with Mr. Marco Cimarosti in this regard.
 
 This is the point on which I really wanted everybody to focus, i.e. the
 problem of encoding as well as display.

Therefore, you don't fully agree with me.

My opinion is that the encoding is OK as it is in ISCII and Unicode. I take
into consideration your way of splitting the graphemes *only* at the editing
level.

_ Marco




RE: Indic editing (was: RE: The real solution)

2001-12-03 Thread Marco Cimarosti

Oh, by the way, I forgot this...

Arjun Aggarwal wrote:
 Yes, there are many easy solutions. The fact is that these are worth nothing
 until Unicode officially adopts one of them.
 
 This is the ultimate truth, and this was the main point with which I
 initiated this discussion.

Almost every sentence may become the ultimate truth, if you remove enough
context to make it meaningless.

I can say a lot of stupid things on my own, and I don't need anybody's help
to put more stupid things in my mouth. Thanks.

My sentence above referred to a very specific problem: finding a way of
mapping the ISCII sequence RA + HALANT + INV to Unicode.

Here is the sentence in its original context:

Marco Cimarosti wrote:
 Dhrubajyoti Banerjee wrote:
[...]
  Marco Cimarosti wrote:
[...]
  I am talking again about REPHA IN ISOLATION: ISCII has a way of
  representing it, but Unicode does not. This is needed, even only for
  encoding didactic texts, and a solution to encode it (with ZWJ, probably)
  should be found.
  
  I think the same way it is done in ISCII would be quite okay.
  In ISCII you get it by typing the INV character after ra virama.
  A similar solution may be provided for, in Unicode, by using ZW(N)J.
 
 Yes, there are many easy solutions. The fact is that these are worth nothing
 until Unicode officially adopts one of them.

_ Marco




Re: Indic editing (was: RE: The real solution)

2001-11-28 Thread Asmus Freytag

At 12:37 PM 11/27/01 -0800, James Kass wrote:
Isn't that where it belongs?  Default display for isolated combining
marks shows them with the dotted circle.

No it does not. That's an artifact of the Unicode code chart notation.

25CC in many fonts (and in the charts for that matter) looks different
than the dotted circle we are using for the charts.

A./




Re: Indic editing (was: RE: The real solution)

2001-11-28 Thread James Kass

Asmus Freytag wrote,


 At 12:37 PM 11/27/01 -0800, James Kass wrote:
 Isn't that where it belongs?  Default display for isolated combining
 marks shows them with the dotted circle.
 
 No it does not. That's an artifact of the Unicode code chart notation.
 
 25CC in many fonts (and in the charts for that matter) looks different
 than the dotted circle we are using for the charts.
 

In the Baraha Devanagari Unicode font, the repha is a non-spacing glyph.
In MSANGAM.TTF, there are two rephas, both are non-spacing.

Is the repha supposed to be a spacing mark?  If not, doesn't a non-spacing
mark need to be applied to a space or spacing mark to avoid display
problems?

For Bengali, on this system (Win M.E. MSIE 5.5) the default appearance of
U+0981 BENGALI SIGN CANDRABINDU when it appears alone in a cell is
to be displayed atop U+25CC.  This is expected now, but was a bit of a 
surprise at first.  However, U+0982 and U+0983, ANUSVARA and 
VISARGA, are also displayed following U+25CC if they are isolated.  This 
one is arguable.  What is unexpected is that unassigned code points, like
U+0984, are displayed as U+25CC followed by the null or missing glyph.  (The
null glyph alone should be enough for unassigned code points.)

This dotted circle is being added to the display by the system, and happens
when Indic script is being displayed with an OpenType font covering the 
script range.
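
A rough sketch of the fallback described above, not the actual system
behaviour: it assumes the only rule is "a combining mark with nothing in
front of it gets a dotted circle", whereas a real shaping engine validates
whole syllables. The function name `add_dotted_circles` is made up for
illustration.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"


def add_dotted_circles(text):
    """Insert U+25CC before any combining mark that has no base before it."""
    out = []
    prev_is_base = False  # nothing precedes the first character
    for ch in text:
        # Mn and Mc (and Me) all count as marks here.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and not prev_is_base:
            out.append(DOTTED_CIRCLE)
        out.append(ch)
        # Assumption: only control characters (e.g. a line break) cannot
        # carry a following mark.
        prev_is_base = unicodedata.category(ch) != "Cc"
    return "".join(out)
```

Under this assumption an isolated U+0981 (category Mn) and an isolated
U+0983 (category Mc) are both given a dotted circle, which matches the
Bengali behaviour reported above.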

Quoting from Microsoft's OpenType specification page at:
http://www.microsoft.com/typography/otspec/indicot/other.htm

 For the fallback mechanism to work properly, an 
  Indian font should contain a glyph for the dotted 
  circle (U+25CC). In case this glyph is missing from 
  the font, the invalid signs will be displayed on the 
  missing glyph shape (white box). 

(These OpenType for Indic pages at Microsoft may not have been
updated since April 2000, so maybe there's a revision pending.)

At first, this default (fallback mechanism) display looked bad.  The font 
had the dotted circle rather large to match other circles in that same 
Unicode range.  So, is the solution to adjust the appearance of that glyph 
in any OpenType font aimed at Indic, or is there a preferred method? 

Best regards,

James Kass.








RE: Indic editing (was: RE: The real solution)

2001-11-28 Thread Marco Cimarosti

John Hudson wrote:
 Eight keystrokes to replace a single character isn't exactly what I would
 call an efficient solution. [...] Under these conditions, it would be
 simpler to delete the whole word and type it from scratch.
 
 FWIW:
 This is exactly what a lot of people would do, even if only a single, 
 fairly easily selectable character needs changing.

That's what I often do myself when I misspell a short word such as Arjun.

But if I made a small error in a long word, I'd rather go back and edit just
the offending letter. I think that we all want this possibility, and nobody
would appreciate a system where Delete and Backspace delete whole words by
default.

Ken's and my discussion is a sort of slow-motion analysis of what goes on
while typing text. We used a short word just in order to keep the example
short. But feel free to apply the same concepts to cases such as
Bhagavadgitopanishad

 When I'm typing, I'm 
 processing words in my head, not strings of characters, 

And Indian users too shouldn't be forced to process strings of *abstract*
characters in their heads!

But an editing system which directly uses the ISCII/Unicode encoding
elements forces users to understand the details of the algorithm for
rendering complex scripts, and to continuously run this algorithm forwards
and backwards in their heads, in order to understand where they should
place their cursor to delete or enter characters.

I was speculating about how to let the users deal only with the signs of
their script, leaving the task of running algorithms to the computer.

 and it is easier to delete and retype a whole word -- to step back and
 then continue my train of thought -- than to interrupt my thought to
 select an individual character. I don't think efficiency in input can
 necessarily be measured by number of keystrokes.

I did not only compare the number of keystrokes (which, anyway, is a valid
measure of efficiency); I also analyzed the visual effect of each keystroke,
comparing it to the result one would intuitively expect.

What I found is that, in many cases, what happens on the screen after
pressing a key is puzzling, unless one has a firm understanding of the
Unicode character/glyph model, and continuously thinks about this model
while typing.

_ Marco





RE: Indic editing (was: RE: The real solution)

2001-11-28 Thread Michael Everson

At 12:14 +0100 2001-11-28, Marco Cimarosti wrote:

And Indian users too shouldn't be forced to process strings of *abstract*
characters in their heads!

Indian users have been using the ISCII model for decades.

Ever see a Hindi mechanical typewriter layout?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




RE: Indic editing (was: RE: The real solution)

2001-11-28 Thread Asmus Freytag

At 12:32 PM 11/28/01 +0100, Marco Cimarosti wrote:

I don't think that Unicode requires that a non-spacing mark *has* to be
placed on something in order to be displayable. However, some fonts may
choose to represent a stand-alone non-spacing mark as floating on some
default glyph, for either technological or esthetic reasons.

As for example at the beginning of a string.

If it's not at the beginning, it is *always* placed on something, i.e. 
whatever it is preceded by, whether that's intended or not. That's the 
reason for the rule about using a space (or NB space), which can be found 
in section 7.9 (p180 of Unicode 3.0).
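
A small sketch of that rule from the text author's side, assuming the goal
is simply to keep a leading mark from having no base at all. The helper
name `give_base_to_leading_mark` is invented here; the category test uses
Python's standard `unicodedata` module.

```python
import unicodedata

NBSP = "\u00A0"  # NO-BREAK SPACE


def give_base_to_leading_mark(text, base=NBSP):
    """Prefix a space-like base if the text starts with a combining mark."""
    if text and unicodedata.category(text[0]).startswith("M"):
        return base + text
    return text
```

For example, a string consisting only of U+093F DEVANAGARI VOWEL SIGN I
comes back as NBSP followed by the vowel sign, while a string that already
starts with a consonant is returned unchanged.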

A./




RE: Indic editing (was: RE: The real solution)

2001-11-27 Thread John Hudson

At 02:41 11/27/2001, Marco Cimarosti wrote:

Eight keystrokes to replace a single character isn't exactly what I would
call an efficient solution. You have a six character word, and your solution
requires deleting and retyping four of them. Under these conditions, it
would be simpler to delete the whole word and type it from scratch.

FWIW:
This is exactly what a lot of people would do, even if only a single, 
fairly easily selectable character needs changing. When I'm typing, I'm 
processing words in my head, not strings of characters, and it is easier to 
delete and retype a whole word -- to step back and then continue my train 
of thought -- than to interrupt my thought to select an individual 
character. I don't think efficiency in input can necessarily be measured by 
number of keystrokes.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

... es ist ein unwiederbringliches Bild der Vergangenheit,
das mit jeder Gegenwart zu verschwinden droht, die sich
nicht in ihm gemeint erkannte.

... every image of the past that is not recognized by the
present as one of its own concerns threatens to disappear
irretrievably.
   Walter Benjamin





Re: Indic editing (was: RE: The real solution)

2001-11-27 Thread James Kass


Marco Cimarosti wrote,

  
  Or, perhaps U+25D6, the combining circle.  RA+VIRAMA+COMB.CIRC. =
  illustration form for isolated repha?
 
 U+25D6 is LEFT HALF BLACK CIRCLE. Perhaps you meant U+25CC DOTTED CIRCLE?
 

Yep, the tired eyes skipped a line and got the hex equivalent for #9686
instead of #9676.
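
The slip is easy to check with Python's standard `unicodedata` module; the
helper name `describe` is just for this illustration.

```python
import unicodedata


def describe(decimal):
    """Format a decimal code point as 'U+XXXX NAME'."""
    return f"U+{decimal:04X} {unicodedata.name(chr(decimal))}"


print(describe(9686))  # U+25D6 LEFT HALF BLACK CIRCLE
print(describe(9676))  # U+25CC DOTTED CIRCLE
```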

 However, that would not be a repha in isolation: it would be floating on
 some sort of symbol.
 

Isn't that where it belongs?  Default display for isolated combining
marks shows them with the dotted circle.

Best regards,

James Kass.






Indic editing (was: RE: The real solution)

2001-11-26 Thread Marco Cimarosti

As we all know, Unicode is a logical encoding, in the sense that it
assigns codes to abstract characters, rather than to the actual signs
(glyphs) which are visible on a printed page. This design principle has
been chosen because it makes all non-visual text processing much easier.

Recently, this principle has been criticized by Arjun Aggarwal for
Devanagari, on the ground that the elements of a Unicode Devanagari string
do not correspond to the graphic elements of Devanagari text.

Several people have explained in detail how this is not an acceptable
criticism, because Unicode code points are NOT supposed to be displayed with
a direct one-to-one mapping to glyphs. 

I think that this criticism was addressed adequately, as far as the
ENCODING part is concerned, and that it is now Mr. Aggarwal's turn to make
an effort to understand better what he is criticizing.


However, I think that considering only the encoding point of view does not
capture the real reasons behind the discontent that is periodically
expressed by Indian users and engineers.

It has always been my impression that, for a native user of Indic scripts,
it is much more natural to work with visual glyphs.

Why shouldn't it be so? When you write Arjun with a pencil, you trace:
a, j-, danda, -u, repha, n-, danda, exactly in this order.

Who cares if, from the lexicographic point of view, j- plus danda
constitutes a unit? Who cares if, from the phonological point of view, repha
is pronounced before j-? Who cares if, from a logical point of view, repha
is a ra plus a virtual virama?

Yet, from the graphical point of view, that name is spelled using that
sequence of *glyphs*.

Similarly, what the users see on a computer screen are *glyphs*, not
abstract characters. Consequently, they should be enabled to interact with
(enter, modify, delete) the *glyphs*.

How can users be asked to enter, modify or delete objects (such as virama,
ZW(N)J) which are not visible and tangible on the screen? Or how can they
be asked to interact with an entity which is in a certain position,
pretending that it was somewhere else (repha, short i matra)? And why
should it be forbidden to edit visible and tangible objects (such as the
danda at the right side of many letters) on the basis that logically they
do not exist?


See the difference between the name Arjun as coded (c) in terms of Unicode
characters, and as rendered (r) in terms of glyphs (for a visual
representation of this example, see the attached file ARJUN.GIF):

(c)  a  ra  virama  ja  -u  na
(r)  a  j-  danda  -u  repha  n-  danda

Unicode requires that the (c) form be converted to the (r) form before being
displayed. This process is called rendering and, for Devanagari, it could be
summarized in four logical steps:

1:  Convert character codes into glyph codes;
2:  Join some glyphs (e.g.: turn ra + virama into repha);
3:  Reorder some glyphs (e.g.: move repha to its visual position);
4:  Split some glyphs (e.g.: turn full C's into half C's + danda)

(Notice that this is a very schematic algorithm, and that actual
implementations can vary considerably; in particular, steps 1 and 4 may be
dropped.)

In the case of Arjun, the four steps perform the following changes (see
again ARJUN.GIF):

1:  a  ra virama  ja -u na
2:  a  repha  ja -u na
3:  a ja -u  repha  na 
4:  a j-  danda  -u  repha  n-  danda
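
The four steps above can be sketched on symbolic glyph names. This is only
a toy model under stated assumptions: the names ("ja", "j-", "repha", ...)
are the ones used in this message, not real character or glyph codes, the
tables cover just the consonants needed here, and step 1 is the identity.

```python
# Symbolic glyph names follow the message; they are not real glyph IDs.
FULL_TO_HALF = {"ja": "j-", "na": "n-", "la": "l-"}  # consonants with a danda
CONSONANTS = set(FULL_TO_HALF) | {"ra"}
MATRAS = {"-u"}


def join(glyphs):
    """Step 2: consonant + virama before a consonant becomes one glyph."""
    out, i = [], 0
    while i < len(glyphs):
        nxt = glyphs[i + 1:i + 3]
        dead = len(nxt) == 2 and nxt[0] == "virama" and nxt[1] in CONSONANTS
        if dead and glyphs[i] == "ra":
            out.append("repha")
            i += 2
        elif dead and glyphs[i] in FULL_TO_HALF:
            out.append(FULL_TO_HALF[glyphs[i]])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def reorder(glyphs):
    """Step 3: move each repha after the next consonant and its matras."""
    out, i = list(glyphs), 0
    while i < len(out):
        if out[i] == "repha":
            j = i + 1  # the consonant that carries the repha
            while j + 1 < len(out) and out[j + 1] in MATRAS:
                j += 1  # plus any vowel signs attached to it
            out.insert(j, out.pop(i))
            i = j + 1
        else:
            i += 1
    return out


def split(glyphs):
    """Step 4: turn each full consonant into half form + danda."""
    out = []
    for g in glyphs:
        out += [FULL_TO_HALF[g], "danda"] if g in FULL_TO_HALF else [g]
    return out


def render(chars):
    glyphs = list(chars)  # step 1: identity in this symbolic model
    return split(reorder(join(glyphs)))
```

Calling `render(["a", "ra", "virama", "ja", "-u", "na"])` reproduces line 4
of the example; clusters longer than two consonants are not handled.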


So far so good: I see Arjun on the screen.

But what if now I want to change Arjun into, say, Aljun? By the
logical point of view, I should simply delete the ra and enter a la in the
same position.

But, on my screen, there is no ra at all! Moreover, there is no consonant at
all before the ja, because the group ra+virama is displayed as a combining
repha AFTER the j+danda+u group.

Looking at the screen, the natural thing to do is to move to the repha and
delete it, then move between the a and the ja and insert a half la.

In order to accomplish a WYSIWYG editing of this kind, Unicode text should
first be converted to a TEMPORARY INTERMEDIATE FORM, less logical and
more visual.

In the case of Devanagari, a glyphic representation quite similar to the old
font encodings should be used. With such an intermediate code, the user
should be enabled to select and delete the danda of a letter to form a half
letter, to enter or delete a matra i or a repha by placing the cursor in
their visual position, and so on.

The algorithm to convert Unicode to this intermediate glyphic representation
already exists, and it is the four steps that I described above, which are
now part of rendering engines and smart fonts.

The difference is that this algorithm should be run BEFORE going into the
visualization phase.

The big difference is that editing actions should be executed on this
intermediate code and, therefore, there is a need for a DErendering
algorithm, which converts a portion of visual text back to real Unicode.
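
A toy sketch of what such a DErendering step might look like, assuming the
visual form is a list of symbolic glyph names as used in this message
("j-", "danda", "repha", ...). The function name `derender` and the small
tables are invented for illustration; real text would need far more cases.

```python
# Symbolic names only; the tables cover just the glyphs used in the example.
HALF_TO_FULL = {"j-": "ja", "n-": "na", "l-": "la"}
MATRAS = {"-u"}


def derender(glyphs):
    """Convert a visual glyph sequence back to logical (encoded) order."""
    # Undo the split: half form + danda is a full consonant, while a bare
    # half form stands for consonant + virama.
    merged, i = [], 0
    while i < len(glyphs):
        g = glyphs[i]
        if g in HALF_TO_FULL and glyphs[i + 1:i + 2] == ["danda"]:
            merged.append(HALF_TO_FULL[g])
            i += 2
        elif g in HALF_TO_FULL:
            merged += [HALF_TO_FULL[g], "virama"]
            i += 1
        else:
            merged.append(g)
            i += 1
    # Undo the reordering and joining: put each repha back in front of the
    # consonant that carries it, spelled as ra + virama.
    out = []
    for g in merged:
        if g == "repha":
            k = len(out)
            while k > 0 and out[k - 1] in MATRAS:
                k -= 1  # skip back over vowel signs
            k = max(k - 1, 0)  # step over the consonant itself
            out[k:k] = ["ra", "virama"]
        else:
            out.append(g)
    return out
```

With this, the visual edit described above (delete the repha, insert a half
la between a and ja) yields the logical sequence a, la, virama, ja, -u, na.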


A very similar thing 

Re: Indic editing (was: RE: The real solution)

2001-11-26 Thread Kenneth Whistler

Marco wrote:

 
 In the case of Arjun, the four steps perform the following changes (see
 again ARJUN.GIF):
 
 1:  a  ra virama  ja -u na
 2:  a  repha  ja -u na
 3:  a ja -u  repha  na 
 4:  a j-  danda  -u  repha  n-  danda
 
 
 So far so good: I see Arjun on the screen.
 
 But what if now I want to change Arjun into, say, Aljun? By the
 logical point of view, I should simply delete the ra and enter a la in the
 same position.
 
 But, on my screen, there is no ra at all! Moreover, there is no consonant at
 all before the ja, because the group ra+virama is displayed as a combining
 repha AFTER the j+danda+u group.
 
 Looking at the screen, the natural thing to do is to move to the repha and
 delete it, then move between the a and the ja and insert a half la.

Actually, I would disagree with this. Trying to select and edit a
repha, or any other mark above or below another letter is a pain,
both to implement and from the point of view of a user trying to
work with selection.

My answer to this is that the natural thing to do is to cursor down
before the na to get an insertion point. Then:

   backspace backspace backspace backspace la virama ja u

Or, in terms of backing store:

   a  ra  virama  ja  -u  |  na
   a  ra  virama  ja  |  na
   a  ra  virama  |  na
   a  ra |  na
   a  |  na
   a  la |  na
   a  la  virama |  na
   a  la  virama  ja |  na
   a  la  virama  ja  -u  |  na

And I'm done. 8 keystrokes after the cursor down, but more efficient
than trying to mess with selecting the repha.
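
The backing-store trace above can be replayed with a small simulation. This
is only a sketch: the store is a list of the symbolic names used in this
thread, and `apply_keys` is a made-up helper in which "backspace" removes
one element before the cursor and any other key inserts itself.

```python
def apply_keys(store, cursor, keys):
    """Apply keystrokes to a list-based backing store.

    Returns the new store and the new cursor position.
    """
    store = list(store)  # work on a copy
    for key in keys:
        if key == "backspace":
            if cursor > 0:
                cursor -= 1
                del store[cursor]
        else:
            store.insert(cursor, key)
            cursor += 1
    return store, cursor


arjun = ["a", "ra", "virama", "ja", "-u", "na"]
keys = ["backspace"] * 4 + ["la", "virama", "ja", "-u"]
aljun, cursor = apply_keys(arjun, 5, keys)  # cursor 5 = just before "na"
print(len(keys), aljun)
```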

Consider how often people will correct spelling errors, for example,
by backspacing and retyping, rather than trying to select to a
specific spot to correct and then having to reselect back to the
original spot to continue. It is simply more efficient to do it
this way.

And the above example could be even more efficient if the editing
system implemented the backspace/erase function to clobber syllable
parts (or grapheme clusters) instead of character at a time. But
you also have to consider ergonomic issues there. It may introduce
inefficiencies and mistakes if a backspace/erase deletes more
characters than one keystroke's worth of entry. One principle of
low-level editing (without IME input/select/commit operations)
ought generally to be: key key key erase erase erase
should leave you with no change to text.
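
One way to honour that principle, sketched under the assumption that typing
only ever appends at the end of the buffer: remember how many code points
each keystroke added, and let erase remove exactly that many. The class
`KeystrokeBuffer` and the symbolic code point names are invented for
illustration.

```python
class KeystrokeBuffer:
    """Append-only buffer where one erase undoes exactly one keystroke."""

    def __init__(self):
        self.text = []   # backing store, one code point name per item
        self.sizes = []  # how many code points each keystroke added

    def key(self, code_points):
        """Record one keystroke that enters one or more code points."""
        code_points = list(code_points)
        self.text.extend(code_points)
        self.sizes.append(len(code_points))

    def erase(self):
        """Remove whatever the most recent keystroke entered."""
        if self.sizes:
            n = self.sizes.pop()
            if n:  # guard: del text[-0:] would clear the whole buffer
                del self.text[-n:]


buf = KeystrokeBuffer()
buf.key(["a"])
before = list(buf.text)
buf.key(["ka", "virama", "ssa"])  # a single key entering a conjunct
buf.key(["-u"])
buf.key(["na"])
for _ in range(3):
    buf.erase()
```

After three keys and three erases the buffer holds the same text as before,
even though one of the keys entered three code points.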

 
 In order to accomplish a WYSIWYG editing of this kind, Unicode text should
 first be converted to a TEMPORARY INTERMEDIATE FORM, less logical and
 more visual.

I'm not suggesting that this isn't also a possible approach to implementing
Devanagari editing -- just that the issue of what a user does to
deal with editing existing text, under the current Unicode model,
isn't that big a deal for repha and its ilk. On the other hand, the
reordrant vowels might well lend themselves to editor extensions that
work in a visual mode as well as a logical mode.

--Ken