RE: Indic editing (was: RE: The real solution)
A dead thread, but worth noting: On Tue, 18 Dec 2001, Marco Cimarosti wrote: Would you kindly tell me how I can construct such input methods and ultimately create fonts. Er... It is not so easy to do this kind of thing yourself. You should buy (or otherwise get) software that properly supports Devanagari. You can also get Pango (http://www.pango.org/). It's a Free library that supports Unicode's Devanagari and other Indic scripts on both Linux and Windows. roozbeh
Re: Indic editing (was: RE: The real solution)
Hi Everybody The statement by Mr. John Hudson that phonetic keyboarding, while the norm for the Indian publishing and typesetting industries, was not the norm for typewriters is not entirely correct. It was not the norm earlier, but it has been the norm for many years now. Moreover, the concept of la = half la + danda may be natural for people who are used to typewriters and typography. That is, some of the people who are more likely to switch to computers. I fully agree with Mr. Marco Cimarosti in this regard. This is the point on which I really wanted everybody to focus, i.e. the problem of encoding as well as display. Yes, there are many easy solutions. The fact is that these are worth nothing until Unicode officially adopts one of them. This is the ultimate truth, and this was the main point with which I initiated this discussion. With Regards Arjun Aggarwal [EMAIL PROTECTED]
Re: Indic editing (was: RE: The real solution)
From: Arjun Aggarwal [EMAIL PROTECTED] Moreover, the concept of la = half la + danda may be natural for people who are used to typewriters and typography. That is, some of the people who are more likely to switch to computers. I fully agree with Mr. Marco Cimarosti in this regard. This is the point on which I really wanted everybody to focus, i.e. the problem of encoding as well as display. Well, you do need to understand that you could actually create input methods that would allow people who wish to type this way to do so -- and the underlying data could still be stored using the current encoding. The needs of those who wish to keep their keyboards can be met without trying to undo all the implementations that have been done. -- MichKa Michael Kaplan Trigeminal Software, Inc. -- http://www.trigeminal.com/
RE: Indic editing (was: RE: The real solution)
Arjun Aggarwal wrote: Moreover, the concept of la = half la + danda may be natural for people who are used to typewriters and typography. That is, some of the people who are more likely to switch to computers. I fully agree with Mr. Marco Cimarosti in this regard. This is the point on which I really wanted everybody to focus, i.e. the problem of encoding as well as display. Therefore, you don't fully agree with me. My opinion is that the encoding is OK as it is in ISCII and Unicode. I take your way of splitting the graphemes into consideration *only* at the editing level. _ Marco
RE: Indic editing (was: RE: The real solution)
Oh, by the way, I forgot this... Arjun Aggarwal wrote: Yes, there are many easy solutions. The fact is that these are worth nothing until Unicode officially adopts one of them. This is the ultimate truth, and this was the main point with which I initiated this discussion. Almost every sentence may become the ultimate truth, if you remove enough context to make it meaningless. I can say a lot of stupid things on my own, and I don't need anybody's help to put more stupid things in my mouth. Thanks. My sentence above referred to a very specific problem: finding a way of mapping the ISCII sequence RA + HALANT + INV to Unicode. Here is the sentence in its original context: Marco Cimarosti wrote: Dhrubajyoti Banerjee wrote: [...] Marco Cimarosti wrote: [...] I am talking again about REPHA IN ISOLATION: ISCII has a way of representing it, but Unicode does not. This is needed, even if only for encoding didactic texts, and a solution to encode it (with ZWJ, probably) should be found. I think the same way it is done in ISCII would be quite okay. In ISCII you get it by typing the INV character after ra virama. A similar solution may be provided in Unicode by using ZW(N)J. Yes, there are many easy solutions. The fact is that these are worth nothing until Unicode officially adopts one of them. _ Marco
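The ZW(N)J idea discussed here can be illustrated with a small sketch. Note that this shows only one *proposed* mapping for the ISCII repha-in-isolation sequence, not anything Unicode has officially adopted; the function name is mine.

```python
# Sketch of the proposed (NOT officially adopted) mapping of the ISCII
# sequence RA + HALANT + INV, i.e. "repha in isolation", to Unicode.
RA = "\u0930"      # DEVANAGARI LETTER RA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (the ISCII halant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER, playing the role of ISCII's INV

def isolated_repha() -> str:
    """Candidate Unicode spelling for a bare repha sign."""
    return RA + VIRAMA + ZWJ
```

Whether a renderer actually shows this sequence as a lone repha depends entirely on the shaping engine and font; that is precisely the point of the thread.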
Re: Indic editing (was: RE: The real solution)
At 12:37 PM 11/27/01 -0800, James Kass wrote: Isn't that where it belongs? Default display for isolated combining marks shows them with the dotted circle. No, it does not. That's an artifact of the Unicode code chart notation. 25CC in many fonts (and in the charts for that matter) looks different from the dotted circle we are using for the charts. A./
Re: Indic editing (was: RE: The real solution)
Asmus Freytag wrote, At 12:37 PM 11/27/01 -0800, James Kass wrote: Isn't that where it belongs? Default display for isolated combining marks shows them with the dotted circle. No, it does not. That's an artifact of the Unicode code chart notation. 25CC in many fonts (and in the charts for that matter) looks different from the dotted circle we are using for the charts. In the Baraha Devanagari Unicode font, the repha is a non-spacing glyph. In MSANGAM.TTF, there are two rephas; both are non-spacing. Is the repha supposed to be a spacing mark? If not, doesn't a non-spacing mark need to be applied to a space or spacing mark to avoid display problems? For Bengali, on this system (Win M.E., MSIE 5.5) the default appearance of U+0981 BENGALI SIGN CANDRABINDU when it appears alone in a cell is to be displayed atop U+25CC. This is expected now, but was a bit of a surprise at first. However, U+0982 and U+0983, ANUSVARA and VISARGA, are also displayed following U+25CC if they are isolated. This one is arguable. The unexpected part is that unassigned code points, like U+0984, are displayed as U+25CC followed by the null or missing glyph. (The null glyph alone should be enough for unassigned code points.) This dotted circle is being added to the display by the system, and this happens when Indic script is being displayed with an OpenType font covering the script range. Quoting from Microsoft's OpenType specification page at: http://www.microsoft.com/typography/otspec/indicot/other.htm For the fallback mechanism to work properly, an Indian font should contain a glyph for the dotted circle (U+25CC). In case this glyph is missing from the font, the invalid signs will be displayed on the missing glyph shape (white box). (These OpenType for Indic pages at Microsoft may not have been updated since April 2000, so maybe there's a revision pending.) At first, this default (fallback mechanism) display looked bad. The font had the dotted circle rather large, to match other circles in that same Unicode range.
So, is the solution to adjust the appearance of that glyph in any OpenType font aimed at Indic, or is there a preferred method? Best regards, James Kass.
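The fallback mechanism quoted from the OpenType specification can be sketched roughly as follows. Real shaping engines do full cluster validation, so this is only an illustration of the simplest case mentioned in the thread: a combining mark with no base character at all.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def with_fallback_bases(text: str) -> str:
    """Toy approximation of the dotted-circle fallback: if the text
    starts with a combining mark (general category M*), supply U+25CC
    as a dummy base.  Mid-string marks already sit on whatever
    precedes them, so they are left alone here."""
    if text and unicodedata.category(text[0]).startswith("M"):
        return DOTTED_CIRCLE + text
    return text
```

A real engine would also insert the dotted circle inside syntactically invalid clusters (e.g. two viramas in a row), which this sketch does not attempt.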
RE: Indic editing (was: RE: The real solution)
John Hudson wrote: Eight keystrokes to replace a single character isn't exactly what I would call an efficient solution. [...] Under these conditions, it would be simpler to delete the whole word and retype it from scratch. FWIW: This is exactly what a lot of people would do, even if only a single, fairly easily selectable character needs changing. That's what I often do myself when I misspell a short word such as Arjun. But if I made a small error in a long word, I'd rather go back and edit just the offending letter. I think that we all want this possibility, and nobody would appreciate a system where Delete and Backspace delete whole words by default. Ken's and my discussion is a sort of slow-motion analysis of what goes on while typing text. We used a short word just in order to keep the example short. But feel free to apply the same concepts to cases such as Bhagavadgitopanishad. When I'm typing, I'm processing words in my head, not strings of characters, And Indian users too shouldn't be forced to process strings of *abstract* characters in their heads! But an editing system which directly uses the ISCII/Unicode encoding elements forces users to understand the details of the algorithm for rendering complex scripts, and to continuously run this algorithm forwards and backwards in their heads, in order to understand where they should place their cursor to delete or enter characters. I was speculating about how to leave the users alone with the signs of their script, leaving the task of running algorithms to the computer. and it is easier to delete and retype a whole word -- to step back and then continue my train of thought -- than to interrupt my thought to select an individual character. I don't think efficiency in input can necessarily be measured by number of keystrokes.
I did not only compare the number of keystrokes (which, anyway, is a valid measure of efficiency); I also analyzed the visual effect of each keystroke, comparing it to the result that is intuitively expected. What I found is that, in many cases, what happens on the screen after pressing a key is puzzling, unless one has a firm understanding of the Unicode character/glyph model, and continuously thinks about this model while typing. _ Marco
RE: Indic editing (was: RE: The real solution)
At 12:14 +0100 2001-11-28, Marco Cimarosti wrote: And Indian users too shouldn't be forced to process strings of *abstract* characters into their heads! Indian users have been using the ISCII model for decades. Ever see a Hindi mechanical typewriter layout? -- Michael Everson *** Everson Typography *** http://www.evertype.com
RE: Indic editing (was: RE: The real solution)
At 12:32 PM 11/28/01 +0100, Marco Cimarosti wrote: I don't think that Unicode requires that a non-spacing mark *has* to be placed on something in order to be displayable. However, some fonts may choose to represent a stand-alone non-spacing mark as floating on some default glyph, for either technological or aesthetic reasons. As for example at the beginning of a string. If it's not at the beginning, it is *always* placed on something, i.e. whatever it is preceded by, whether that's intended or not. That's the reason for the rule about using a space (or NB space), which can be found in section 7.9 (p. 180 of Unicode 3.0). A./
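The rule Asmus cites (applying an isolated combining mark to SPACE or NO-BREAK SPACE) looks like this in practice; NO-BREAK SPACE is often preferred so the pair cannot be split across a line break. The variable names are mine.

```python
NBSP = "\u00A0"          # NO-BREAK SPACE, serving as a blank base
CANDRABINDU = "\u0901"   # DEVANAGARI SIGN CANDRABINDU, a non-spacing mark

# Per the rule in Unicode 3.0 section 7.9: to display a combining mark
# "in isolation", apply it to a space character rather than leaving it
# to fall on whatever happens to precede it.
isolated_mark = NBSP + CANDRABINDU
```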
RE: Indic editing (was: RE: The real solution)
At 02:41 11/27/2001, Marco Cimarosti wrote: Eight keystrokes to replace a single character isn't exactly what I would call an efficient solution. You have a six character word, and your solution requires deleting and retyping four of them. Under these conditions, it would be simpler to delete the whole word and retype it from scratch. FWIW: This is exactly what a lot of people would do, even if only a single, fairly easily selectable character needs changing. When I'm typing, I'm processing words in my head, not strings of characters, and it is easier to delete and retype a whole word -- to step back and then continue my train of thought -- than to interrupt my thought to select an individual character. I don't think efficiency in input can necessarily be measured by number of keystrokes. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] ... es ist ein unwiederbringliches Bild der Vergangenheit, das mit jeder Gegenwart zu verschwinden droht, die sich nicht in ihm gemeint erkannte. ... every image of the past that is not recognized by the present as one of its own concerns threatens to disappear irretrievably. Walter Benjamin
Re: Indic editing (was: RE: The real solution)
Marco Cimarosti wrote, Or, perhaps U+25D6, the combining circle. RA+VIRAMA+COMB.CIRC. = illustration form for isolated repha? U+25D6 is LEFT HALF BLACK CIRCLE. Perhaps you meant U+25CC DOTTED CIRCLE? Yep, the tired eyes skipped a line and got the hex equivalent for #9686 instead of #9676. However, that would not be a repha in isolation: it would be floating on some sort of symbol. Isn't that where it belongs? Default display for isolated combining marks shows them with the dotted circle. Best regards, James Kass.
Indic editing (was: RE: The real solution)
As we all know, Unicode is a logical encoding, in the sense that it assigns codes to abstract characters, rather than to the actual signs (glyphs) which are visible on a printed page. This design principle was chosen because it makes all non-visual text processing much easier. Recently, this principle has been criticized by Arjun Aggarwal for Devanagari, on the grounds that the elements of a Unicode Devanagari string do not correspond to the graphic elements of Devanagari text. Several people have explained in detail why this is not an acceptable criticism, because Unicode code points are NOT supposed to be displayed with a direct one-to-one mapping to glyphs. I think that this criticism was addressed adequately, as far as the ENCODING part is concerned, and that it is now Mr. Aggarwal's turn to make an effort to better understand what he is criticizing. However, I think that considering only the encoding point of view does not capture the real reasons behind the discontent that is periodically expressed by Indian users and engineers. It has always been my impression that, for a native user of Indic scripts, it is much more natural to work with visual glyphs. Why shouldn't it be so? When you write Arjun with a pencil, you trace: a, j-, danda, -u, repha, n-, danda, exactly in this order. Who cares if, from the lexicographic point of view, j- plus danda constitutes a unit? Who cares if, from the phonological point of view, repha is pronounced before j-? Who cares if, from a logical point of view, repha is a ra plus a virtual virama? Yet, from the graphical point of view, that name is spelled using that sequence of *glyphs*. Similarly, what the users see on a computer screen are *glyphs*, not abstract characters. Consequently, they should be enabled to interact with (enter, modify, delete) the *glyphs*. How can users be asked to enter, modify or delete objects (such as virama, ZW(N)J) which are not visible and tangible on the screen?
Or how can they be asked to interact with an entity which is in a certain position, pretending that it is somewhere else (repha, short i matra)? And why should it be forbidden to edit visible and tangible objects (such as the danda at the right side of many letters) on the basis that, logically, they do not exist? See the difference between the name Arjun as coded (©) in terms of Unicode characters, and as rendered (®) in terms of glyphs (for a visual representation of this example, see the attached file ARJUN.GIF): © a ra virama ja -u na ® a j- danda -u repha n- danda Unicode requires that the © form be converted to the ® form before being displayed. This process is called rendering and, for Devanagari, it can be summarized in four logical steps: 1: Convert character codes into glyph codes; 2: Join some glyphs (e.g.: turn ra + virama into repha); 3: Reorder some glyphs (e.g.: move repha to its visual position); 4: Split some glyphs (e.g.: turn full C's into half C's + danda). (Notice that this is a very schematic algorithm, and that actual implementations can vary considerably; in particular, steps 1 and 4 may be dropped.) In the case of Arjun, the four steps perform the following changes (see again ARJUN.GIF): 1: a ra virama ja -u na 2: a repha ja -u na 3: a ja -u repha na 4: a j- danda -u repha n- danda So far so good: I see Arjun on the screen. But what if now I want to change Arjun into, say, Aljun? From the logical point of view, I should simply delete the ra and enter a la in the same position. But, on my screen, there is no ra at all! Moreover, there is no consonant at all before the ja, because the group ra+virama is displayed as a combining repha AFTER the j+danda+u group. Looking at the screen, the natural thing to do is to move to the repha and delete it, then move between the a and the ja and insert a half la.
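The four rendering steps can be sketched as a toy function. The glyph names ("repha", "j-", "danda", ...) are placeholders taken from the Arjun example, not any real font's glyph inventory, and the reordering logic handles only this kind of input; it is meant purely to make the four steps concrete.

```python
# Toy sketch of the four-step Devanagari rendering outlined above.
RA, VIRAMA = "ra", "virama"
FULL_CONSONANTS = {"ja": "j-", "na": "n-"}   # full form -> half form
MATRAS = {"-u"}

def render(chars):
    # Step 1: character codes -> glyph codes (identity in this toy model).
    glyphs = list(chars)
    # Step 2: join ra + virama into a single repha glyph.
    out, i = [], 0
    while i < len(glyphs):
        if glyphs[i] == RA and i + 1 < len(glyphs) and glyphs[i + 1] == VIRAMA:
            out.append("repha")
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    glyphs = out
    # Step 3: move repha to its visual position, after the following
    # consonant and its matras (one syllable in this toy model).
    for i, g in enumerate(glyphs):
        if g == "repha":
            j = i + 1
            if j < len(glyphs) and glyphs[j] in FULL_CONSONANTS:
                j += 1
                while j < len(glyphs) and glyphs[j] in MATRAS:
                    j += 1
            glyphs[i:j] = glyphs[i + 1:j] + ["repha"]
            break
    # Step 4: split full consonants into half form + danda.
    out = []
    for g in glyphs:
        if g in FULL_CONSONANTS:
            out += [FULL_CONSONANTS[g], "danda"]
        else:
            out.append(g)
    return out

# render(["a", "ra", "virama", "ja", "-u", "na"]) reproduces the ® form:
# ["a", "j-", "danda", "-u", "repha", "n-", "danda"]
```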
In order to accomplish WYSIWYG editing of this kind, Unicode text should first be converted to a TEMPORARY INTERMEDIATE FORM, less logical and more visual. In the case of Devanagari, a glyphic representation quite similar to the old font encodings could be used. With such an intermediate code, the user would be enabled to select and delete the danda of a letter to form a half letter, to enter or delete a matra i or a repha by placing the cursor in its visual position, and so on. The algorithm to convert Unicode to this intermediate glyphic representation already exists: it is the four-step process that I described above, which is now part of rendering engines and smart fonts. The difference is that this algorithm would have to be run BEFORE going into the visualization phase. The big difference is that editing actions would be executed on this intermediate code and, therefore, there is a need for a DE-rendering algorithm, which converts a portion of visual text back to real Unicode. A very similar thing
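A de-rendering step of the kind proposed here would invert the four rendering steps. This toy version knows only about the glyphs of the Arjun example (names are placeholders, as before), and is meant only to show that the inversion is mechanical in simple cases:

```python
# Toy sketch of the proposed DE-rendering: visual glyphs -> logical
# Unicode-style sequence.  Handles only the Arjun-style example.
FULL_FORMS = {"j-": "ja", "n-": "na"}   # half form -> full form
CLUSTER = {"ja", "-u"}                  # glyphs a repha may have jumped over

def derender(glyphs):
    # Invert step 4: half form + danda -> full consonant.
    out, i = [], 0
    while i < len(glyphs):
        g = glyphs[i]
        if g in FULL_FORMS and i + 1 < len(glyphs) and glyphs[i + 1] == "danda":
            out.append(FULL_FORMS[g])
            i += 2
        else:
            out.append(g)
            i += 1
    # Invert steps 3 and 2: move repha back before its syllable and
    # restore it as ra + virama.
    if "repha" in out:
        i = out.index("repha")
        j = i
        while j > 0 and out[j - 1] in CLUSTER:
            j -= 1
        del out[i]
        out[j:j] = ["ra", "virama"]
    return out
```

Round-tripping the ® form of Arjun through this function recovers the © form, which is the invariant such an editor would have to maintain for every edit.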
Re: Indic editing (was: RE: The real solution)
Marco wrote: In the case of Arjun, the four steps perform the following changes (see again ARJUN.GIF): 1: a ra virama ja -u na 2: a repha ja -u na 3: a ja -u repha na 4: a j- danda -u repha n- danda So far so good: I see Arjun on the screen. But what if now I want to change Arjun into, say, Aljun? From the logical point of view, I should simply delete the ra and enter a la in the same position. But, on my screen, there is no ra at all! Moreover, there is no consonant at all before the ja, because the group ra+virama is displayed as a combining repha AFTER the j+danda+u group. Looking at the screen, the natural thing to do is to move to the repha and delete it, then move between the a and the ja and insert a half la. Actually, I would disagree with this. Trying to select and edit a repha, or any other mark above or below another letter, is a pain, both to implement and from the point of view of a user trying to work with selection. My answer to this is that the natural thing to do is to move the cursor to just before the na to get an insertion point. Then: backspace backspace backspace backspace la virama ja u Or, in terms of backing store: a ra virama ja -u | na a ra virama ja | na a ra virama | na a ra | na a | na a la | na a la virama | na a la virama ja | na a la virama ja -u | na And I'm done. 8 keystrokes after the cursor move, but more efficient than trying to mess with selecting the repha. Consider how often people will correct spelling errors, for example, by backspacing and retyping, rather than trying to select to a specific spot to correct and then having to reselect back to the original spot to continue. It is simply more efficient to do it this way. And the above example could be even more efficient if the editing system implemented the backspace/erase function to clobber syllable parts (or grapheme clusters) instead of one character at a time. But you also have to consider ergonomic issues there.
It may introduce inefficiencies and mistakes if a backspace/erase deletes more characters than one keystroke's worth of entry. One principle of low-level editing (without IME input/select/commit operations) ought generally to be: key key key erase erase erase should leave you with no change to text. In order to accomplish a WYSIWYG editing of this kind, Unicode text should be preventively converted to a TEMPORARY INTERMEDIATE FORM, less logic and more visual. I'm not suggesting that this isn't also a possible approach to implementing Devanagari editing -- just that the issue of what a user does to deal with editing existing text, under the current Unicode model, isn't that big a deal for repha and its ilk. On the other hand, the reordrant vowels might well lend themselves to editor extensions that work in a visual mode as well as a logical mode. --Ken
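Ken's backing-store trace can be modelled with a trivial buffer in which each backspace removes exactly one stored character, which is what makes his "key key key erase erase erase" principle hold. The function and names are mine, not from any real editor:

```python
def edit(chars, ops):
    """Apply a sequence of keystrokes to a list of stored characters.
    'BS' is backspace (remove one stored character); any other op is a
    typed character appended at the cursor."""
    buf = list(chars)
    for op in ops:
        if op == "BS":
            if buf:
                buf.pop()
        else:
            buf.append(op)
    return buf

# Ken's Arjun -> Aljun trace, with the cursor sitting just before "na":
before_na = ["a", "ra", "virama", "ja", "-u"]
result = edit(before_na, ["BS", "BS", "BS", "BS", "la", "virama", "ja", "-u"])
# result is the backing store "a la virama ja -u" (the trailing "na" follows it)
```

A syllable-clobbering backspace would replace the single `buf.pop()` with the removal of a whole grapheme cluster, which is exactly where the ergonomic concern about "key key key erase erase erase" arises.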