At 02:13 -0800 2003-01-29, Keyur Shroff wrote:
I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point.
Yes, it does.

India is a big country with millions of people, geographically divided and speaking a variety of languages. Sentiments are attached to cultures, which may vary from one geographical area to another. So when one of the many languages written in the same script dominates the entire encoding for the script, other groups of people may feel that their language has not been represented properly in the encoding.
A lot of these "feelings" are simply WRONG, and that has to be faced. The syllable KSSA may be treated as a single letter, but this does not change the fact that it is a ligature of KA and SSA and that it can be represented in Unicode by a string of three characters (KA, VIRAMA, SSA).
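
To make the three-character representation concrete, here is a minimal Python sketch using only the standard unicodedata module. The variable names are my own, chosen for illustration.

```python
import unicodedata

# KSSA spelled as a sequence: KA + VIRAMA + SSA
kssa = "\u0915\u094D\u0937"

# Look up the Unicode character name of each code point in the sequence.
names = [unicodedata.name(ch) for ch in kssa]

for ch, name in zip(kssa, names):
    print(f"U+{ord(ch):04X} {name}")
```

A font with the appropriate ligature renders this sequence as the single KSSA glyph; the encoded text is still three code points.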

In Unicode many characters have been given codepoints regardless of the
fact that the same character could have been rendered through some compose
mechanism. This includes Indic scripts as well as other scripts. For
example, in Devanagari script some code points are allocated to characters
(ConsonantNukta) even though the same characters could be produced with
a combination of the consonant and Nukta.
There are historical and compatibility reasons that most of this stuff, as well as the similar stuff in the Latin range, was encoded. At one point some years ago the line was drawn, normalization was enacted, and that was that.
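
A rough illustration of what normalization does with the precomposed nukta consonants, again in Python with the standard unicodedata module. The two characters picked here (QA and NNNA) are just examples, and the results depend on the Unicode data shipped with your Python.

```python
import unicodedata

QA = "\u0958"               # DEVANAGARI LETTER QA (precomposed)
KA_NUKTA = "\u0915\u093C"   # KA + NUKTA
NNNA = "\u0929"             # DEVANAGARI LETTER NNNA (precomposed)
NA_NUKTA = "\u0928\u093C"   # NA + NUKTA

# Both precomposed letters decompose canonically to consonant + nukta.
qa_nfd = unicodedata.normalize("NFD", QA)
nnna_nfd = unicodedata.normalize("NFD", NNNA)

# QA is a composition exclusion, so even NFC leaves it as the sequence.
qa_nfc = unicodedata.normalize("NFC", QA)

# NNNA is not excluded, so NFC recomposes NA + NUKTA into U+0929.
nnna_nfc = unicodedata.normalize("NFC", NA_NUKTA)
```

Either way, the precomposed letter and the sequence are canonically equivalent, so a conformant process should treat them as the same text.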

Also, many times processing of text depends on the smallest addressable
unit of that language. Again as discussed in earlier e-mails this may vary
from one language to another in the same script. Consider a case when a
language processor/application wants to count the number of characters in
some text in order to find the number of keystrokes required to input the text.
I can't think of any reason why this would be useful. And what if you were not typing, but speaking to your computer? Then there would be no keystrokes at all!

Further assume that API functions used for this purpose are based on either
WChar (wide characters) or UTF-8. In this case it is very much necessary
that you assign the character, say Kssha, to the class "consonant". Since
assignment to this class "consonant" applies to a single code point (the
smallest addressable unit) and not to the sequence of codes, it is very
much necessary to have a single code point for the character "Kssha".
We are not going to encode KSSA as a single character. It is a ligature of KA and SSA, and can already be represented in Unicode. You need to handle this "consonant" issue with some other protocol.
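
As one possible sketch of such a protocol (my own assumption, not anything defined by the Unicode Standard), an application can keep a table of sequences that its language treats as single letters and classify those sequences as units. The names SEQUENCE_CLASSES, is_consonant_letter, and classify_units below are hypothetical.

```python
KSSA = "\u0915\u094D\u0937"  # KA + VIRAMA + SSA

# Hypothetical, language-specific table: sequences treated as one letter.
SEQUENCE_CLASSES = {KSSA: "consonant"}


def is_consonant_letter(ch):
    """True for the Devanagari consonants KA..HA (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def classify_units(text):
    """Split text into units, matching listed sequences before single code points."""
    units = []
    i = 0
    while i < len(text):
        for seq, cls in SEQUENCE_CLASSES.items():
            if text.startswith(seq, i):
                units.append((seq, cls))
                i += len(seq)
                break
        else:
            # No listed sequence starts here: classify the single code point.
            ch = text[i]
            cls = "consonant" if is_consonant_letter(ch) else "other"
            units.append((ch, cls))
            i += 1
    return units
```

With this, classify_units(KSSA + "\u0915") yields two units, both classed "consonant", without any new code point being needed.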
--
Michael Everson * * Everson Typography * * http://www.evertype.com
