--- Asmus Freytag <[EMAIL PROTECTED]> wrote:
> >All of the above can be composed through the following consonant clusters:
> >  jna  -> ja halant nya
> >  shra -> sha halant ra
> >  ksh  -> ka halant ssha
> >
> >The point that the above sequences are considered as characters in some of
> >the Indian languages has merit. If there is demand from native speakers,
> >then a proposal can be submitted to Unicode. There is a predefined
> >procedure for proposal submission. Once this is discussed with the
> >concerned people and agreed upon, these ligatures can be added to the
> >Devanagari script itself, because the Devanagari script represents all
> >three languages you mentioned, namely Sanskrit, Marathi, and Hindi.
> >Meanwhile you can write rules for composing them from the consonant
> >clusters.
>
> I wouldn't go so far. The fact that clusters belong together is something
> that can be handled by the software. Collation and other data processing
> needs to deal with such issues already for many other languages. See
> http://www.unicode.org/reports/tr10 on the collation algorithm.
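For reference, the cluster sequences quoted above can be spelled out as explicit code point sequences. This is a minimal Python sketch; the code points are the standard Devanagari block assignments, and the variable names are my own:

```python
# Devanagari code points (Unicode Devanagari block, U+0900-U+097F):
JA, NYA  = "\u091C", "\u091E"   # ja, nya
SHA, RA  = "\u0936", "\u0930"   # sha, ra
KA, SSHA = "\u0915", "\u0937"   # ka, ssha
VIRAMA   = "\u094D"             # halant (virama)

# The three clusters, composed exactly as described above:
jna  = JA  + VIRAMA + NYA    # rendered as the jna ligature
shra = SHA + VIRAMA + RA     # rendered as the shra ligature
ksha = KA  + VIRAMA + SSHA   # rendered as the ksha ligature

# Each is three code points in memory, even if displayed as one glyph:
print(len(ksha))  # 3
```

A conformant renderer is expected to select the ligature glyph for these sequences, so no separate code points are needed for display; the question in this thread is whether display-level composition is enough.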
I beg to differ with you on this point. Merely having some provision for composing a character doesn't mean that the character is not a candidate for inclusion as a separate code point.

India is a big country with millions of people, geographically divided and speaking a variety of languages. Sentiments are attached to cultures, which may vary from one geographical area to another. So when one of the many languages falling under the same script dominates the entire encoding for that script, the other groups of people may feel that their language has not been represented properly in the encoding. While Unicode encodes scripts only, the aim was to provide sufficient representation to as many languages as possible.

In Unicode, many characters have been given code points regardless of the fact that the same character could have been rendered through some composition mechanism. This includes Indic scripts as well as other scripts. For example, in the Devanagari script some code points are allocated to characters (consonant + nukta) even though the same characters could be produced with a combination of the consonant and the nukta. Similarly, in the Latin-1 range [U+0080-U+00FF] there are a few characters which can be produced otherwise. That is why text should be normalized to either a precomposed or a decomposed character sequence before further processing in operations like searching and sorting.

Also, processing of text often depends on the smallest addressable unit of that language. Again, as discussed in earlier e-mails, this may vary from one language to another within the same script. Consider a case where a language processor/application wants to count the number of characters in some text in order to find the number of keystrokes required to input it. Further assume that the API functions used for this purpose are based on either WChar (wide characters) or UTF-8. In this case it is very much necessary that you assign the character, say Kssha, to the class "consonant".
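Both points — normalization before searching/sorting, and the dependence of counting on the addressable unit — can be illustrated concretely. A short Python sketch using the standard `unicodedata` module; the specific code points are illustrative examples of my own choosing:

```python
import unicodedata

# --- Normalization before searching/sorting ---
precomposed = "\u00E9"   # e-acute as one precomposed Latin-1 code point
decomposed  = "e\u0301"  # 'e' + combining acute accent

print(precomposed == decomposed)   # False: same text, different code sequences
# After normalizing both to NFC (or both to NFD) they compare equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# Devanagari consonant+nukta example: QA (U+0958) canonically
# decomposes to KA (U+0915) + NUKTA (U+093C):
print(unicodedata.normalize("NFD", "\u0958") == "\u0915\u093C")  # True

# --- Counting depends on the addressable unit ---
ksha = "\u0915\u094D\u0937"          # ka + halant + ssha, one displayed glyph
print(len(ksha))                     # 3 code points (wide-character view)
print(len(ksha.encode("utf-8")))     # 9 bytes (UTF-8 view)
```

So an application gets three different "lengths" for what a native reader may perceive as a single character, which is exactly the ambiguity being discussed.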
Since assignment to this class "consonant" applies to a single code point (the smallest addressable unit) and not to a sequence of code points, it is very much necessary to have a single code point for the character "Kssha". This is my understanding; please enlighten me if I am wrong.

Regards,
Keyur