Re: discontent about Indic scripts and Unicode
"Carl W. Brown" wrote:

> It looks like ISCII and Unicode are addressing two different multi-lingual issues. Unicode deals with problems like Chinese where you have the same writing for different spoken languages.

I don't think so. ISCII deals with languages that share a "similar" writing system.

> When it comes to using the same language or similar languages that use different scripts, the answer is transliteration, which is an implementation process that is independent of the Unicode encoding.

Indic languages are in general mutually incomprehensible, which makes them distinct; the writing systems are also distinct, but they have a similar structure since they originate from the Brahmi script. IMHO transliteration may be one of the goals, but more importantly the goal of ISCII was to design a unified encoding system for Indic scripts that have a similar structure. That is the reason why Perso-Arabic scripts were excluded and a different standard was envisaged.

> The ISCII is an attempt to provide cheap transliteration by using the same encoding and just changing the font. You can not do that with Unicode. However, I suspect that the transliteration approach will produce better results if properly implemented.

Not quite. There are characters in Southern Indic scripts that should be treated as illegal in Northern Indic scripts and vice versa. ISCII is not clear on this, but the ICU converter treats them as illegal. The only superset of all Indic scripts in ISCII is Devanagari (with the addition of characters for Southern scripts, Urdu, English, etc.). So transliteration between Gurmukhi (Punjabi) and Telugu will fail if it is based only on byte values and these exceptions are not considered.

> However, I do not understand the TSCII for Tamil. Unicode provides the script separation that they want.

TSCII is a whole different story. I agree with MichKa; I too disagree with them.

One amusing comment on the TSCII list regarding conversion between Unicode and TSCII was "For all practical purposes conversion between TSCII and Unicode is equivalent to conversion between ISO-8859-1 and Unicode".

Regards,
Ram

Ram Viswanadha
Unicode Software Engineer
IBM, International Components for Unicode
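The same failure mode can be sketched on the Unicode side, where the Indic blocks follow the ISCII layout closely enough that a naive transliterator can simply shift code points from one block to another. This is only an illustration of why a pure offset (or byte-value) mapping is not enough; it is not how ICU does it. The helper name `shift_script` and the small `BLOCK_START` table are made up for this sketch.

```python
import unicodedata

# First code point of each 128-code-point Unicode block used in this sketch.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Gurmukhi": 0x0A00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
}

def shift_script(text, source, target):
    """Naively move each character from the source block to the target block.

    Returns (converted_text, failures). failures lists the source characters
    whose shifted code point is unassigned in the target block.
    """
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out, failures = [], []
    for ch in text:
        cp = ord(ch)
        if not (src <= cp < src + 0x80):
            out.append(ch)  # outside the source block: copy unchanged
            continue
        candidate = chr(cp - src + dst)
        if unicodedata.name(candidate, None) is None:
            failures.append(ch)  # no character at the same offset
        else:
            out.append(candidate)
    return "".join(out), failures

# TELUGU LETTER KA has a Gurmukhi counterpart at the same offset ...
print(shift_script("\u0C15", "Telugu", "Gurmukhi"))
# ... but TELUGU LETTER E (short e) does not, so it is reported as a failure.
print(shift_script("\u0C0E", "Telugu", "Gurmukhi"))
```

A real transliterator would need per-script exception tables for the characters that land in the failure list, which is the point about exceptions made above.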
Re: discontent about Indic scripts and Unicode
From: "Carl W. Brown" <[EMAIL PROTECTED]>

> However, I do not understand the TSCII for Tamil. Unicode provides the script separation that they want.

TSCII is mostly out of favor now (tamil.net being the main exception, and that only because its webmaster hates all established standards for doing anything!). The preferred encodings are TAM and TAB.

As to why they prefer TA[M|B] to Unicode, the reasons are many. I happen to disagree with them myself, as do many of the members of WG02 of INFITT, but that's a story for another day, I think? :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
RE: discontent about Indic scripts and Unicode
Ram,

> ISCII has escape sequences which announce the start of a new Indic script. An ATR char followed by a special codepoint forms the escape sequence. It is possible to support a page that contains different Indic scripts. There are problems with the standard; for example, it assumes a default starting language, which makes sense if the input is from a keyboard and the language is obtained from the environment, but not if data is exchanged between computers.

It looks like ISCII and Unicode are addressing two different multi-lingual issues. Unicode deals with problems like Chinese where you have the same writing for different spoken languages. When it comes to using the same language or similar languages that use different scripts, the answer is transliteration, which is an implementation process that is independent of the Unicode encoding.

The ISCII is an attempt to provide cheap transliteration by using the same encoding and just changing the font. You can not do that with Unicode. However, I suspect that the transliteration approach will produce better results if properly implemented.

However, I do not understand the TSCII for Tamil. Unicode provides the script separation that they want.

Carl
Re: discontent about Indic scripts and Unicode
On Wed, 19 Sep 2001, Rick McGowan wrote:

> > If ISCII is still being developed does this suggest that Unicode and its ISO equivalent move too slowly?
>
> ISCII dates back to 1988 with a revision in 1990. It's not "still being developed" -- as far as I know, it's a stable standard that is under routine maintenance.
>
> I wonder if anyone has yet corresponded with the people who put up the almost unbelievable misconceptions on the two web pages mentioned yesterday? At least a note could go to the site owners, I would think.
>
> Rick

Wednesday, September 19, 2001

I agree that the 1991 version of ISCII has been stable for representation of Indian scripts of Indian origin. I do not know if a standard for encoding of Perso-Arabic script for Urdu, Sindhi, etc. has advanced beyond being "envisaged" as mentioned in my earlier note.

The term ISSCII (for Indian script standard code for information interchange) dates back at least to the July 1983 report of the Government of India's Sub-Committee on Standardization of Indian Scripts and Their Codes for Information Processing, entitled "Standardization of Indian script codes for information interchange." (I'm not sure when the second 'S' was dropped.)

Page iv of the 1991 standard is devoted to history and begins: "Since the 70s, different committees of the Department of Official Languages and the Department of Electronics (DOE) have been evolving different codes and keyboards which would cater to all the Indian scripts due to their common phonetic structure." Besides the 1983 version it mentions ones of 1986 and 1988. The 1983 report cites a March 1981 "interim report".

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )

The above are purely personal opinions, not necessarily the official views of any government or any agency of any.

Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: discontent about Indic scripts and Unicode
> If ISCII is still being developed does this suggest that Unicode and its ISO equivalent move too slowly?

ISCII dates back to 1988 with a revision in 1990. It's not "still being developed" -- as far as I know, it's a stable standard that is under routine maintenance.

I wonder if anyone has yet corresponded with the people who put up the almost unbelievable misconceptions on the two web pages mentioned yesterday? At least a note could go to the site owners, I would think.

Rick
RE: discontent about Indic scripts and Unicode
On Wed, 19 Sep 2001, Carl W. Brown wrote:

> Ram,
>
> If ISCII is intended as a pan-Indic solution does it also support Urdu?
>
> Carl

Wednesday, September 19, 2001

No, from the foreword to ISCII: "As Perso-Arabic scripts have a different alphabet, a different standard is envisaged for them."

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )

The above are purely personal opinions, not necessarily the official views of any government or any agency of any.

Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: discontent about Indic scripts and Unicode
If ISCII is still being developed does this suggest that Unicode and its ISO equivalent move too slowly?
RE: discontent about Indic scripts and Unicode
Ram, If ISCII is intended as a pan-Indic solution does it also support Urdu? Carl
RE: discontent about Indic scripts and Unicode
Ram,

> ISCII has escape sequences which announce the start of a new Indic script. An ATR char followed by a special codepoint forms the escape sequence. It is possible to support a page that contains different Indic scripts. There are problems with the standard; for example, it assumes a default starting language, which makes sense if the input is from a keyboard and the language is obtained from the environment, but not if data is exchanged between computers.

I don't feel that state-shifted code pages are very useful except for transporting blocks of data. When you have state information you can not use standard text manipulation.

You will notice that there are ISCII hacks for Windows, for example, but Win2000 uses Unicode. The reason is that if you use something like MLang to concatenate fonts, you have to write in Unicode. It would be a mess to map out which font to use for each character if each font used a different character set. What API would you use that could specify different character sets and code-page-dependent shifting? It would be a real mess. You would first have to pass the text through layout and then again for display.

Unicode simplifies the process. Why develop an Indic-only solution?

Carl
Re: discontent about Indic scripts and Unicode
Carl,

"Carl W. Brown" wrote:

> What was really missing was the point that Unicode is designed to support multi-lingual text.

So is ISCII. In fact, Unicode support for Indic scripts is based on ISCII.

> If we use ISCII how can we support a page that contains different Indic scripts?

ISCII has escape sequences which announce the start of a new Indic script. An ATR char followed by a special codepoint forms the escape sequence. It is possible to support a page that contains different Indic scripts. There are problems with the standard; for example, it assumes a default starting language, which makes sense if the input is from a keyboard and the language is obtained from the environment, but not if data is exchanged between computers.

Regards,
Ram Viswanadha
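To make the escape-sequence mechanism concrete, here is a minimal sketch of splitting a byte stream into per-script runs. The ATR value (0xEF) and the script codes in the table are assumptions taken from memory of ISCII-91 and should be checked against the standard; the function name `split_by_script` is hypothetical, and the payload bytes are arbitrary placeholders rather than real text.

```python
ATR = 0xEF  # attribute byte (assumed value; verify against ISCII-91)

# Script codes that may follow ATR (assumed subset, for illustration only).
SCRIPT_CODES = {
    0x42: "Devanagari",
    0x44: "Tamil",
    0x45: "Telugu",
    0x4B: "Gurmukhi",
}

def split_by_script(data, default_script):
    """Split an ISCII-style byte string into (script, bytes) runs.

    default_script labels any bytes that appear before the first
    ATR + script-code pair; the stream itself does not say what it is.
    """
    runs = []
    script = default_script
    current = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ATR and i + 1 < len(data) and data[i + 1] in SCRIPT_CODES:
            if current:
                runs.append((script, bytes(current)))
                current = bytearray()
            script = SCRIPT_CODES[data[i + 1]]  # switch state, emit nothing
            i += 2
            continue
        current.append(byte)
        i += 1
    if current:
        runs.append((script, bytes(current)))
    return runs

stream = bytes([0xB3, ATR, 0x45, 0xB3])
print(split_by_script(stream, "Devanagari"))
print(split_by_script(stream, "Tamil"))
```

The two calls differ only in the caller-supplied default, and the first run is labelled differently each time. That is the interchange problem described above: the same bytes mean different things unless sender and receiver agree on the starting script out of band.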
RE: discontent about Indic scripts and Unicode
Ken,

Even those who do not know the details of Indic processing know that you can not argue both sides of the issue. There was a lot of criticism of the fact that there were differences in scripts, yet there was no mention that Unicode, because of its extended code base, does support individualized scripts for the different languages and can accommodate these differences.

What was really missing was the point that Unicode is designed to support multi-lingual text. If we use ISCII how can we support a page that contains different Indic scripts? We can also use a common layout engine because it can adapt to script differences.

I don't think that they think that there is a keyboarding system that can enter data that will correspond to post-layout display text. This was sort of implied as doable but not demonstrated.

Carl
Re: discontent about Indic scripts and Unicode
Jarkko reported:

> I happened across these links:
>
> http://acharya.iitm.ac.in/multi_sys/exist_codes.html
> http://acharya.iitm.ac.in/multi_sys/uni_iscii.html
>
> which do contain a nice discussion about ISCII but then they discuss Unicode in, ummm, somewhat negative terms.
>
> Myself knowing next to nothing about Indic scripts it would be nice to hear comments from someone who does know.

The Government of India is a member of the Unicode Consortium, and has been engaged in a dialogue with the UTC about a number of perceived problems in the Indic blocks. The UTC just received and is in the process of responding to a long detailed list of perceived problems and suggested improvements.

Some of the problems are merely missing characters or misleading or missing annotations. Such problems can be readily fixed.

Some of the perceived problems have to do with misconceptions about the relationship between the encoding and collation. That has to be addressed basically by communication and education, and by rolling out working implementations.

Some of the perceived problems are just the result of fundamental disagreements about the encoding model, as mentioned by MichKa, particularly for Tamil. On those, we have to agree to disagree, and others may choose to implement local solutions based on non-Unicode encodings.

> I do notice some misunderstanding about Unicode in the above links, quoting from the first one:

Disquieting, isn't it, how easy it is for people to misconstrue something they don't understand, and then set up elaborate arguments to critique their misconstruals.

--Ken
Re: discontent about Indic scripts and Unicode
This is the same problem that was discussed extensively for Tamil at TI2001 in Kuala Lumpur last month.

Basically, it boils down to three problems:

1) Most of the people involved do not understand Unicode or how it works.

2) Most of the people involved expect natural language processing to be a feature that any solution ought to support (thus making Unicode inadequate for a valid purpose).

3) The people who do not have problems with #1 or #2 are not as loud as the people who do -- which contributes to the inertia.

These two pages conveniently take issues #1 and #2 and handle them separately. Very thoughtful of the author.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Hietaniemi Jarkko (NRC/Boston)" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 18, 2001 1:03 PM
Subject: discontent about Indic scripts and Unicode

> I happened across these links:
>
> http://acharya.iitm.ac.in/multi_sys/exist_codes.html
> http://acharya.iitm.ac.in/multi_sys/uni_iscii.html
>
> which do contain a nice discussion about ISCII but then they discuss Unicode in, ummm, somewhat negative terms.
>
> Myself knowing next to nothing about Indic scripts it would be nice to hear comments from someone who does know.
>
> I do notice some misunderstanding about Unicode in the above links, quoting from the first one:
>
> > Unicode, besides permitting an 8 bit representation for each language, adds an 8 bit identifier as a most significant byte to make the code 16 bits. Data processing software using Unicode will be able to identify the Language of the text for each character and use appropriate fonts to display them.
> >
> > Technically, Unicode can handle 256 different languages but in practice, this number is significantly smaller. Unicode has allowed nearly 24000 characters of Chinese, Japanese and Korean scripts to be included in a single set. Currently fewer than a hundred languages are included in the Unicode.
> >
> > Even though it is a sixteen bit code, Unicode usually provides for about 128 characters for each language.
>
> A messy conflation of "languages" and "characters" and "fonts". Not to forget "sixteen bit code".
>
> The web site has been updated in July.
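The "8 bit language identifier" reading quoted above is easy to check against the character database. The following is a small sketch using Python's standard `unicodedata` module; the helper name `describe` and the choice of sample characters are only for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point in U+ notation, high byte, character name)."""
    cp = ord(ch)
    return (f"U+{cp:04X}", cp >> 8, unicodedata.name(ch))

samples = [
    "\u0915",      # Devanagari ka
    "\u0995",      # Bengali ka: a different script, same high byte 0x09
    "\u00E9",      # e with acute: Latin script, high byte 0x00
    "\u0151",      # o with double acute: Latin script, high byte 0x01
    "\U00010000",  # a code point that does not fit in 16 bits at all
]

for ch in samples:
    print(describe(ch))
```

Two scripts share one high byte, one script spans several high bytes, and code points exist above U+FFFF, so the most significant byte cannot be a language tag. Unicode encodes characters grouped by script, and language is not recorded in the code point at all.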