Re: discontent about Indic scripts and Unicode

2001-09-20 Thread Ram Viswanadha



"Carl W. Brown" wrote:


> It looks like ISCII and Unicode are addressing two different multi-lingual
> issues.  Unicode deals with problems like Chinese where you have the same
> writing for different spoken languages.

I don't think so.
ISCII deals with languages that share a "similar" writing system.

> When it comes to using the same
> language or similar languages that use different scripts, the answer is
> transliteration, which is an implementation process that is independent of
> the Unicode encoding.

Indic languages are in general mutually incomprehensible, which makes them
distinct. The writing systems are also distinct but have a similar structure,
since they originate from the Brahmi script.
IMHO transliteration may be one of the goals, but more importantly the goal of
ISCII was to design a unified encoding system for Indic scripts that have a
similar structure. That is the reason why Perso-Arabic scripts were excluded
and a different standard was envisaged.

>
> The ISCII is an attempt to provide cheap transliteration by using the same
> encoding and just changing the font.  You can not do that with Unicode.
> However, I suspect that the transliteration approach will produce better
> results if properly implemented.

Not quite. There are characters in Southern Indic scripts that should be
treated as illegal in Northern Indic scripts and vice versa. ISCII is not clear
on this, but the ICU converter treats them as illegal.
The only superset of all Indic scripts in ISCII is Devanagari (with the addition
of characters for Southern scripts, Urdu, English, etc.). So transliteration
between Gurmukhi (Punjabi) and Telugu will fail if it is based only on byte
values and these exceptions are not considered.
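
The same pitfall can be illustrated with Unicode code points instead of ISCII
bytes, since the Unicode Indic blocks follow the ISCII layout. What follows is
only a minimal sketch, assuming a fixed-offset transliterator; the function and
table names are hypothetical and this is not ICU code. Telugu has a short E at
U+0C0E, but the corresponding Gurmukhi slot U+0A0E is unassigned, so a blind
shift produces an invalid character.

```python
import unicodedata

# First code point of each 128-wide Unicode block (hypothetical helper table).
BLOCK_START = {"devanagari": 0x0900, "gurmukhi": 0x0A00, "telugu": 0x0C00}


def naive_transliterate(text, src, dst):
    """Shift characters of the src block to the same slot in the dst block.

    Raises ValueError when the target slot is unassigned, which is the
    kind of exception a byte-value-only approach would miss.
    """
    start = BLOCK_START[src]
    offset = BLOCK_START[dst] - start
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            target = chr(cp + offset)
            if unicodedata.name(target, None) is None:
                raise ValueError(
                    f"U+{cp:04X} has no counterpart at U+{cp + offset:04X}"
                )
            out.append(target)
        else:
            out.append(ch)  # leave characters outside the block untouched
    return "".join(out)


# Telugu KA maps cleanly to Gurmukhi KA.
print(naive_transliterate("\u0C15", "telugu", "gurmukhi"))

# Telugu short E has no Gurmukhi counterpart.
try:
    naive_transliterate("\u0C0E", "telugu", "gurmukhi")
except ValueError as err:
    print("failed:", err)
```

A real transliterator needs an exception table (or a fallback mapping) for
such slots rather than a bare offset.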

>
> However, I do not understand the TSCII for Tamil.  Unicode provides the
> script separation that they want.

TSCII is a whole different story. I agree with Michka; I too disagree with them.
One amusing comment on the TSCII list regarding conversion between Unicode and
TSCII was:
"For all practical purposes conversion between TSCII and Unicode is equivalent
to conversion between ISO-8859-1 and Unicode".

Regards,

Ram





Re: discontent about Indic scripts and Unicode

2001-09-19 Thread Michael (michka) Kaplan

From: "Carl W. Brown" <[EMAIL PROTECTED]>

> However, I do not understand the TSCII for Tamil.  Unicode
> provides the script separation that they want.

TSCII is mostly out of favor now (tamil.net being the main exception, and
that only because its webmaster hates all established standards for doing
anything!). The preferred encodings are TAM and TAB.

As to why they prefer TA[M|B] to Unicode, the reasons are many. I happen to
disagree with them, myself, as do many of the members of WG02 of INFITT, but
that's a story for another day, I think? :-)


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/








RE: discontent about Indic scripts and Unicode

2001-09-19 Thread Carl W. Brown

Ram,

> ISCII has escape sequences which announce the start of a new Indic script.
> An ATR char followed by a special codepoint forms the escape sequence.
> It is possible to support a page that contains different Indic scripts.
> There are problems with the standard: for example, it assumes a default
> starting language, which makes sense if the input is from a keyboard and the
> language is obtained from the environment, but not if data is exchanged
> between computers.

It looks like ISCII and Unicode are addressing two different multi-lingual
issues.  Unicode deals with problems like Chinese where you have the same
writing for different spoken languages.  When it comes to using the same
language or similar languages that use different scripts, the answer is
transliteration, which is an implementation process that is independent of
the Unicode encoding.

The ISCII is an attempt to provide cheap transliteration by using the same
encoding and just changing the font.  You can not do that with Unicode.
However, I suspect that the transliteration approach will produce better
results if properly implemented.

However, I do not understand the TSCII for Tamil.  Unicode provides the
script separation that they want.

Carl





Re: discontent about Indic scripts and Unicode

2001-09-19 Thread James E. Agenbroad

On Wed, 19 Sep 2001, Rick McGowan wrote:

> > If ISCII is still being developed does this suggest that Unicode and its ISO 
> > equivalent move too slowly?
> 
> ISCII dates back to 1988 with a revision in 1990.  It's not "still being  
> developed" -- as far as I know, it's a stable standard that is under  
> routine maintenance.
> 
> I wonder if anyone has yet corresponded with the people who put up the  
> almost unbelievable misconceptions on the two web pages mentioned  
> yesterday?  At least a note could go to the site owners, I would think.
> 
>   Rick
 Wednesday, September 19, 2001
I agree that the 1991 version of ISCII has been stable for representation
of Indian scripts of Indian origin.  I do not know if a standard for
encoding of Perso-Arabic script for Urdu, Sindhi, etc. has advanced beyond
being "envisaged" as mentioned in my earlier note. 
The term ISSCII (for Indian script standard code for information
interchange) dates back at least to the July 1983 report of the Government
of India's Sub-Committee on Standardization of Indian Scripts and Their
Codes for Information Processing, entitled "Standardization of Indian
script codes for information interchange." (I'm not sure when the second
'S' was dropped.) Page iv of the 1991 standard is devoted to history and
begins: "Since the 70s, different committees of the Department of Official
Languages and the Department of Electronics (DOE) have been evolving
different codes and keyboards which would cater to all the Indian scripts
due to their common phonetic structure."  Besides the 1983 version it
mentions ones of 1986 and 1988.  The 1983 report cites a March 1981
"interim report".

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





Re: discontent about Indic scripts and Unicode

2001-09-19 Thread Rick McGowan

> If ISCII is still being developed does this suggest that Unicode and its ISO 
> equivalent move too slowly?

ISCII dates back to 1988 with a revision in 1990.  It's not "still being  
developed" -- as far as I know, it's a stable standard that is under  
routine maintenance.

I wonder if anyone has yet corresponded with the people who put up the  
almost unbelievable misconceptions on the two web pages mentioned  
yesterday?  At least a note could go to the site owners, I would think.

Rick




RE: discontent about Indic scripts and Unicode

2001-09-19 Thread James E. Agenbroad

On Wed, 19 Sep 2001, Carl W. Brown wrote:

> Ram,
> 
> If ISCII is intended as a pan-Indic solution does it also support Urdu?
> 
> Carl
> 
  Wednesday, September 19, 2001
No, from the foreword to ISCII: "As Perso-Arabic scripts have a different
alphabet, a different standard is envisaged for them." 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





Re: discontent about Indic scripts and Unicode

2001-09-19 Thread Charlie Jolly

If ISCII is still being developed does this suggest that Unicode and its ISO
equivalent move too slowly?






RE: discontent about Indic scripts and Unicode

2001-09-19 Thread Carl W. Brown

Ram,

If ISCII is intended as a pan-Indic solution does it also support Urdu?

Carl





RE: discontent about Indic scripts and Unicode

2001-09-18 Thread Carl W. Brown

Ram,


>
> ISCII has escape sequences which announce the start of a new Indic script.
> An ATR char followed by a special codepoint forms the escape sequence.
> It is possible to support a page that contains different Indic scripts.
> There are problems with the standard: for example, it assumes a default
> starting language, which makes sense if the input is from a keyboard and the
> language is obtained from the environment, but not if data is exchanged
> between computers.
>
>

I don't feel that state-shifted code pages are very useful except for
transporting blocks of data.  When you have state information you cannot
use standard text manipulation.

You notice that there are ISCII hacks for Windows, for example, but Win2000
uses Unicode.  The reason is that if you use something like MLang to
concatenate fonts you have to write in Unicode.

It would be a mess to map out which font to use for each character if each
font used a different character set.  What API would you use that could
specify different character sets and code page dependent shifting?  It would
be a real mess.  You would first have to pass it through a layout and then
again for display.  Unicode simplifies the process.  Why develop an Indic
only solution?

Carl








Re: discontent about Indic scripts and Unicode

2001-09-18 Thread Ram Viswanadha

Carl,

"Carl W. Brown" wrote:

>
>
> What was really missing was the point that Unicode is designed to support
> multi-lingual text.

So is ISCII. In fact, Unicode support for Indic scripts is based on ISCII.

> If we use ISCII how can we support a page that
> contains different Indic scripts?
>

ISCII has escape sequences which announce the start of a new Indic script.
An ATR char followed by a special codepoint forms the escape sequence.
It is possible to support a page that contains different Indic scripts.
There are problems with the standard: for example, it assumes a default
starting language, which makes sense if the input is from a keyboard and the
language is obtained from the environment, but not if data is exchanged
between computers.
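
The mechanism can be sketched roughly as follows. This is an illustration
only, not the ICU converter: the function name is hypothetical, and the byte
values (ATR = 0xEF, plus the script codes in the table) are given as I recall
them from ISCII-91, so they are assumptions to be checked against the standard.
Note how the first run has no script at all unless the caller supplies a
default, which is exactly the interchange problem.

```python
# Assumed values, to be verified against the ISCII-91 standard.
ATR = 0xEF
SCRIPT_CODES = {
    0x42: "Devanagari",
    0x44: "Tamil",
    0x45: "Telugu",
    0x4B: "Punjabi",
}


def split_runs(data, default_script=None):
    """Split an ISCII-style byte stream into (script, bytes) runs.

    An ATR byte followed by a known script code switches the current
    script. Bytes seen before any switch belong to default_script.
    """
    runs = []
    script = default_script
    buf = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ATR and i + 1 < len(data) and data[i + 1] in SCRIPT_CODES:
            if buf:
                runs.append((script, bytes(buf)))
                buf = bytearray()
            script = SCRIPT_CODES[data[i + 1]]
            i += 2
            continue
        buf.append(byte)
        i += 1
    if buf:
        runs.append((script, bytes(buf)))
    return runs


# Same payload byte before and after a switch to Telugu.
stream = bytes([0xB3, ATR, 0x45, 0xB3])
print(split_runs(stream))                 # first run has script None
print(split_runs(stream, "Devanagari"))   # caller must supply the default
```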

Regards

Ram Viswanadha





RE: discontent about Indic scripts and Unicode

2001-09-18 Thread Carl W. Brown

Ken,

Even those who do not know the details of Indic processing know that you can
not argue both sides of the issue.  There was a lot of criticism of the fact
that there were differences in scripts yet there was no mention that Unicode
because of its extended code base does support individualized scripts for
the different languages and can accommodate these differences.

What was really missing was the point that Unicode is designed to support
multi-lingual text.  If we use ISCII, how can we support a page that
contains different Indic scripts?

We can also use a common layout engine because it can adapt to script
differences.  I don't think that they think that there is a keyboarding
system that can enter data that will correspond to post layout display text.
This was sort of implied as doable but not demonstrated.

Carl






Re: discontent about Indic scripts and Unicode

2001-09-18 Thread Kenneth Whistler

Jarkko reported:

> I happened across these links:
> 
> http://acharya.iitm.ac.in/multi_sys/exist_codes.html
> http://acharya.iitm.ac.in/multi_sys/uni_iscii.html
> 
> which do contain a nice discussion about ISCII but then they
> discuss Unicode in, ummm, somewhat negative terms.
> 
> Myself knowing next to nothing about Indic scripts it would be nice
> to hear comments from someone who does know.

The Government of India is a member of the Unicode Consortium,
and has been engaged in a dialogue with the UTC about a number
of perceived problems in the Indic blocks. The UTC just received
and is in the process of responding to a long detailed list of
perceived problems and suggested improvements.

Some of the problems are merely missing characters or misleading
or missing annotations. Such problems can be readily fixed.

Some of the perceived problems have to do with misconceptions
about the relationship between the encoding and collation.
That has to be addressed basically by communication and
education, and by rolling out working implementations.

Some of the perceived problems are just the result of
fundamental disagreements about the encoding model, as mentioned
by MichKa, particularly for Tamil. On those, we have to agree
to disagree, and others may choose to implement local solutions
based on non-Unicode encodings.

> I do notice some misunderstanding about Unicode in the above links,
> quoting from the first one:

Disquieting, isn't it, how easy it is for people to misconstrue
something they don't understand, and then set up elaborate
arguments to critique their misconstruals.

--Ken




Re: discontent about Indic scripts and Unicode

2001-09-18 Thread Michael (michka) Kaplan

This is the same problem that was discussed extensively for Tamil at TI2001
in Kuala Lumpur last month. Basically, it boils down to three problems:

1) Most of the people involved do not understand Unicode or how it works.
2) Most of the people involved expect natural language processing to be a
feature that any solution ought to support (thus making Unicode inadequate
for a valid purpose).
3) The people who do not have problems with #1 or #2 are not as loud as the
people who do -- which contributes to the inertia.

These two pages conveniently take issues # 1 and #2 and handle them
separately. Very thoughtful of the author.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Hietaniemi Jarkko (NRC/Boston)" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 18, 2001 1:03 PM
Subject: discontent about Indic scripts and Unicode


> I happened across these links:
>
> http://acharya.iitm.ac.in/multi_sys/exist_codes.html
> http://acharya.iitm.ac.in/multi_sys/uni_iscii.html
>
> which do contain a nice discussion about ISCII but then they
> discuss Unicode in, ummm, somewhat negative terms.
>
> Myself knowing next to nothing about Indic scripts it would be nice
> to hear comments from someone who does know.
>
> I do notice some misunderstanding about Unicode in the above links,
> quoting from the first one:
>
> > Unicode, besides permitting an 8 bit representation for each language,
> > adds an 8 bit identifier as a most significant byte to make the code 16
> > bits. Data processing software using Unicode will be able to identify the
> > Language of the text for each character and use appropriate fonts to
> > display them.
> >
> > Technically, Unicode can handle 256 different languages but in practice,
> > this number is significantly smaller. Unicode has allowed nearly 24000
> > characters of Chinese, Japanese and Korean scripts to be included in a
> > single set. Currently fewer than a hundred languages are included in the
> > Unicode.
> >
> > Even though it is a sixteen bit code, Unicode usually provides for about
> > 128 characters for each language.
>
> A messy conflation of "languages" and "characters" and "fonts".  Not to
> forget "sixteen bit code".
>
> The web site has been updated in July.
>
>
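
A quick check of the "most significant byte identifies the language" claim
quoted above, as a minimal sketch using Python's standard unicodedata module
(the helper names are hypothetical): several Indic scripts share the same high
byte, because each block is 128 code points wide, not 256.

```python
import unicodedata


def high_byte(ch):
    """Most significant byte of a BMP code point."""
    return ord(ch) >> 8


def script_word(ch):
    """First word of the character name, e.g. 'DEVANAGARI'."""
    return unicodedata.name(ch).split()[0]


# The letter KA in four scripts.
samples = ["\u0915", "\u0995", "\u0A15", "\u0A95"]
for ch in samples:
    print(f"U+{ord(ch):04X}  high byte 0x{high_byte(ch):02X}  "
          f"{unicodedata.name(ch)}")
```

Devanagari and Bengali both fall under high byte 0x09, and Gurmukhi and
Gujarati under 0x0A, so the high byte names neither a script nor a language.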