Re: Devanagari

2002-01-20 Thread James Kass


Aman Chawla wrote,



 I would be grateful if I could get opinions on the following:

 1. Which encoding/character set is most suitable for using Hindi/Marathi
 (both of which use Devanagari) on the internet as well as in databases, and
 why? In your response, please refer to:
 http://www.iiit.net/ltrc/Publications/iscii_plugin_display.html,
 particularly the following paragraphs:
snip

Unicode is the best.  It is the world's standard for computer encoding and,
as such, offers the best chance that text can be exchanged around the
globe and across platforms.

The arguments about relative size are true, but in this day and age they are
considered unimportant.  Graphics files are extremely large in comparison
with text files of any script, and so are sound files.  Devanagari characters
in UTF-8 take three bytes each.  The four-byte UTF-8 sequences are so far
used only for Plane One of Unicode and up.
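A quick Python check of those byte counts (the Old Italic letter stands in for a Plane One character):

```python
# UTF-8 byte counts for characters from different planes of Unicode.
for ch in ("A", "क", "𐌀"):          # Latin, Devanagari, Old Italic (Plane One)
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} bytes")
# A is 1 byte, क is 3 bytes, and the Plane One character is 4 bytes.
```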

 3. With reference to the previous question, can programs that convert
 the myriad Devangari encodings in use today to a standard encoding
 (question 1) be made freely available, and how?

Yes, converters exist and are being distributed.  Just go to the Google
search engine and input "character conversion Unicode" into the box.
Look for ICU and Rosette, to name a few.  You might even run across
Mark Leisher's download page at:
  http://crl.nmsu.edu/~mleisher/download.html
and see the Perl script for converting the Naidunia Devanagari encoding
to UTF-16.

 4. Is there any search engine on the internet that maintains an up to date
 index of sites in Devanagari? If not, what can be done to encourage
 proprietary search engines to support Hindi? Google supposedly has a
 Hindi language option, but surprise, it's in Roman script! Several emails
 to them have elicited the response: At the moment we don't support
 Devanagari...

This appears to be because Google is converting UTF-8 strings input
to the search words box into decimal NCRs.

I pasted यूनिकोड क्या है into the Google box, and it displays fine.  Since the
"What is Unicode?" pages are popular and have been up for a while, I
thought the phrase would have a good chance of being indexed.  But
there were no hits for the resulting search string:
&#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337;
&#2325;&#2381;&#2351;&#2366; &#2361;&#2376;
...which is not surprising, since the actual page doesn't use NCRs.
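The conversion described can be sketched in Python; this hypothetical `to_decimal_ncrs` helper reproduces the search string above (Google's actual pipeline is not public):

```python
def to_decimal_ncrs(text):
    """Replace each non-ASCII character with a decimal numeric character
    reference (NCR), mimicking what Google's search box appeared to do.
    (Hypothetical helper; an illustration only.)"""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

print(to_decimal_ncrs("यूनिकोड क्या है"))
# &#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337; &#2325;&#2381;&#2351;&#2366; &#2361;&#2376;
```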

Best regards,

James Kass.








Devanagari Rupee Symbol

2002-01-20 Thread Aman Chawla



I am unable to find the Devanagari Rupee sign
encoded in Unicode. Is it encoded? If not, why?



Re: The benefit of a symbol for 2 pi

2002-01-20 Thread Robert Palais


Hi James,

   I appreciate the research, and the humor! 2 pis = peace, eh?
(not on the unicode list! :-) but I like that especially since
the issue of a name has been problematic. e to the i peace =1
circumference = peace times r, integral from zero to peace,
period = peace over frequency, has a nice ring to it!

Peace,
Bob

On Sat, 19 Jan 2002, James Kass wrote:

 Couldn't find such a glyph, but there are some that are vaguely
 similar:
 
 U+29B7 CIRCLED PARALLEL
 U+238B BROKEN CIRCLE WITH NORTHWEST ARROW
 U+229D CIRCLED DASH
 
 And, with a note of mild humor, possibly the PEACE SIGN at U+262E
 might serve?  It doesn't seem to be used much these days.  With the
 vagaries of English plurals, perhaps peace is the proper plural for pi...
 
 Best regards,
 
 James Kass,
 
 P.S. - With so-called smart fonts, which are really just OpenType fonts,
 a string such as the digit two followed by the Greek pi could be replaced
 in the display with a special glyph for 2pi or newpi.  This would not
 alter the original file, it only impacts the display.  The procedure is
 called glyph substitution and support for OpenType is growing.
 
 






Re: Devanagari Rupee Symbol

2002-01-20 Thread Michael Everson

At 11:22 -0500 2002-01-20, Aman Chawla wrote:
I am unable to find the Devanagari Rupee sign encoded in Unicode? Is 
it encoded? If not, why?


U+20A8.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Working with a Unicode terminal

2002-01-20 Thread Alon Dashut

Hello.
My current application works with 'Windows HyperTerminal' using an RS-232
cable from the client machine to the server machine (that means that the
terminal is running on the client).
Until today, the terminal sent 8-bit (char) characters from the client to the
server.
Now I have a need to send Unicode characters (in order to support languages
other than English). Can I still use Windows HyperTerminal? How do I read
such characters on the server machine? When using chars, I would read byte
by byte.
Is there any example code showing how Unicode characters are read?
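Assuming the client transmits UTF-8, one way to read such characters byte by byte on the server is with an incremental decoder; a Python sketch (the serial port is simulated here):

```python
import codecs

# Simulate Unicode text arriving one byte at a time, as over an RS-232 line,
# assuming the client transmits UTF-8.
decoder = codecs.getincrementaldecoder("utf-8")()

stream = "नमस्ते".encode("utf-8")        # what the client would send
received = ""
for i in range(len(stream)):
    byte = stream[i:i + 1]               # one byte, as read from the port
    received += decoder.decode(byte)     # emits a character only once complete
print(received)                          # नमस्ते
```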

Any help will be welcome.
Thanks, Dashut Alon.




Re: Devanagari

2002-01-20 Thread Asmus Freytag

At 12:48 AM 1/20/02 -0800, James Kass wrote:
The arguments about relative size are true, but in this day and age are
considered unimportant.  Graphics files are extremely large in comparison
with text files of any script and so are sound files.  Devanagari UTF-8 is
three bytes.  The four byte UTF-8 sequences so far are only used for
Plane One Unicode and up.

If the argument refers to 4-byte sequences for Devanagari, it is not
factually 'true', as James points out.

More to the point is the following observation: HTML or similar mark-up
languages account for an ever growing percentage of transmission of
text - even in e-mail.

The fact that UTF-8 economizes on the storage for ASCII characters is a
benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and
claims a significant fraction of the data.

A UTF-8 encoded HTML file will therefore have (percentage-wise) less overhead
for Devanagari than claimed. Add to that James's observation on graphics
files, many of which accompany even the simplest HTML documents, and the
percentage difference between the sizes of an English and a Devanagari website
(i.e. in its entirety) is well within the fluctuation of the typical
length, in characters, of the same concept expressed in different languages.

In other words, contrary to the claims made by the argument, it is hard to
predict that this property of UTF-8 will have an observable impact on
exchanging data - other than a psychological one, perhaps.
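A rough calculation with hypothetical figures (assuming markup makes up half the English file) illustrates the point:

```python
# Hypothetical figures: a page whose ASCII markup is as large as its text.
markup = 5000                         # bytes of HTML syntax; pure ASCII either way
text_chars = 5000                     # characters of actual content
english = markup + text_chars * 1     # 1 byte/char in UTF-8
devanagari = markup + text_chars * 3  # 3 bytes/char in UTF-8
print(devanagari / english)           # already 2.0 rather than 3.0

graphics = 30000                      # image bytes, identical for both versions
print((devanagari + graphics) / (english + graphics))  # drops to 1.25
```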

In many size constrained application areas it may pay off to do compression.
http://www.unicode.org/unicode/reports/tr6 shows how one can compress
Unicode Data in Devanagari to a size comparable to that of 8-bit ISCII.
However, interchange of this format (SCSU) requires consenting parties.

A./




Re: Devanagari

2002-01-20 Thread Aman Chawla

 The fact that UTF-8 economizes on the storage for ASCII characters, is a
 benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and
 claims a significant fraction of the data.

 A UTF-8 encoded HTML file, will therefore have (percentage-wise) less overhead
 for Devanagari as claimed. Add to that James' observation on graphics files,
 many of which accompany even the simplest HTML documents and you get a
 percentage difference between the sizes of an English and Devanagari website
 (i.e. in its entirety) that's well within the fluctuation of the typical
 length in characters, for expressing the same concept in different languages.

The point was that a UTF-8 encoded HTML file for an English web page
carrying, say, 10 GIFs would have a file size one-third that of a Devanagari
web page with the same number of GIFs - even if you take into account the
fluctuation of the typical length in characters for expressing the same
concept in different languages. This is because in some cases one language
may express a concept more compactly while in other cases it may not, and on
the whole this effect would balance out and can therefore be neglected.
Therefore transmission of a Devanagari web page over a network would take
thrice as long as that of an English web page using the same images and
presenting the same information.






Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 16:49:17 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 The point was that a UTF-8 encoded HTML file for an English web page
 carrying say 10 gifs would have a file size one-third that for a Devanagari
 web page with the same no. of gifs...
 Therefore transmission of a Devanagari web page over a network would take
 thrice as long as that of an English web page using the same images and
 presenting the same information.

This conclusion ignores two obvious points, which Asmus already made:

(1) The 10 GIFs, each of which may well be larger than the HTML file, take 
the same amount of space regardless of the encoding of the HTML file.  The 
total number of bytes involved in transmitting a Web page includes 
everything, HTML and graphics, but the purported factor of 3 applies only 
to the HTML.

(2) The markup in an HTML file, which comprises a significant portion of the 
file, is all ASCII.  So the factor of 3 doesn't even apply to the entire 
HTML file, only the plain-text content portion.

In addition, text written in Devanagari includes plenty of instances of 
U+0020 SPACE, plus CR and/or LF, each of which occupies one byte 
regardless of the encoding.

I think before worrying about the performance and storage effect on Web pages 
due to UTF-8, it might help to do some profiling and see what the actual 
impact is.
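A crude profiling sketch in Python, separating the markup bytes from the text bytes of a hypothetical, inline HTML page:

```python
from html.parser import HTMLParser

class TextCounter(HTMLParser):
    """Collect the plain-text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

# A hypothetical, deliberately tiny page.
page = "<html><head><title>Test</title></head><body><p>यूनिकोड</p></body></html>"

counter = TextCounter()
counter.feed(page)
content = "".join(counter.chunks)

total_bytes = len(page.encode("utf-8"))
text_bytes = len(content.encode("utf-8"))   # only this part triples for Devanagari
print(f"text is {100 * text_bytes / total_bytes:.0f}% of the file")
```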

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread Christopher Vance

On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote:
: The point was that a UTF-8 encoded HTML file for an English web page
: carrying say 10 gifs would have a file size one-third that for a Devanagari
: web page with the same no. of gifs - even if you take into account the
: fluctuation of the typical length in characters, for expressing the same
: concept in different languages. This is because in some cases one language
: may express a concept more compactly while in other cases it may not, and on
: the whole this effect would balance out and can therefore be neglected.
: Therefore transmission of a Devanagari web page over a network would take
: thrice as long as that of an English web page using the same images and
: presenting the same information.

And the whole UTF-8 Devanagari page is probably still smaller than
even one of the .gif files.

-- 
Christopher Vance




Re: Devanagari

2002-01-20 Thread David Starner

On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote:
 The point was that a UTF-8 encoded HTML file for an English web page
 carrying say 10 gifs would have a file size one-third that for a Devanagari
 web page with the same no. of gifs 

The point is that the text for a short webpage is 10k for English and
30k for Devanagari, the HTML will be another 10k for English and another
10k for Devanagari, and the graphics will be another 30k for English and
another 30k for Devanagari, meaning that the total will be 50k for
English and 70k for Devanagari - a 40% markup, not 200%. Adding a 150k
graphic would make it 200k for English and 220k for Devanagari, making it
a 10% markup. 

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




FrameMaker+SGML 6.0, Content Management and Unicode

2002-01-20 Thread Patrick Andries

Does someone know if FrameMaker+SGML 6.0 supports Unicode?

If not, do standard content management tools such as FrameLink do the 
conversion to Unicode before storing their data in their repository 
(Documentum for instance)?

If you know of such solutions, I would love to hear from you.

Patrick Andries










Re: Devanagari

2002-01-20 Thread James Kass


Doug Ewell wrote,

 
 I think before worrying about the performance and storage effect on Web pages 
 due to UTF-8, it might help to do some profiling and see what the actual 
 impact is.
 

The "What is Unicode?" pages offer a quick study.

14808 bytes (English)
15218 bytes (Hindi)
10808 bytes (Danish)
11281 bytes (French)
 9682 bytes (Chinese Trad.)

(The English page includes links to all the other scripts, but the individual
script pages only link back to the English page.  So, the English page is a
bit larger than the other pages for this reason, not a fair test if we only
count the English and Hindi pages.)

The Unicode logo gif at the top left corner of each of these pages takes
 bytes.  A screen shot of the beginning of the Hindi page takes
37569 bytes as a gif, the small portion cropped and attached takes
4939 bytes.

The "What is Unicode?" pages are at:
http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Best regards,

James Kass.




hindiwhatis.gif
Description: GIF image


Re: Devanagari

2002-01-20 Thread Barry Caplan

At 10:44 PM 1/20/2002 -0500, you wrote:
Taking the extra links into account the sizes are:
English: 10.4 Kb
Devanagari: 15.0 Kb
Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives
of documents/manuscripts (in plain text) in Devanagari, this factor could be
as high as approx. 3 using UTF-8 and around 1 using ISCII.


Yes, but that is this page only. Are you suggesting that all pages will 
vary by that factor? Of course not.

Please consider whether the space *in practice* is a limiting factor. It 
seems that folks on the list feel it is not. Not for bandwidth limited 
applications, and not for disk space limited applications.

The amount of space devoted to plain text of any language on a typical web 
page is microscopic compared to the markup, images, sounds, and other files 
also associated with the web page.

Are you suggesting that UTF-8 ought to have been optimized for Devanagari text?

Barry Caplan
www.i18n.com -- coming soon...






Re: Devanagari

2002-01-20 Thread Aman Chawla

- Original Message -
From: James Kass [EMAIL PROTECTED]
To: Aman Chawla [EMAIL PROTECTED]; Unicode
[EMAIL PROTECTED]
Sent: Monday, January 21, 2002 12:46 AM
Subject: Re: Devanagari


 25% may not be 300%, but it isn't insignificant.  As you note, if the
 mark-up were removed from both of those files, the percentage of
 increase would be slightly higher.  But, as connection speeds continue
 to improve, these differences are becoming almost minuscule.

With regards to South Asia, where the most widely used modems are approx. 14
kbps, maybe some 36 kbps and rarely 56 kbps, and where broadband/DSL is mostly
unheard of, efficiency in data transmission is of paramount importance...
how can we convince the South Asian user to create websites in an encoding
that would make his client's 14 kbps modem as effective (rather, as
ineffective) as a 4.6 kbps modem?





Re: Devanagari

2002-01-20 Thread David Starner

On Sun, Jan 20, 2002 at 10:44:00PM -0500, Aman Chawla wrote:
 For sites providing archives
 of documents/manuscripts (in plain text) in Devanagari, this factor could be
 as high as approx. 3 using UTF-8 and around 1 using ISCII.

Uncompressed, yes. It shouldn't be nearly as bad compressed - gzip, zip,
bzip2, or whatever your favorite tool is. You could also use UTF-16 or
SCSU, which will get it down to about 2 or about 1, respectively.
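The effect of compression is easy to check in Python (a repetitive sample, so real prose would compress somewhat less well):

```python
import zlib

# A repetitive Devanagari sample (real prose compresses less dramatically).
text = "यूनिकोड क्या है? " * 200
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
deflated = zlib.compress(utf8)

# UTF-16 is already a third smaller than UTF-8 for Devanagari-heavy text,
# and general-purpose compression shrinks it far further.
print(len(utf8), len(utf16), len(deflated))
```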

What's your point in continuing this? Most of the people on this list
already know how UTF-8 can expand the size of non-English text. There's
nothing we can do about it. Even if you had brought it up when UTF-8
was being designed, there's not much anyone could have done about it.
There is no simple encoding scheme that will encode Indic text in
Unicode in one byte per character. 

It's the pigeonhole principle in action - if you need to encode 150,000
characters, you can't encode each one in one or two bytes, and while you
can write encodings that approach that for normal text, they aren't
going to be simple or pretty.

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




Re: Devanagari

2002-01-20 Thread Aman Chawla



- Original Message -
From: "David Starner" [EMAIL PROTECTED]
To: "Aman Chawla" [EMAIL PROTECTED]
Cc: "James Kass" [EMAIL PROTECTED]; "Unicode" [EMAIL PROTECTED]
Sent: Monday, January 21, 2002 12:19 AM
Subject: Re: Devanagari

 What's your point in continuing this? Most of the people on this list
 already know how UTF-8 can expand the size of non-English text.

The issue was originally brought up to gather opinion from members of this
list as to whether UTF-8 or ISCII should be used for creating Devanagari web
pages. The point is not to criticise Unicode but to gather opinions of
informed persons (list members) and determine what is the best encoding for
information interchange in South-Asian scripts...


Re: Devanagari

2002-01-20 Thread Geoffrey Waigh

On Sun, 20 Jan 2002, Aman Chawla wrote:

 Taking the extra links into account the sizes are:
 English: 10.4 Kb
 Devanagari: 15.0 Kb
 Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives
 of documents/manuscripts (in plain text) in Devanagari, this factor could be
 as high as approx. 3 using UTF-8 and around 1 using ISCII.

Well a trivial adjustment is to use UTF-16 to store your documents if you
know they are going to be predominantly Devanagari.  Or if you have so much
text that the number of extra disks is going to be painful, use SCSU to
bring it very close to the ISCII ratio.  Of course I would note that you
can store millions of pages of plain text on a single hard disk these
days.  If you are going to be storing so many hundreds of millions of pages of
plain text that the number of extra disks is a bother, I am amazed that
none of it might be outside the ISCII repertoire.  And this huge document
archive has no graphics component to go with it...

But the real reason for publishing the data in Unicode on the web is so
people not using a machine specially configured for ISCII will still be
able to read and process the data.

[then later wrote:]

 With regards to South Asia, where the most widely used modems are
 approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where
 broadband/DSL is mostly unheard of, efficiency in data transmission is
 of paramount importance... how can we convince the south asian user to
 create websites in an encoding that would make his client's 14 kbps
 modem as effective (rather, ineffective) as a 4.6 kbps modem?

Can you read 500 characters per second?  So long as they are receiving
only plain text, even this dawdling speed is not going to impact them.
People wanting to efficiently transfer data will use a compression
program.

Geoffrey








Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 20:49:00 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 Usually, when someone offers
 a large body of plain text in any script, files are compressed 
 in one way or another in order to speed up downloads.

This is why I really wish that SCSU were considered a truly standard 
encoding scheme.  Even among the Unicode cognoscenti it is usually 
accompanied by disclaimers about private agreement only and not suitable 
for use on the Internet, where the former claim is only true because of the 
self-perpetuating obscurity of SCSU and the latter seems completely 
unjustified.

Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus 
an additional byte near the start of the file to set the current window (0x14 
= SC4).
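For text confined to ASCII plus the Devanagari block, that scheme can be sketched as a minimal encoder (not a full SCSU implementation; 0x14 is the SC4 tag and dynamic window 4 has a default offset of U+0900):

```python
def scsu_devanagari(text):
    """Minimal SCSU sketch: emit the SC4 tag (0x14) to select dynamic
    window 4 (default offset U+0900, the Devanagari block), then one
    byte per character.  Handles only ASCII and Devanagari; a real
    SCSU encoder covers much more (see UTR #6)."""
    out = bytearray([0x14])                  # SC4: switch to window 4
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x80 or ch in "\t\r\n":
            out.append(cp)                   # ASCII passes through
        elif 0x0900 <= cp <= 0x097F:
            out.append(cp - 0x0900 + 0x80)   # window-relative, high half
        else:
            raise ValueError(f"outside this sketch: U+{cp:04X}")
    return bytes(out)

# 8 SCSU bytes versus 19 UTF-8 bytes for the same seven characters.
print(len(scsu_devanagari("क्या है")), len("क्या है".encode("utf-8")))
```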

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 21:49:02 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 The issue was originally brought up to gather opinion from members of this
 list as to whether UTF-8 or ISCII should be used for creating Devanagari web
 pages. The point is not to criticise Unicode but to gather opinions of
 informed persons (list members) and determine what is the best encoding for 
 information interchange in South-Asian scripts...

It seems that the only point against Unicode compared to ISCII is the 
resulting document size in bytes, and this one point is being given 100% 
focus in the comparison.

If the actual question is, "What is the most efficient encoding for 
Devanagari text, in terms of bytes, using only the most commonly encountered 
encoding schemes and no external compression?" then of course you will have 
loaded the question in favor of ISCII.

But when you consider that more browsers today around the world (not just in 
India) are equipped to handle Unicode than ISCII, and that Unicode allows not 
only the encoding of ASCII and Devanagari but the full complement of Indic 
scripts (Oriya, Gujarati, Tamil...) as well as any other script on the planet 
that you could realistically want to encode, you will probably have to 
rethink the cost/benefit tradeoff of Unicode.

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread David Starner

On Mon, Jan 21, 2002 at 12:57:39AM -0500, [EMAIL PROTECTED] wrote:
 This is why I really wish that SCSU were considered a truly standard 
 encoding scheme.  Even among the Unicode cognoscenti it is usually 
 accompanied by disclaimers about private agreement only and not suitable 
 for use on the Internet, where the former claim is only true because of the 
 self-perpetuating obscurity of SCSU and the latter seems completely 
 unjustified.

Does Mozilla support it? If someone's willing to spend a little time,
adding it to Mozilla is one way to make it more generally useable. And
maybe then IE will get nudged into playing a little catchup . . .

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends