Re: Devanagari
Aman Chawla wrote, I would be grateful if I could get opinions on the following:

1. Which encoding/character set is most suitable for using Hindi/Marathi (both of which use Devanagari) on the internet as well as in databases, and why? In your response, please refer to: http://www.iiit.net/ltrc/Publications/iscii_plugin_display.html, particularly the following paragraphs: snip

Unicode is the best. It is the world's standard for computer encoding and, as such, offers the best possibility that text can be exchanged around the globe and cross-platform. The arguments about relative size are true, but in this day and age are considered unimportant. Graphics files are extremely large in comparison with text files of any script, and so are sound files. Devanagari UTF-8 is three bytes per character. The four-byte UTF-8 sequences are so far only used for Plane One of Unicode and up.

3. With reference to the previous question, can programs that convert the myriad Devanagari encodings in use today to a standard encoding (question 1) be made freely available, and how?

Yes, converters exist and are being distributed. Just go to the Google search engine and input "character conversion Unicode" into the box. Look for ICU and Rosette, to name a few. You might even run across Mark Leisher's download page at http://crl.nmsu.edu/~mleisher/download.html and see the Perl script for converting the Naidunia Devanagari encoding to UTF-16.

4. Is there any search engine on the internet that maintains an up-to-date index of sites in Devanagari? If not, what can be done to encourage proprietary search engines to support Hindi?

Google supposedly has a Hindi language option, but, surprise, it's in Roman script! Several emails to them have elicited the response: "At the moment we don't support Devanagari..." This appears to be because Google is converting UTF-8 strings input to the search-words box into decimal NCRs. Pasting यूनिकोड क्या है into the Google box, it displays fine. Since the What is Unicode? pages are popular and have been up for a while, I thought they would have a good chance of being indexed. But there were no hits for the resulting search string:

&#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337; &#2325;&#2381;&#2351;&#2366; &#2361;&#2376;

...which is not surprising, since the actual page doesn't use NCRs. Best regards, James Kass.
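The NCR conversion James describes is easy to reproduce. Here is a minimal sketch in Python (illustrative only, not Google's actual code) that turns each character of the same यूनिकोड sample into a decimal numeric character reference:

```python
def to_decimal_ncrs(text):
    """Convert each character to an HTML decimal numeric character
    reference (NCR), e.g. 'य' (U+092F) becomes '&#2351;'."""
    return ''.join('&#%d;' % ord(ch) for ch in text)

# 'यूनिकोड' is Hindi for 'Unicode'; its first letter य is U+092F = 2351 decimal.
print(to_decimal_ncrs('यूनिकोड'))
# → &#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337;
```

Searching for the NCR string instead of the raw UTF-8 finds nothing, exactly as reported, because the indexed page contains the characters themselves, not their NCRs.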
Devanagari Rupee Symbol
I am unable to find the Devanagari Rupee sign encoded in Unicode. Is it encoded? If not, why not?
Re: The benefit of a symbol for 2 pi
Hi James, I appreciate the research, and the humor! 2 pis = peace, eh? (not on the unicode list! :-) but I like that, especially since the issue of a name has been problematic. e to the i peace = 1; circumference = peace times r; integral from zero to peace; period = peace over frequency. Has a nice ring to it! Peace, Bob

On Sat, 19 Jan 2002, James Kass wrote: Couldn't find such a glyph, but there are some that are vaguely similar: U+29B7 CIRCLED PARALLEL, U+238B BROKEN CIRCLE WITH NORTHWEST ARROW, U+229D CIRCLED DASH. And, with a note of mild humor, possibly the PEACE SIGN at U+262E might serve? It doesn't seem to be used much these days. With the vagaries of English plurals, perhaps peace is the proper plural for pi... Best regards, James Kass.

P.S. - With so-called smart fonts, which are really just OpenType fonts, a string such as the digit two followed by the Greek pi could be replaced in the display with a special glyph for 2pi or newpi. This would not alter the original file; it only affects the display. The procedure is called glyph substitution, and support for OpenType is growing.
Re: Devanagari Rupee Symbol
At 11:22 -0500 2002-01-20, Aman Chawla wrote: I am unable to find the Devanagari Rupee sign encoded in Unicode? Is it encoded? If not, why? U+20A8. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Working with a Unicode terminal
Hello. My current application works with 'Windows HyperTerminal' using an RS-232 cable from the client machine to the server machine (that is, the terminal runs on the client). Until today, the terminal sent 8-bit (char) characters from the client to the server. Now I need to send Unicode characters (in order to support languages other than English). Can I still use Windows HyperTerminal? How do I read such characters on the server machine? When using chars, I would read byte by byte. Is there any example code showing how Unicode characters are read? Any help will be welcome. Thanks, Dashut Alon.
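One common approach (a sketch only, not specific to HyperTerminal): send the text as UTF-8 over the serial line and feed the received bytes through an incremental decoder on the server, which buffers incomplete multi-byte sequences until their trailing bytes arrive. In Python, for illustration:

```python
import codecs

# Incremental decoder: safe to feed one byte at a time, just as a
# serial port delivers them; partial multi-byte sequences are buffered.
decoder = codecs.getincrementaldecoder('utf-8')()

received = b'\xe0\xa4\x95'  # UTF-8 for 'क' (U+0915), as read from the port
text = ''
for i in range(len(received)):
    # decode() returns '' until a complete character has accumulated
    text += decoder.decode(received[i:i+1])

print(text)  # → क
```

The same pattern works in any language: accumulate bytes, and only emit a character once a complete UTF-8 sequence (1 to 4 bytes) has been received.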
Re: Devanagari
At 12:48 AM 1/20/02 -0800, James Kass wrote: The arguments about relative size are true, but in this day and age are considered unimportant. Graphics files are extremely large in comparison with text files of any script and so are sound files. Devanagari UTF-8 is three bytes. The four byte UTF-8 sequences so far are only used for Plane One Unicode and up.

If the argument refers to 4-byte sequences for Devanagari, it is not factually 'true', as James points out. More to the point is the following observation: HTML and similar mark-up languages account for an ever-growing percentage of transmitted text - even in e-mail. The fact that UTF-8 economizes on the storage for ASCII characters is a benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and claims a significant fraction of the data. A UTF-8 encoded HTML file will therefore have (percentage-wise) less overhead for Devanagari than claimed. Add to that James' observation on graphics files, many of which accompany even the simplest HTML documents, and you get a percentage difference between the sizes of an English and a Devanagari website (i.e. in its entirety) that is well within the fluctuation of the typical length, in characters, of expressing the same concept in different languages. In other words, contrary to the claims made by the argument, it is hard to predict that this structure of UTF-8 will have an observable impact on exchanging data - other than psychological, perhaps.

In many size-constrained application areas it may pay off to do compression. http://www.unicode.org/unicode/reports/tr6 shows how one can compress Unicode data in Devanagari to a size comparable to that of 8-bit ISCII. However, interchange of this format (SCSU) requires consenting parties. A./
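The per-character sizes under discussion are easy to verify. A quick sketch (illustrative only): ASCII markup characters cost one byte each in UTF-8, while Devanagari letters cost three:

```python
# Character → expected UTF-8 byte length
samples = {
    '<': 1,   # ASCII markup character, one byte
    'a': 1,   # ASCII letter, one byte
    'क': 3,   # DEVANAGARI LETTER KA, U+0915, three bytes
    ' ': 1,   # SPACE stays one byte between Devanagari words too
}
for ch, expected in samples.items():
    n = len(ch.encode('utf-8'))
    assert n == expected
    print('U+%04X -> %d byte(s)' % (ord(ch), n))
```

So the factor of 3 applies only to the Devanagari plain-text runs, not to the ASCII tags, attributes, spaces, and line breaks interleaved with them.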
Re: Devanagari
The fact that UTF-8 economizes on the storage for ASCII characters, is a benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and claims a significant fraction of the data. A UTF-8 encoded HTML file, will therefore have (percentage-wise) less overhead for Devanagari as claimed. Add to that James' observation on graphics files, many of which accompany even the simplest HTML documents and you get a percentage difference between the sizes of an English and Devanagari website (i.e. in its entirety) that's well within the fluctuation of the typical length in characters, for expressing the same concept in different languages.

The point was that a UTF-8 encoded HTML file for an English web page carrying, say, 10 gifs would have a file size one-third that of a Devanagari web page with the same no. of gifs - even if you take into account the fluctuation of the typical length in characters for expressing the same concept in different languages. This is because in some cases one language may express a concept more compactly while in other cases it may not, and on the whole this effect would balance out and can therefore be neglected. Therefore transmission of a Devanagari web page over a network would take thrice as long as that of an English web page using the same images and presenting the same information.
Re: Devanagari
In a message dated 2002-01-20 16:49:17 Pacific Standard Time, [EMAIL PROTECTED] writes: The point was that a UTF-8 encoded HTML file for an English web page carrying say 10 gifs would have a file size one-third that for a Devanagari web page with the same no. of gifs... Therefore transmission of a Devanagari web page over a network would take thrice as long as that of an English web page using the same images and presenting the same information.

This conclusion ignores two obvious points, which Asmus already made: (1) The 10 GIFs, each of which may well be larger than the HTML file, take the same amount of space regardless of the encoding of the HTML file. The total number of bytes involved in transmitting a Web page includes everything, HTML and graphics, but the purported factor of 3 applies only to the HTML. (2) The markup in an HTML file, which comprises a significant portion of the file, is all ASCII. So the factor of 3 doesn't even apply to the entire HTML file, only to the plain-text content portion. In addition, text written in Devanagari includes plenty of instances of U+0020 SPACE, plus CR and/or LF, each of which occupies one byte regardless of the encoding.

I think before worrying about the performance and storage effect on Web pages due to UTF-8, it might help to do some profiling and see what the actual impact is. -Doug Ewell Fullerton, California
Re: Devanagari
On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote: : The point was that a UTF-8 encoded HTML file for an English web page : carrying say 10 gifs would have a file size one-third that for a Devanagari : web page with the same no. of gifs - even if you take into account the : fluctuation of the typical length in characters, for expressing the same : concept in different languages. This is because in some cases one language : may express a concept more compactly while in other cases it may not, and on : the whole this effect would balance out and can therefore be neglected. : Therefore transmission of a Devanagari web page over a network would take : thrice as long as that of an English web page using the same images and : presenting the same information. And the whole UTF-8 Devanagari page is probably still smaller than even one of the .gif files. -- Christopher Vance
Re: Devanagari
On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote: The point was that a UTF-8 encoded HTML file for an English web page carrying say 10 gifs would have a file size one-third that for a Devanagari web page with the same no. of gifs

The point is that if the text for a short web page is 10k for English and 30k for Devanagari, the HTML will be another 10k for English and another 10k for Devanagari, and the graphics will be another 30k for English and another 30k for Devanagari, meaning that the total will be 50k for English and 70k for Devanagari - a 40% markup, not 200%. Adding a 150k graphic would make it 200k for English and 220k for Devanagari, making it a 10% markup. -- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends
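David's arithmetic can be checked directly. A small sketch (the 10k/30k figures are his hypothetical sizes, not measurements):

```python
def overhead(english_parts, devanagari_parts):
    """Percentage size increase of the Devanagari page over the English one,
    given lists of component sizes (text, markup, graphics) in kilobytes."""
    e, d = sum(english_parts), sum(devanagari_parts)
    return round(100 * (d - e) / e)

# text, HTML markup, graphics (kB): only the text tripling differs
print(overhead([10, 10, 30], [30, 10, 30]))            # → 40
# ...with an extra 150k graphic on both pages
print(overhead([10, 10, 30, 150], [30, 10, 30, 150]))  # → 10
```

The fixed-size components (markup and images) dilute the text's factor of 3, so the whole-page overhead shrinks as the non-text share grows.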
FrameMaker+SGML 6.0, Content Management and Unicode
Does anyone know whether FrameMaker+SGML 6.0 supports Unicode? If not, do standard content management tools such as FrameLink do the conversion to Unicode before storing their data in their repository (Documentum for instance)? If you know of such solutions, I would love to hear from you. Patrick Andries
Re: Devanagari
Doug Ewell wrote, I think before worrying about the performance and storage effect on Web pages due to UTF-8, it might help to do some profiling and see what the actual impact is.

The What is Unicode? pages offer a quick study.

14808 bytes (English)
15218 bytes (Hindi)
10808 bytes (Danish)
11281 bytes (French)
9682 bytes (Chinese Trad.)

(The English page includes links to all the other scripts, but the individual script pages only link back to the English page. So, the English page is a bit larger than the other pages for this reason; not a fair test if we only count the English and Hindi pages.)

The Unicode logo gif at the top left corner of each of these pages takes bytes. A screen shot of the beginning of the Hindi page takes 37569 bytes as a gif; the small portion cropped and attached takes 4939 bytes.

The What is Unicode? pages are at: http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Best regards, James Kass.

[Attachment: hindiwhatis.gif - GIF image]
Re: Devanagari
At 10:44 PM 1/20/2002 -0500, you wrote: Taking the extra links into account the sizes are: English: 10.4 Kb Devanagari: 15.0 Kb Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII.

Yes, but that is this page only. Are you suggesting that all pages will vary by that factor? Of course not. Please consider whether the space *in practice* is a limiting factor. It seems that folks on the list feel it is not - not for bandwidth-limited applications, and not for disk-space-limited applications. The amount of space devoted to plain text of any language on a typical web page is microscopic compared to the markup, images, sounds, and other files also associated with the web page. Are you suggesting that UTF-8 ought to have been optimized for Devanagari text? Barry Caplan www.i18n.com -- coming soon...
Re: Devanagari
- Original Message - From: James Kass [EMAIL PROTECTED] To: Aman Chawla [EMAIL PROTECTED]; Unicode [EMAIL PROTECTED] Sent: Monday, January 21, 2002 12:46 AM Subject: Re: Devanagari

25% may not be 300%, but it isn't insignificant. As you note, if the mark-up were removed from both of those files, the percentage of increase would be slightly higher. But, as connection speeds continue to improve, these differences are becoming almost minuscule.

With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, and where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the South Asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem?
Re: Devanagari
On Sun, Jan 20, 2002 at 10:44:00PM -0500, Aman Chawla wrote: For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII. Uncompressed, yes. It shouldn't be nearly as bad compressed - gzip, zip, bzip2, or whatever your favorite tool is. You could also use UTF-16 or SCSU, which will get it down to about 2 or about 1, respectively. What's your point in continuing this? Most of the people on this list already know how UTF-8 can expand the size of non-English text. There's nothing we can do about it. Even if you had brought it up when UTF-8 was being designed, there's not much anyone could have done about it. There is no simple encoding scheme that will encode Indic text in Unicode in one byte per character. It's the pigeonhole principle in action - if you need to encode 150,000 characters, you can't encode each one in one or two bytes, and while you can write encodings that approach that for normal text, they aren't going to be simple or pretty. -- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends
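To illustrate the compression point: general-purpose compressors largely erase the UTF-8 size penalty, because the lead bytes of Devanagari sequences are highly repetitive. A sketch using Python's zlib (the same DEFLATE algorithm that gzip uses); the sample text is an arbitrary repeated phrase standing in for a document:

```python
import zlib

# Arbitrary sample: repeat a short Hindi phrase to simulate a document.
text = 'यूनिकोड क्या है? ' * 200
raw = text.encode('utf-8')          # 3 bytes per Devanagari character
packed = zlib.compress(raw, 9)      # level 9 = maximum compression

print(len(raw), len(packed))
# The compressed size is a small fraction of the raw UTF-8 size;
# for repetitive text like this it is far below even the ISCII size.
assert len(packed) < len(raw) // 3
```

Real prose compresses less dramatically than a repeated phrase, but the byte-level redundancy of UTF-8 Devanagari still makes it compress proportionally better than the equivalent ISCII would.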
Re: Devanagari
- Original Message - From: "David Starner" [EMAIL PROTECTED] To: "Aman Chawla" [EMAIL PROTECTED] Cc: "James Kass" [EMAIL PROTECTED]; "Unicode" [EMAIL PROTECTED] Sent: Monday, January 21, 2002 12:19 AM Subject: Re: Devanagari

What's your point in continuing this? Most of the people on this list already know how UTF-8 can expand the size of non-English text.

The issue was originally brought up to gather opinion from members of this list as to whether UTF-8 or ISCII should be used for creating Devanagari web pages. The point is not to criticise Unicode but to gather opinions of informed persons (list members) and determine what is the best encoding for information interchange in South-Asian scripts...
Re: Devanagari
On Sun, 20 Jan 2002, Aman Chawla wrote: Taking the extra links into account the sizes are: English: 10.4 Kb Devanagari: 15.0 Kb Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII.

Well, a trivial adjustment is to use UTF-16 to store your documents if you know they are going to be predominantly Devanagari. Or, if you have so much text that the number of extra disks is going to be painful, use SCSU to bring it very close to the ISCII ratio. Of course, I would note that you can store millions of pages of plain text on a single hard disk these days. If you are going to be storing so many hundreds of millions of pages of plain text that the number of extra disks is a bother, I am amazed that none of it might be outside the ISCII repertoire. And this huge document archive has no graphics component to go with it...

But the real reason for publishing the data in Unicode on the web is so that people not using a machine specially configured for ISCII will still be able to read and process the data.

[then later wrote:] With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the south asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem?

Can you read 500 characters per second? So long as they are receiving only plain text, even this dawdling speed is not going to impact them. People wanting to efficiently transfer data will use a compression program. Geoffrey
Re: Devanagari
In a message dated 2002-01-20 20:49:00 Pacific Standard Time, [EMAIL PROTECTED] writes: Usually, when someone offers a large body of plain text in any script, files are compressed in one way or another in order to speed up downloads. This is why I really wish that SCSU were considered a truly standard encoding scheme. Even among the Unicode cognoscenti it is usually accompanied by disclaimers about private agreement only and not suitable for use on the Internet, where the former claim is only true because of the self-perpetuating obscurity of SCSU and the latter seems completely unjustified. Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus an additional byte near the start of the file to set the current window (0x14 = SC4). -Doug Ewell Fullerton, California
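Doug's one-byte-per-character claim can be sketched concretely. Per UTR #6, SCSU's dynamic window 4 defaults to base U+0900 (Devanagari); the tag byte 0x14 (SC4) selects it, after which each character in U+0900..U+097F is a single byte (its offset in the window, with the high bit set). A toy encoder, illustrative only and handling just this one case:

```python
def scsu_encode_devanagari(text):
    """Toy SCSU encoder for pure-Devanagari text (U+0900..U+097F only).
    Emits SC4 (0x14) to select default dynamic window 4 (base U+0900),
    then one byte per character: 0x80 + (code point - 0x0900)."""
    out = bytearray([0x14])  # SC4: switch to dynamic window 4
    for ch in text:
        cp = ord(ch)
        if not 0x0900 <= cp <= 0x097F:
            raise ValueError('this toy encoder handles only Devanagari')
        out.append(0x80 + (cp - 0x0900))
    return bytes(out)

encoded = scsu_encode_devanagari('यूनिकोड')  # 7 characters
print(len(encoded))  # → 8 (one tag byte + one byte per character)
```

A full SCSU encoder also passes ASCII (including space, CR, LF) through as single bytes and can switch or define windows mid-stream, so mixed Devanagari/ASCII text stays close to one byte per character throughout.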
Re: Devanagari
In a message dated 2002-01-20 21:49:02 Pacific Standard Time, [EMAIL PROTECTED] writes: The issue was originally brought up to gather opinion from members of this list as to whether UTF-8 or ISCII should be used for creating Devanagari web pages. The point is not to criticise Unicode but to gather opinions of informed persons (list members) and determine what is the best encoding for information interchange in South-Asian scripts... It seems that the only point against Unicode compared to ISCII is the resulting document size in bytes, and this one point is being given 100% focus in the comparison. If the actual question is, What is the most efficient encoding for Devanagari text, in terms of bytes, using only the most commonly encountered encoding schemes and no external compression? then of course you will have loaded the question in favor of ISCII. But when you consider that more browsers today around the world (not just in India) are equipped to handle Unicode than ISCII, and that Unicode allows not only the encoding of ASCII and Devanagari but the full complement of Indic scripts (Oriya, Gujarati, Tamil...) as well as any other script on the planet that you could realistically want to encode, you will probably have to rethink the cost/benefit tradeoff of Unicode. -Doug Ewell Fullerton, California
Re: Devanagari
On Mon, Jan 21, 2002 at 12:57:39AM -0500, [EMAIL PROTECTED] wrote: This is why I really wish that SCSU were considered a truly standard encoding scheme. Even among the Unicode cognoscenti it is usually accompanied by disclaimers about private agreement only and not suitable for use on the Internet, where the former claim is only true because of the self-perpetuating obscurity of SCSU and the latter seems completely unjustified.

Does Mozilla support it? If someone's willing to spend a little time, adding it to Mozilla is one way to make it more generally usable. And maybe then IE will get nudged into playing a little catchup . . . -- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends