Re: Devanagari
- Original Message -
From: "David Starner" <[EMAIL PROTECTED]>
To: "Aman Chawla" <[EMAIL PROTECTED]>
Cc: "James Kass" <[EMAIL PROTECTED]>; "Unicode" <[EMAIL PROTECTED]>
Sent: Monday, January 21, 2002 12:19 AM
Subject: Re: Devanagari

> What's your point in continuing this? Most of the people on this list
> already know how UTF-8 can expand the size of non-English text.

The issue was originally brought up to gather opinion from members of this list as to whether UTF-8 or ISCII should be used for creating Devanagari web pages. The point is not to criticise Unicode but to gather opinions of informed persons (list members) and determine what is the best encoding for information interchange in South-Asian scripts...
Re: Devanagari
- Original Message -
From: "James Kass" <[EMAIL PROTECTED]>
To: "Aman Chawla" <[EMAIL PROTECTED]>; "Unicode" <[EMAIL PROTECTED]>
Sent: Monday, January 21, 2002 12:46 AM
Subject: Re: Devanagari

> 25% may not be 300%, but it isn't insignificant. As you note, if the
> mark-up were removed from both of those files, the percentage of
> increase would be slightly higher. But, as connection speeds continue
> to improve, these differences are becoming almost minuscule.

With regard to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, and where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... How can we convince the South Asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem?
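The 4.6 kbps figure follows from dividing the line rate by the size expansion. A minimal sketch of that arithmetic, assuming the worst case claimed in this thread (a threefold increase for pure Devanagari text) and the 25% figure quoted above for a marked-up page; it ignores images and modem compression, and the function name is made up for illustration:

```python
def effective_rate_kbps(line_rate_kbps: float, expansion_factor: float) -> float:
    """Rate at which the original-size content arrives when the payload
    is expansion_factor times larger than it would otherwise be."""
    return line_rate_kbps / expansion_factor


# Modem speeds mentioned in the message above.
for rate in (14.0, 36.0, 56.0):
    worst = effective_rate_kbps(rate, 3.0)    # pure Devanagari text, 3x claim
    mixed = effective_rate_kbps(rate, 1.25)   # marked-up page, 25% increase
    print(f"{rate:g} kbps line: {worst:.2f} kbps at 3x, {mixed:.2f} kbps at 1.25x")
```

The gap between the two columns is what the disagreement in this thread comes down to: which expansion factor is typical for real pages.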
Re: Devanagari
Taking the extra links into account, the sizes are:

English: 10.4 KB
Devanagari: 15.0 KB

Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII.

- Original Message -
From: "James Kass" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, January 20, 2002 11:01 PM
Subject: Re: Devanagari

> Doug Ewell wrote,
>
> > I think before worrying about the performance and storage effect on Web pages
> > due to UTF-8, it might help to do some profiling and see what the actual
> > impact is.
>
> The "What is Unicode?" pages offer a quick study.
>
> 14808 bytes (English)
> 15218 bytes (Hindi)
> 10808 bytes (Danish)
> 11281 bytes (French)
> 9682 bytes (Chinese Trad.)
>
> (The English page includes links to all the other scripts, but the individual
> script pages only link back to the English page. So, the English page is a
> bit larger than the other pages for this reason, not a fair test if we only
> count the English and Hindi pages.)
>
> The Unicode logo gif at the top left corner of each of these pages takes
> bytes. A screen shot of the beginning of the Hindi page takes
> 37569 bytes as a gif, the small portion cropped and attached takes
> 4939 bytes.
>
> The "What is Unicode?" pages are at:
> http://www.unicode.org/unicode/standard/WhatIsUnicode.html
>
> Best regards,
>
> James Kass.
Re: Devanagari
> The fact that UTF-8 economizes on the storage for ASCII characters, is a
> benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and
> claims a significant fraction of the data.
> A UTF-8 encoded HTML file, will therefore have (percentage-wise) less overhead
> for Devanagari as claimed. Add to that James' observation on graphics files,
> many of which accompany even the simplest HTML documents and you get a
> percentage difference between the sizes of an English and Devanagari website
> (i.e. in its entirety) that's well within the fluctuation of the typical
> length in characters, for expressing the same concept in different languages.

The point was that a UTF-8 encoded HTML file for an English web page carrying say 10 gifs would have a file size one-third that for a Devanagari web page with the same no. of gifs - even if you take into account the fluctuation of the typical length in characters, for expressing the same concept in different languages. This is because in some cases one language may express a concept more compactly while in other cases it may not, and on the whole this effect would balance out and can therefore be neglected. Therefore transmission of a Devanagari web page over a network would take thrice as long as that of an English web page using the same images and presenting the same information.
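The two positions above can be compared numerically. The sketch below is only an illustration: it assumes ASCII-only markup, approximates an ISCII-like encoding as one byte per character (Python ships no ISCII codec, and real ISCII needs more than one byte for some characters), and uses a made-up function name and made-up sample sizes.

```python
def page_size_ratio(markup: str, text: str, image_bytes: int = 0) -> float:
    """Total page size in UTF-8 divided by its size in a hypothetical
    one-byte-per-character encoding. Image bytes are identical in both."""
    utf8_total = len(markup.encode("utf-8")) + len(text.encode("utf-8")) + image_bytes
    single_byte_total = len(markup) + len(text) + image_bytes
    return utf8_total / single_byte_total


hindi_text = "यूनिकोड क्या है? " * 200          # sample Devanagari body text
html_markup = "<p class='body'></p>\n" * 100   # stand-in for ASCII-only markup

print("text only:           ", round(page_size_ratio("", hindi_text), 2))
print("text + markup:       ", round(page_size_ratio(html_markup, hindi_text), 2))
print("text + markup + gifs:", round(page_size_ratio(html_markup, hindi_text, 10 * 5000), 2))
```

The ratio approaches 3 only for plain Devanagari text; every byte of markup or image data that is the same in both encodings pulls it back toward 1.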
Devanagari Rupee Symbol
I am unable to find the Devanagari Rupee sign in Unicode. Is it encoded? If not, why not?
Devanagari
I would be grateful if I could get opinions on the following:

1. Which encoding/character set is most suitable for using Hindi/Marathi (both of which use Devanagari) on the internet as well as in databases, and why? In your response, please refer to: http://www.iiit.net/ltrc/Publications/iscii_plugin_display.html, particularly the following paragraphs:

"Many people hope that the standardization problem will get solved because of Unicode. However there is an issue of transmission efficiency. The transmission cost for Indian languages will be three times that of English! The real culprit being UTF-8. UTF-8 converts Unicode two-byte codes to byte sequence of one to four bytes. In the process they make sure that ASCII part of the Unicode is transmitted as single byte only. So for a language like English which uses only 0-127 part of the code there is no overhead. European languages use only a few character codes in the region 128-255 in addition to 0-127 part. So in the case of the Europian languages the transmission of this portion may incur some overhead say of the order of 10%. In contrast to above cases Indian languages use no part of the code in region 0-127. Secondly Indian character codes occupy less than 127 codes for each language. So what could have been transmitted in one byte if one uses ASCII will be transmitted in a sequence of two to four bytes. This amounts to extra overhead of 200%!"

2. Related to question 1, what can be done to encourage/force the use of a standardised encoding for Devanagari on the Internet?

3. With reference to the previous question, can programs that convert the myriad Devanagari encodings in use today to a standard encoding (question 1) be made freely available, and how?

4. Is there any search engine on the internet that maintains an up-to-date index of sites in Devanagari? If not, what can be done to encourage proprietary search engines to support Hindi? Google supposedly has a Hindi language option, but surprise, it's in Roman script! Several emails to them have elicited the response: "At the moment we don't support Devanagari..."

Thanks,
Aman Chawla
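The per-character cost described in the quoted paragraph can be measured directly. A minimal sketch using Python's built-in UTF-8 encoder; the sample strings and the function name are illustrative choices, not taken from the cited page:

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "What is Unicode?",
    "German": "Ein schöner Tag für Übungen",
    "Hindi": "यूनिकोड क्या है?",
}

for language, text in samples.items():
    print(f"{language}: {len(text)} code points, "
          f"{len(text.encode('utf-8'))} bytes, "
          f"{utf8_bytes_per_char(text):.2f} bytes per code point")

# Every code point in the Devanagari block (U+0900..U+097F) takes three bytes in UTF-8.
devanagari_widths = {len(chr(cp).encode("utf-8")) for cp in range(0x0900, 0x0980)}
print("UTF-8 widths in the Devanagari block:", devanagari_widths)
```

Spaces, digits and ASCII punctuation inside Devanagari text still take one byte each, so running text averages somewhat under three bytes per code point.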
Unicode Devanagari Range
This is with reference to the Unicode Devanagari (Hindi) range. Is there a way to override the automatic glyph substitution that occurs when one types a pure consonant (e.g. U+0926 द) + halant (U+094D ्) + another consonant (U+0918 घ)? When one types this sequence, one gets a combined glyph द्घ that is extremely difficult to read and is often avoided in Devanagari printing for the sake of legibility. This glyph can easily be mistaken for द्ध, which is (द + ् + ध). To minimise confusion, all Devanagari newspapers print the sequence as it is, without substituting the complex combined glyph.
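The sequences in question can be written out code point by code point. The sketch below also builds a variant with ZERO WIDTH NON-JOINER (U+200C) after the halant, which the Unicode Standard describes as the way to request the explicit-halant form rather than a conjunct. Whether a particular font and rendering engine honour it is an assumption that has to be checked on the target system; the variable and function names are illustrative.

```python
import unicodedata

DA, GHA, DHA = "\u0926", "\u0918", "\u0927"
VIRAMA = "\u094D"   # halant
ZWNJ = "\u200C"     # zero width non-joiner

conjunct_gha = DA + VIRAMA + GHA          # renderer may substitute the combined glyph
conjunct_dha = DA + VIRAMA + DHA          # the look-alike conjunct
explicit_gha = DA + VIRAMA + ZWNJ + GHA   # asks for a visible halant, no conjunct


def describe(text: str) -> list[str]:
    """List each code point of text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, seq in (("conjunct", conjunct_gha),
                   ("look-alike", conjunct_dha),
                   ("explicit halant", explicit_gha)):
    print(label, seq)
    for line in describe(seq):
        print("   ", line)
```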
Unicode Search Engines
Are there any search engines at all at present which allow one to search sites encoded in UTF-8? If not, are there plans to build such search engines? For example, is Google going to implement such an engine? Aman Chawla
Re: Hindi characters for transcribing the sound "e"
>> This is the kind of thing I am looking for: a 'special composite matra' to write a new sound in Hindi, imported from English.

>> I don't believe it exists. But what is your goal? Trying to give an idea of how English is spoken to Hindi readers? I'm not sure a new or very rare character would really help.

My goal is to accurately transcribe English words such as 'get', 'bed' etc. into Hindi. Just as for Bengali a special character can be used to represent a sound not present in the language, similarly there should be (hopefully) a special character for this English sound.

Also, are there any words in Hindi that use ऎ DEVANAGARI LETTER SHORT E or its corresponding diacritic mark ॆ? I personally have never come across one. Maybe this diacritic gives the sound of the "e" in bed or led?

- Original Message -
From: Patrick Andries
To: Aman Chawla
Cc: Unicode
Sent: Tuesday, January 15, 2002 7:10 PM
Subject: Re: Hindi characters for transcribing the sound "e"

Aman Chawla wrote:

> Thanks for the response Patrick. I understand your last sentence: the closest you can come to /ɛ/ is using ै

Yes, and I believe there is variability in the pronunciation of this grapheme among Hindi speakers. As mentioned, some authors say it is an /ɛ/ (open), some say it is a diphthong (such as English "rail"). There is nothing strange about this.

Compare the pronunciation given on these two different sites: http://www.avashy.com/script/greendemo1.html (the woman pronounces the letter in isolation differently from the man, but both say /ɛ/ in aisâ) and the diphthong produced here: http://faculty.maxwell.syr.edu/jishnu/101/alphabet/sounds/018ei.wav (found on http://faculty.maxwell.syr.edu/jishnu/101/alphabet/default.asp?section=0).

You say it is /ae/ (I take it) as in "shall"; this is corroborated by William Bright (op. cit.), but Ohala writes in her article that /ae/ only occurs in English loan words such as "bat" (cricket bat)... Knowing French phonology and its own diversity quite well, I would assume the same applies to Hindi: the same letters are pronounced differently in different regions or even social classes.

> However, in the response given to the following FAQ: http://www.unicode.org/unicode/faq/indic.html#13 you will find this sentence: "This zophola_aa can be seen as a special "composite" matra to write a new Bengali sound, imported from English."
>
> This is the kind of thing I am looking for: a 'special composite matra' to write a new sound in Hindi, imported from English.

I don't believe it exists. But what is your goal? Trying to give an idea of how English is spoken to Hindi readers? I'm not sure a new or very rare character would really help.

> Mark Davis suggests that: "I just checked with the ICU online demo at http://oss.software.ibm.com/cgi-bin/icu/tr , and "e" is transliterated as U+090E "ऎ" DEVANAGARI LETTER SHORT E*. "

One has to distinguish between transcription and transliteration. A transliteration only allows one to preserve the original spelling in the absence of the original alphabet. It does not indicate how a letter should be pronounced (see the various pronunciations of the English "e" in "we", "red", "the", "new", "bottle/some", "clerk"), and this was your original question: "how do I represent in Devanâgarî the English SOUND found in "red", "bed"?" A transliteration is of no help; a transcription is.

Patrick A.
Re: Hindi characters for transcribing the sound "e"
Thanks for the response Patrick. I understand your last sentence: the closest you can come to /ɛ/ is using ै

However, in the response given to the following FAQ: http://www.unicode.org/unicode/faq/indic.html#13 you will find this sentence: "This zophola_aa can be seen as a special "composite" matra to write a new Bengali sound, imported from English."

This is the kind of thing I am looking for: a 'special composite matra' to write a new sound in Hindi, imported from English.

Mark Davis suggests that: "I just checked with the ICU online demo at http://oss.software.ibm.com/cgi-bin/icu/tr, and "e" is transliterated as U+090E "ऎ" DEVANAGARI LETTER SHORT E*. "
Re: Hindi characters for transcribing the sound "e"
Yes, I am a native speaker. The Hindi word for dirt is मैल. The vowel in this word sounds like the vowel in the English word 'shall' (also, like the first vowel sound in the English word 'rally') and not at all like the vowel in 'bed' or 'red'. However, the Hindi word for harmony, which is मेल, has a vowel which does sound like the vowel sound in the English word 'bake'. In any case, the vowel sound that I am talking about is neither the one in 'shall' nor the one in 'bake', but rather the one in 'bed', 'red', 'said', etc.

- Original Message -
From: Patrick Andries
To: Aman Chawla
Sent: Tuesday, January 15, 2002 1:42 AM
Subject: Re: Hindi characters for transcribing the sound "e"

Do you speak Hindi? Does the word for dirt have a vowel that sounds like bed/red? Does the word for harmony have one that sounds like bake? How do they sound to you?

If these pairs of words do not sound alike, Manjari Ohala is wrong in her article about Hindi phonetics in the Handbook of the International Phonetic Association.

Aman Chawla wrote:

> Actually, I am not talking about the sound in hay or bake or the Hindi words for dirt or harmony. Rather, the sound in bed, red, dead, led, fed, said, etc.
>
> - Original Message -
> From: Patrick Andries
> To: Aman Chawla
> Sent: Monday, January 14, 2002 10:24 PM
> Subject: Re: Hindi characters for transcribing the sound "e"
>
> Aman Chawla wrote:
>
> > With reference to the FAQ: http://www.unicode.org/unicode/faq/indic.html#13 , I would like to know what are the Hindi characters used to transcribe the sound "e" (as in English "bet", "bed", "red" etc.) in Unicode. Thanks
>
> English vowels, I'm not too sure about them. Let's see, are you speaking of the sound "e" (the open mid-front unrounded vowel) as in Hindi /mɛl/ ("dirt" according to my sources) and not /mel/ ("harmony")? I believe it is often transliterated "ai" and could be transcribed back in Hindi with ऐ or ै (U+0910, U+0948 as a diacritic), although it is originally a diphthong. The English closed mid-front unrounded vowel /e/ (as in hay or bake) would be transcribed with U+090E ऎ.
>
> Patrick Andries
Re: Hindi characters for transcribing the sound "e"
The demo doesn't seem to be particularly reliable. For instance, the following English words all have the same vowel sound: red, said, dead, led, shed, fed. However, the demo gave the following Latin-Devanagari outputs: रॆद् , सैद् , दॆअद् , लॆद् , शॆद् , फ़ॆद्

First of all, the ending 'd' sound in all the English words is ड् and not द् as given by the demo. Secondly, though 'said' and 'red' have the same vowel sound (not character, but sound), the demo gave two different Hindi diacritics. Hindi is phonetic and so each diacritic has one and only one sound. I am looking for a transcription (by sound) of the "e" sound in bed, red, get etc. into a Hindi character.

- Original Message -
From: Mark Davis
To: Aman Chawla; Unicode
Sent: Monday, January 14, 2002 9:04 PM
Subject: Re: Hindi characters for transcribing the sound "e"

There are two different processes: transliteration (which is by letter) and transcription (which is by sound). If transliteration is what you mean, I just checked with the ICU online demo at http://oss.software.ibm.com/cgi-bin/icu/tr, and "e" is transliterated as U+090E "ऎ" DEVANAGARI LETTER SHORT E*.

ICU transliteration for Devanagari is based on ISCII (for the exact composition, see the last section of http://oss.software.ibm.com/icu/userguide/Transliteration.html, called "Script Transliteration Sources").

Mark

* I use the demo fairly often simply to get hex converted to and from characters, and characters converted to and from names.

— Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

- Original Message -
From: Aman Chawla
To: Unicode
Sent: Monday, January 14, 2002 05:48
Subject: Hindi characters for transcribing the sound "e"

With reference to the FAQ: http://www.unicode.org/unicode/faq/indic.html#13, I would like to know what are the Hindi characters used to transcribe the sound "e" (as in English "bet", "bed", "red" etc.) in Unicode.

Thanks
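The difference between the two demo outputs for 'red' and 'said' shows up when the strings are listed code point by code point. A small sketch using Python's unicodedata module; the two strings are the demo outputs reported in this message, entered as escapes, and the helper name is illustrative:

```python
import unicodedata

# Demo outputs as reported in the message above.
demo_output = {
    "red": "\u0930\u0946\u0926\u094D",    # रॆद्
    "said": "\u0938\u0948\u0926\u094D",   # सैद्
}


def vowel_signs(text: str) -> list[str]:
    """Names of the dependent vowel signs (matras) that occur in text."""
    return [unicodedata.name(ch) for ch in text
            if "VOWEL SIGN" in unicodedata.name(ch)]


for english, devanagari in demo_output.items():
    print(english, "->", devanagari)
    for ch in devanagari:
        print(f"    U+{ord(ch):04X} {unicodedata.name(ch)}")
    print("    vowel signs:", vowel_signs(devanagari))
```

The letter-by-letter mapping explains the mismatch: the Latin spellings "e" and "ai" differ, so a transliterator produces different vowel signs even though the English vowel sound is the same.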
Re: Hindi characters for transcribing the sound "e"
Actually, I am not talking about the sound in hay or bake or the Hindi words for dirt or harmony. Rather, the sound in bed, red, dead, led, fed, said, etc.

- Original Message -
From: Patrick Andries
To: Aman Chawla
Sent: Monday, January 14, 2002 10:24 PM
Subject: Re: Hindi characters for transcribing the sound "e"

Aman Chawla wrote:

> With reference to the FAQ: http://www.unicode.org/unicode/faq/indic.html#13 , I would like to know what are the Hindi characters used to transcribe the sound "e" (as in English "bet", "bed", "red" etc.) in Unicode. Thanks

English vowels, I'm not too sure about them. Let's see, are you speaking of the sound "e" (the open mid-front unrounded vowel) as in Hindi /mɛl/ ("dirt" according to my sources) and not /mel/ ("harmony")? I believe it is often transliterated "ai" and could be transcribed back in Hindi with ऐ or ै (U+0910, U+0948 as a diacritic), although it is originally a diphthong. The English closed mid-front unrounded vowel /e/ (as in hay or bake) would be transcribed with U+090E ऎ.

Patrick Andries
Hindi characters for transcribing the sound "e"
With reference to the FAQ: http://www.unicode.org/unicode/faq/indic.html#13, I would like to know what are the Hindi characters used to transcribe the sound "e" (as in English "bet", "bed", "red" etc.) in Unicode. Thanks