Re: length of text by different languages
Correction. I just checked my old Japanese moji (character)-to-English calculations, and I think 1.8-2.8 to 1 is a more realistic ratio than the 2.3-3.2 I mentioned. (Comparing this to the 1.4-1.8 to 1 that I use for Chinese would indicate that Chinese is slightly more "efficient" than Japanese.) Also, I compared the Japanese and English translations of the Bible (both done by the same source for the same general readership) and came up with a range of 1.9-2.29 to 1 for the moji-to-English conversion ratio. It varies depending on how I estimate the number of moji and the number of English characters per page.

Jon

--
Jon Babcock <[EMAIL PROTECTED]>
Re: length of text by different languages
Yung-Fong Tang wrote:
> Ram Viswanadha wrote:
> > There is also some information at
> > http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
> > Not sure if this is what you are looking for.
> Thanks, but not really. I am not looking into the ratio caused by the
> encoding, but rather the ratio caused by the language itself. For example,
> to communicate the idea "I want to eat chicken for dinner tonight", French
> and German text in the same encoding may use different numbers of
> characters to communicate the same "IDEA".

"Efficiency" here depends on the translation and varies widely. (See the example below.) That's why the practical experience of professional translators will probably provide the best answer. I have already mentioned what is, in my experience, the range for contemporary Japanese-English and Chinese-English. These ratios are important to J-E and C-E translators because we usually get paid by the English word, but it usually takes more work to use fewer words. So, if we don't want to be penalized for writing concise English, we try to charge by the character count of the Chinese or Japanese source text. To quote a rate to our clients, we must calculate what the "efficiency ratio" -- to coin a term here -- is for our translations in a particular field.

If you want to calculate this ratio yourself, I agree with your idea of using Bible translations, although the number of proper names may skew the results compared, for example, to technical translations. But it would be a good place to start.

One example, from thousands, found on yesterday's honyaku ML: イメージ合成写真です --> "simulated photograph" or "the photograph shown is for illustration only", i.e., from 20 to 45 characters in English, the target language. Decide how many bytes you're going to use to encode the Japanese and the English strings here, and you'll get the "efficiency ratio" in this case.

Jon

--
Jon Babcock <[EMAIL PROTECTED]>
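As a concrete illustration, a minimal Python sketch of the character- and byte-level version of this ratio, using the example strings from the post; the choice of UTF-8 for the byte counts is my assumption (UTF-16 or a legacy encoding would give different figures):

```python
# Compute the "efficiency ratio" for the honyaku example above:
# English target length relative to the Japanese source, measured
# both in characters and in UTF-8 bytes.

ja = "イメージ合成写真です"  # 10 characters, 30 bytes in UTF-8
candidates = [
    "simulated photograph",
    "the photograph shown is for illustration only",
]

for en in candidates:
    char_ratio = len(en) / len(ja)
    byte_ratio = len(en.encode("utf-8")) / len(ja.encode("utf-8"))
    print(f"{en!r}: char ratio {char_ratio:.2f}, UTF-8 byte ratio {byte_ratio:.2f}")
```

In characters the English is 2 to 4.5 times longer, but because each Japanese character costs 3 bytes in UTF-8, the byte-level ratio is much closer to even.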
Re: length of text by different languages
Ram Viswanadha wrote: There is also some information at http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results Not sure if this is what you are looking for. thanks. not really. I am not look into the ratio caused by encoding. But rather the ratio caused by language itself. For example, in order to communicate the idea "I want to eat chicken for dinner tonight", French, German using the same encoding may use different number of characters to communicate the same "IDEA". Misha's paper help a lot. but unfortunately it lack of japanese and German data.
Re: length of text by different languages
There is also some information at
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
Not sure if this is what you are looking for.

Regards,
Ram Viswanadha

- Original Message -
From: Yung-Fong Tang
To: Francois Yergeau
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 06, 2003 2:33 PM
Subject: Re: length of text by different languages

Francois Yergeau wrote:
> [EMAIL PROTECTED] wrote:
> > I remember there were some studies showing that although UTF-8 encodes
> > each Japanese/Chinese character in 3 bytes, Japanese/Chinese writing
> > usually uses FEWER characters to communicate information than
> > alphabet-based languages. Can anyone point me to such research?
> I don't know exactly what you want, but I vaguely remember a paper given
> at a Unicode conference long ago that compared various translations of
> the charter (or some such) of the Voice of America in a couple or three
> encodings. Hmm, let's see... could be this:
> http://www.unicode.org/iuc/iuc9/Friday2.html#b3
> Reuters Compression Scheme for Unicode (RCSU) Misha Wolf

Yeah, that could be it. I got a hard copy, and it looks like Fig. 2 is the one I am looking for.

> No paper online, alas. I remember that Chinese was a clear winner in
> terms of number of characters. In fact, I seem to remember that Chinese
> was so much denser that it still won after RCSU (now SCSU) compression,
> which would mean that a Han character contains more than twice as much
> info on average as a Latin letter as used in (say) English. This is all
> on pretty shaky ground, distant memories. Perhaps Misha still has the
> figures (if that's in fact the right paper).
Re: length of text by different languages
Thanks, everyone. But I want to point out that punctuation and the space character " " should also be considered in your future calculations. Japanese, Chinese, and Thai do not use " " between words, while Latin-based scripts (and Greek, Korean, Cyrillic, Arabic, Armenian, Georgian, etc.) do, and when estimating size, those characters should also be counted.
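A quick sketch of what such a count might look like. This is only an illustration: the punctuation test uses Python's ASCII-only `string.punctuation`, so CJK punctuation (、。「」 etc.) would need its own handling.

```python
import string

def char_count(text, include_spaces=True, include_punct=True):
    """Count characters, optionally skipping spaces and (ASCII) punctuation."""
    total = 0
    for ch in text:
        if ch.isspace() and not include_spaces:
            continue
        if ch in string.punctuation and not include_punct:
            continue
        total += 1
    return total

en = "I want to eat chicken for dinner tonight."
print(char_count(en))                        # spaces and punctuation included
print(char_count(en, include_spaces=False))  # drop the 7 word separators
```

For a space-delimited language, roughly one character in six of running text is a space, so whether spaces are counted noticeably shifts any cross-language ratio.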
Re: length of text by different languages
Francois Yergeau wrote:
> http://www.unicode.org/iuc/iuc9/Friday2.html#b3
> Reuters Compression Scheme for Unicode (RCSU) Misha Wolf

Unfortunately, there is no information about German or Japanese. :( It only has Chinese, Farsi, Urdu, Russian, Arabic, Hindi, Korean, Creole, Thai, French, Czech, Turkish, Polish, Armenian, Greek, English, Vietnamese, Albanian, and Spanish. Does anyone have data on those two languages (German and Japanese)?
Re: length of text by different languages
Francois Yergeau wrote:
> [EMAIL PROTECTED] wrote:
> > I remember there were some studies showing that although UTF-8 encodes
> > each Japanese/Chinese character in 3 bytes, Japanese/Chinese writing
> > usually uses FEWER characters to communicate information than
> > alphabet-based languages. Can anyone point me to such research?
> I don't know exactly what you want, but I vaguely remember a paper given
> at a Unicode conference long ago that compared various translations of
> the charter (or some such) of the Voice of America in a couple or three
> encodings. Hmm, let's see... could be this:
> http://www.unicode.org/iuc/iuc9/Friday2.html#b3
> Reuters Compression Scheme for Unicode (RCSU) Misha Wolf

Yeah, that could be it. I got a hard copy, and it looks like Fig. 2 is the one I am looking for.

> No paper online, alas. I remember that Chinese was a clear winner in
> terms of number of characters. In fact, I seem to remember that Chinese
> was so much denser that it still won after RCSU (now SCSU) compression,
> which would mean that a Han character contains more than twice as much
> info on average as a Latin letter as used in (say) English. This is all
> on pretty shaky ground, distant memories. Perhaps Misha still has the
> figures (if that's in fact the right paper).
Re: length of text by different languages
Yung-Fong Tang wrote:
> I remember there were some studies showing that although UTF-8 encodes
> each Japanese/Chinese character in 3 bytes, Japanese/Chinese writing
> usually uses FEWER characters to communicate information than
> alphabet-based languages.

For my commercial Japanese-to-English translation work, I estimate 2.3 to 3.2 Japanese characters per word of English, with an English word estimated at 6 characters. It varies depending on the kanji-to-kana ratio in the source text. For commercial contemporary Chinese-to-English translation, I estimate 1.4 to 1.8 Chinese characters per English word, again estimated at 6 characters. (I just asked about this on a mailing list devoted to C-E/E-C translation, and the one translator who responded said he uses 1.62 Chinese characters per English word, which agrees with my experience.)

Since a "word" is probably about the smallest chunk of meaning you're going to find, this would suggest that where it takes 6 bytes to encode a word of English at one byte per character, it will take about 4.2 to 5.4 bytes (1.4 to 1.8 characters at 3 bytes each) to encode a word of Chinese, I guess.

The above applies to contemporary (modern) traditional Chinese. I don't know if there is a practical difference in efficiency between traditional and simplified. But from my experience with classical Chinese, I would guess that most classical Chinese is at least twice as efficient as modern Chinese. (This, plus its freedom from any tight dependence on sound, facilitated its great success as the language of culture throughout the traditional kanji culture realm --- China, Korea, Japan, Vietnam, etc., imo.)

FWIW,
Jon

--
Jon Babcock <[EMAIL PROTECTED]>
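The bytes-per-word arithmetic can be sketched as follows. The UTF-8 byte widths and the 6-character English word are the assumptions carried over from the estimates above; nothing here is measured, it is only the back-of-the-envelope calculation made explicit:

```python
# Bytes needed to encode one "word" of meaning, assuming UTF-8
# (1 byte per ASCII character, 3 bytes per CJK character) and the
# characters-per-English-word ratios quoted in the post.

BYTES_PER_CJK_CHAR = 3   # UTF-8 width of a typical Han character or kana
ENGLISH_WORD_BYTES = 6   # ~6 one-byte characters per English word

def cjk_bytes_per_word(chars_per_english_word):
    """Bytes to encode the CJK equivalent of one 6-character English word."""
    return chars_per_english_word * BYTES_PER_CJK_CHAR

for label, lo, hi in [("Chinese", 1.4, 1.8), ("Japanese", 2.3, 3.2)]:
    print(f"{label}: {cjk_bytes_per_word(lo):.1f} to {cjk_bytes_per_word(hi):.1f} "
          f"bytes vs {ENGLISH_WORD_BYTES} bytes for English")
```

On these assumptions Chinese comes in under English at the byte level (4.2 to 5.4 bytes per word), while Japanese, with its higher kana content, comes in over (6.9 to 9.6 bytes per word).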
Re: length of text by different languages
Yung-Fong Tang wrote:
> I remember there were some studies showing that although UTF-8 encodes
> each Japanese/Chinese character in 3 bytes, Japanese/Chinese writing
> usually uses FEWER characters to communicate information than
> alphabet-based languages.
>
> Can anyone point me to such research? Martin, do you have some paper
> about that?

You are possibly thinking of a paper called "re-ordering.txt" by Bruce Thomson. In the IDN (internationalized domain name) working group, in late 2001, there was a proposal by Soobok Lee to improve the compression of domain names containing Hangul characters by reordering them so that the most common characters would be closer together. This was considered significant because of the 63-byte limit imposed on DNS labels. All IDN applications would have required huge mapping tables in order to implement this. Lee's proposal included reordering tables for other scripts, but it was obvious that his primary goal was to optimize compression for Hangul.

Thomson's paper was basically a distillation of the working group's arguments for and against Lee's reordering proposal. It was intended to be neutral, but it ended up refuting many of the pro-reordering arguments. One of Lee's claims was that Hangul was represented in Unicode in an unfairly inefficient way, because each Hangul syllable consumes 2 bytes in UTF-16 and 3 bytes in UTF-8, while direct encoding of jamos instead of syllables is even more inefficient. In response, Thomson wrote that the Book of Genesis in various languages requires:

  3088 characters in English, using ASCII
   778 characters in Chinese, using Han characters
  1201 characters in Korean, using Hangul syllables

and combined this data with the average compression achieved by AMC-ACE-Z (now called "Punycode") to derive meaningful comparisons. It stands to reason that a logographic or syllable-based encoding will pack more information into each code unit than an alphabetic encoding.
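Combining Thomson's character counts with the usual UTF-8 byte widths gives a rough density comparison. The character counts are the figures quoted above; the byte widths (1 for ASCII, 3 for Han characters and precomposed Hangul syllables) are standard UTF-8 values, not something taken from Thomson's paper:

```python
# Rough density comparison from the Genesis character counts quoted above.
# Tuple: (characters in the Genesis sample, UTF-8 bytes per character).

genesis = {
    "English (ASCII)": (3088, 1),
    "Chinese (Han)": (778, 3),
    "Korean (Hangul syllables)": (1201, 3),
}

english_chars = genesis["English (ASCII)"][0]
for lang, (chars, utf8_width) in genesis.items():
    density = english_chars / chars  # English characters per character here
    utf8_bytes = chars * utf8_width
    print(f"{lang}: {chars} chars, {density:.2f}x English density, "
          f"{utf8_bytes} bytes in UTF-8")
```

Note that even at 3 bytes per character, the Chinese text (2334 bytes) is still shorter than the English (3088 bytes), consistent with the recollection elsewhere in this thread that Chinese remained the winner even after compression, while the Korean text (3603 bytes) is not.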
I can provide a copy of Thomson's paper if Tang or anyone else is interested. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: length of text by different languages
[EMAIL PROTECTED] wrote:
> I remember there were some studies showing that although UTF-8 encodes
> each Japanese/Chinese character in 3 bytes, Japanese/Chinese writing
> usually uses FEWER characters to communicate information than
> alphabet-based languages.
>
> Can anyone point me to such research?

I don't know exactly what you want, but I vaguely remember a paper given at a Unicode conference long ago that compared various translations of the charter (or some such) of the Voice of America in a couple or three encodings. Hmm, let's see... could be this:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU) Misha Wolf

No paper online, alas. I remember that Chinese was a clear winner in terms of number of characters. In fact, I seem to remember that Chinese was so much denser that it still won after RCSU (now SCSU) compression, which would mean that a Han character contains more than twice as much info on average as a Latin letter as used in (say) English. This is all on pretty shaky ground, distant memories. Perhaps Misha still has the figures (if that's in fact the right paper).

--
François