Re: length of text by different languages

2003-03-08 Thread Jon Babcock
Correction.

I just checked my old Japanese moji (character)-to-English
calculations and I think 1.8-2.8 to 1 is a more realistic ratio
than the 2.3-3.2 I mentioned. (Comparing this to the 1.4-1.8 to
1 that I use for Chinese would indicate that Chinese is slightly
more "efficient" than Japanese.)

Also, I compared the Japanese and English translations of the
Bible (both done by the same source for the same general
readership), and came up with from 1.9 to 2.29 to 1 as the
moji-to-English conversion ratio. It varies depending on how I
estimate the number of moji and the number of English characters
per page.
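
(If you want to reproduce this kind of estimate, a minimal Python
sketch follows. The file names are hypothetical, and the choice of
what to count as a character is precisely where the variation in the
ratio comes from.)

    import unicodedata

    def count_moji(path):
        """Count characters in the Japanese text, skipping whitespace
        and punctuation."""
        with open(path, encoding="utf-8") as f:
            text = f.read()
        return sum(1 for ch in text
                   if not ch.isspace()
                   and not unicodedata.category(ch).startswith("P"))

    def count_english_words(path):
        """Count whitespace-delimited words in the English text."""
        with open(path, encoding="utf-8") as f:
            return len(f.read().split())

    # Hypothetical parallel texts of the same passage.
    moji = count_moji("bible_ja.txt")
    words = count_english_words("bible_en.txt")
    print(f"{moji / words:.2f} moji per English word")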

Jon

--
Jon Babcock <[EMAIL PROTECTED]>



Re: length of text by different languages

2003-03-08 Thread Jon Babcock
Yung-Fong Tang wrote:

> Ram Viswanadha wrote:
>
> > There is also some information at
> > http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
> > Not sure if this is what you are looking for.
>
> thanks. not really. I am not looking into the ratio caused by the
> encoding, but rather the ratio caused by the language itself. For
> example, in order to communicate the idea "I want to eat chicken for
> dinner tonight", French and German texts using the same encoding may
> use different numbers of characters to communicate the same "IDEA".
"Efficency" here is dependent on the translation and varies 
widely. (See example below.) That's why the practical experience 
of professional translators will probably provide the best 
answer. I have already mentioned what is, in my experience, the 
range for contemporary Japanese-English and Chinese-English.

These ratios are important to JE and CE translators because we 
usually get paid by the English word. But it usually takes more 
work to use less words. So, if we don't want to be penalized for 
using concise English, we try to charge by the character count 
in the Chinese or Japanese source text. To quote a rate to our 
clients, we must calculate what the "efficiency ratio" -- to 
coin a term here -- is for our translations in this particular 
field.

If you want to calculate this ratio yourself, I agree with your
idea of using Bible translations, although the number of proper
names may skew the results compared, for example, to technical
translations. But it would be a good place to start.

One example, from thousands, found on yesterday's honyaku ML:

イメージ合成写真です --> "simulated photograph" or "the
photograph shown is for illustration only", i.e., from 21 to 45
characters in English, the target language. Decide how many
bytes you're going to use to encode the Japanese and the English
strings here, and you'll get the "efficiency ratio" in this case.
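
(A minimal Python sketch of that byte-count comparison, taking UTF-8
and UTF-16 as two plausible encoding choices:)

    source = "イメージ合成写真です"   # 10 characters in the Japanese source
    targets = [
        "simulated photograph",
        "the photograph shown is for illustration only",
    ]
    for enc in ("utf-8", "utf-16-le"):
        src_bytes = len(source.encode(enc))
        for tgt in targets:
            tgt_bytes = len(tgt.encode(enc))
            print(f"{enc}: {tgt_bytes} English bytes / {src_bytes} "
                  f"Japanese bytes = {tgt_bytes / src_bytes:.2f}")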

Jon





--
Jon Babcock <[EMAIL PROTECTED]>



Re: length of text by different languages

2003-03-07 Thread Yung-Fong Tang
Ram Viswanadha wrote:

> There is also some information at
> http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
>
> Not sure if this is what you are looking for.
thanks. not really. I am not looking into the ratio caused by the
encoding, but rather the ratio caused by the language itself. For
example, in order to communicate the idea "I want to eat chicken for
dinner tonight", French and German texts using the same encoding may
use different numbers of characters to communicate the same "IDEA".
Misha's paper helps a lot, but unfortunately it lacks Japanese and
German data.
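
(To make the distinction concrete, a small Python sketch follows. The
French and German renderings are rough translations supplied here for
illustration only, not authoritative ones.)

    # Same idea, same encoding (UTF-8), different languages.
    versions = {
        "English": "I want to eat chicken for dinner tonight",
        "French":  "Je veux manger du poulet au dîner ce soir",
        "German":  "Ich möchte heute Abend Hähnchen zum Abendessen essen",
    }
    for lang, text in versions.items():
        print(f"{lang}: {len(text)} characters, "
              f"{len(text.encode('utf-8'))} UTF-8 bytes")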





Re: length of text by different languages

2003-03-06 Thread Ram Viswanadha
There is also some information at
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results

Not sure if this is what you are looking for.

Regards,

Ram Viswanadha

- Original Message -
From: Yung-Fong Tang
To: Francois Yergeau
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 06, 2003 2:33 PM
Subject: Re: length of text by different languages

Francois Yergeau wrote:
> [EMAIL PROTECTED] wrote:
>
> > I remember there were some studies showing that although UTF-8 encodes
> > each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually
> > use FEWER characters in writing to communicate information than
> > alphabet-based languages.
> >
> > Can anyone point me to such research?
>
> I don't know of exactly what you want, but I vaguely remember a paper
> given at a Unicode conference long ago that compared various translations
> of the charter (or some such) of the Voice of America in a couple or
> three encodings.  Hmm, let's see... could be this:
>
> http://www.unicode.org/iuc/iuc9/Friday2.html#b3
> Reuters Compression Scheme for Unicode (RCSU)
> Misha Wolf

yea. That could be it. I got a hard copy and it looks like Fig. 2 is the
one I am looking for.

> No paper online, alas.  I remember that Chinese was a clear winner in
> terms of # of characters.  In fact, I kind of remember that Chinese was
> so much denser that it still won after RCSU (now SCSU) compression, which
> would mean that a Han character contains more than twice as much info on
> average as a Latin letter as used in (say) English.
>
> This is all on pretty shaky ground, distant memories.  Perhaps Misha
> still has the figures (if that's in fact the right paper).


Re: length of text by different languages

2003-03-06 Thread Yung-Fong Tang
thanks, everyone. But I want to point out that punctuation and the
space character should also be considered in your future calculations.
Japanese, Chinese, and Thai do not use " " between words, while
Latin-based scripts (and Greek, Korean, Cyrillic, Arabic, Armenian,
Georgian, etc.) do, so when estimating size, those characters should
also be counted.
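
(A small Python sketch of this point; the Chinese rendering is a rough
illustrative translation:)

    import unicodedata

    def char_counts(text):
        """Return (total, total excluding whitespace and punctuation)."""
        total = len(text)
        spaces = sum(ch.isspace() for ch in text)
        punct = sum(unicodedata.category(ch).startswith("P") for ch in text)
        return total, total - spaces - punct

    samples = {
        "English": "I want to eat chicken for dinner tonight.",
        "Chinese": "我今晚晚餐想吃鸡。",
    }
    for lang, text in samples.items():
        total, bare = char_counts(text)
        print(f"{lang}: {total} chars with spaces/punct, {bare} without")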




Re: length of text by different languages

2003-03-06 Thread Yung-Fong Tang
Francois Yergeau wrote:

> http://www.unicode.org/iuc/iuc9/Friday2.html#b3
> Reuters Compression Scheme for Unicode (RCSU)
> Misha Wolf

Unfortunately, there is no information about German or Japanese. :(

It only has Chinese, Farsi, Urdu, Russian, Arabic, Hindi, Korean,
Creole, Thai, French, Czech, Turkish, Polish, Armenian, Greek, English,
Vietnamese, Albanian, and Spanish.

Does anyone have data about those two languages (German and Japanese)?






Re: length of text by different languages

2003-03-06 Thread Yung-Fong Tang
Francois Yergeau wrote:

> [EMAIL PROTECTED] wrote:
>
> > I remember there were some studies showing that although UTF-8 encodes
> > each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually
> > use FEWER characters in writing to communicate information than
> > alphabet-based languages.
> >
> > Can anyone point me to such research?
>
> I don't know of exactly what you want, but I vaguely remember a paper
> given at a Unicode conference long ago that compared various translations
> of the charter (or some such) of the Voice of America in a couple or
> three encodings.  Hmm, let's see... could be this:
>
> http://www.unicode.org/iuc/iuc9/Friday2.html#b3
> Reuters Compression Scheme for Unicode (RCSU)
> Misha Wolf

yea. That could be it. I got a hard copy and it looks like Fig. 2 is the
one I am looking for.

> No paper online, alas.  I remember that Chinese was a clear winner in
> terms of # of characters.  In fact, I kind of remember that Chinese was
> so much denser that it still won after RCSU (now SCSU) compression, which
> would mean that a Han character contains more than twice as much info on
> average as a Latin letter as used in (say) English.
>
> This is all on pretty shaky ground, distant memories.  Perhaps Misha
> still has the figures (if that's in fact the right paper).






Re: length of text by different languages

2003-03-06 Thread Jon Babcock
Yung-Fong Tang wrote:

> I remember there were some studies showing that although UTF-8 encodes
> each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually
> use FEWER characters in writing to communicate information than
> alphabet-based languages.
For my commercial Japanese-to-English translation work, I
estimate from 2.3 to 3.2 Japanese characters for one word of
English, estimated at 6 characters. It varies depending on the
kanji-to-kana ratio in the source text.

For commercial contemporary Chinese-to-English translation, I 
estimate 1.4 to 1.8 Chinese characters per English word, 
estimated at 6 characters. (I just asked about this on a mailing 
list devoted to C-E/E-C translation and the one translator who 
responded said he uses 1.62 Chinese characters per English word,
which agrees with my experience.)

Since a "word" is probably about the smallest chunk of meaning 
you're going find, this would suggest that where it takes 6 
bytes to encode a word of English at one-byte per character, at 
3 bytes per character, it will take from about 4.3 to 3.3 bytes 
to encode a word of Chinese, I guess.
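
(The same arithmetic, spelled out in a few lines of Python:)

    ENGLISH_BYTES = 6 * 1            # ~6 characters per word at 1 byte each
    for chars_per_word in (1.4, 1.8):
        chinese_bytes = chars_per_word * 3   # 3 UTF-8 bytes per Han character
        print(f"{chars_per_word} Chinese chars/word -> "
              f"{chinese_bytes:.1f} bytes, vs {ENGLISH_BYTES} for English")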

The above applies to contemporary (modern) traditional Chinese.
I don't know if there is a practical difference in efficiency
between traditional and simplified. But from my experience with
classical Chinese, I would guess that most classical Chinese is
at least twice as efficient as modern Chinese. (This, plus its
freedom from any tight dependence on sound, facilitated its
great success as the language of culture throughout the
traditional kanji culture realm -- China, Korea, Japan,
Vietnam, etc., imo.)

FWIW,

Jon

--
Jon Babcock <[EMAIL PROTECTED]>



Re: length of text by different languages

2003-03-06 Thread Doug Ewell
Yung-Fong Tang wrote:

> I remember there were some studies showing that although UTF-8 encodes
> each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually
> use FEWER characters in writing to communicate information than
> alphabet-based languages.
>
> Can anyone point me to such research? Martin, do you have a paper
> about that?

You are possibly thinking of a paper called "re-ordering.txt" by Bruce
Thomson.

In the IDN (internationalized domain name) working group, in late 2001,
there was a proposal by Soobok Lee to improve the compression of domain
names containing Hangul characters by reordering them so that the most
common characters would be closer together.  This was considered
significant because of the 63-byte limit imposed on DNS labels.  All IDN
applications would have required huge mapping tables in order to
implement this.  Lee's proposal included reordering tables for other
scripts, but it was obvious that his primary goal was to optimize
compression for Hangul.

Thomson's paper was basically a distillation of the working group's
arguments for and against Lee's reordering proposal.  It was intended to
be neutral, but ended up refuting many of the pro-reordering arguments.

One of Lee's claims was that Hangul was represented in Unicode in an
unfairly inefficient way, because each Hangul syllable consumes 2 bytes
in UTF-16 and 3 bytes in UTF-8, while direct encoding of jamos instead
of syllables is even more inefficient.  In response, Thomson wrote that
the Book of Genesis in various languages requires:

3088 characters in English using ASCII
778 characters in Chinese using Han characters
1201 characters in Korean using Hangul syllables

and combined this data with the average compression achieved by
AMC-ACE-Z (now called "Punycode") to derive meaningful comparisons.

It stands to reason that a logographic or syllable-based encoding will
pack more information into each code unit than an alphabetic encoding.
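
(A quick Python sketch that turns the figures quoted above into
ratios; the UTF-8 bytes-per-character assumptions -- 1 for ASCII, 3
for Han characters and Hangul syllables -- are the usual ones.)

    genesis_chars = {"English": 3088, "Chinese": 778, "Korean": 1201}
    utf8_bytes_per_char = {"English": 1, "Chinese": 3, "Korean": 3}

    for lang, chars in genesis_chars.items():
        # English characters represented per character of this language.
        density = genesis_chars["English"] / chars
        utf8_total = chars * utf8_bytes_per_char[lang]
        print(f"{lang}: {chars} chars (density {density:.1f}x vs English), "
              f"~{utf8_total} UTF-8 bytes")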

I can provide a copy of Thomson's paper if Tang or anyone else is
interested.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




RE: length of text by different languages

2003-03-05 Thread Francois Yergeau
[EMAIL PROTECTED] wrote:
> I remember there were some studies showing that although UTF-8 encodes
> each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually
> use FEWER characters in writing to communicate information than
> alphabet-based languages.
>
> Can anyone point me to such research?

I don't know of exactly what you want, but I vaguely remember a paper given
at a Unicode conference long ago that compared various translations of the
charter (or some such) of the Voice of America in a couple or three
encodings.  Hmm, let's see... could be this:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU) 
Misha Wolf

No paper online, alas.  I remember that Chinese was a clear winner in terms
of # of characters.  In fact, I kind of remember that Chinese was so much
denser that it still won after RCSU (now SCSU) compression, which would mean
that a Han character contains more than twice as much info on average as a
Latin letter as used in (say) English.

This is all on pretty shaky ground, distant memories.  Perhaps Misha still
has the figures (if that's in fact the right paper).

-- 
François