Looking for information on the UnicodeData file
I apologize if this question has been asked before, but I'm relatively new at this. My question is: where can I find formal definitions of the terms used in the Character Name field of the UnicodeData.txt file? Most specifically, precise explanations of designations like "turned", "inverse", "inverted", "reversed", "rotated" etc. Also the difference between "digraph" and "ligature", etc. Although I've searched the FAQ files and the rest of the unicode.org site, I haven't been able to find this info as yet. This site is huge! So can anyone provide me with an URL? Thanks. Pim Blokland
Re: Caron / Hacek?
John Hudson wrote: In the Slovak orthography, the lowercase d, l and t are normally written with the 'apostrophe' form of the accent. Then why does UnicodeData break them down as (e.g.) 0064 030C rather than 0064 0315? Pim Blokland
Re: The display of *kholam* on PCs
Chris Jacobs wrote at 12:54 AM on Wednesday, March 5, 2003: But why do you call the kholam a high left dot? As far as I know it can appear high left or middle, to indicate that it should be pronounced after the consonant, or right, to pronounce it before. So the meaning of a shin with two dots above it is ambiguous, In classical Hebrew KHOLEM always represents a trailing vowel, i.e. it is always pronounced after the consonant over which it is written. [In fact I can't think of ANY vowel sign in classical Hebrew which represents a pronunciation that precedes the consonant to which it is associated, ignoring, for obvious reasons, written/read (kethib/qere) orthographies, where the vowels indicate what is to be read in spite of the consonants that are written.] And so the graphemic sequence SHIN KHOLEM is never ambiguous in classical Hebrew. (I don't know about modern Israeli Hebrew.) About the only unusual orthographic phenomenon I can think of related to KHOLEM is that when it occurs after SIN it shares the same dot with SIN. Respectfully, Dean A. Snyder Scholarly Technology Specialist Center For Scholarly Resources, Sheridan Libraries Garrett Room, MSE Library, 3400 N. Charles St. The Johns Hopkins University Baltimore, Maryland, USA 21218 office: 410 516-6850 mobile: 410 245-7168 fax: 410-516-6229 Manager, Digital Hammurabi Project: www.jhu.edu/digitalhammurabi Manager, Initiative for Cuneiform Encoding: www.jhu.edu/ice
Re: Khmer encoding model (had no subject)
Quoting Marco Cimarosti: Mijan wrote: [...] 3. There are no other cases of a Vowel+Virama combination in the Unicode encoding model. Yes, there are. Khmer. I do not understand Khmer but I see that it does not use the same 'encoding model'. Please look, you will see that you were wrong to use Khmer as an example. What do you mean by not using the same encoding model? There are actually three Indic scripts that have been encoded with a different model: Tibetan (subscript letters are encoded separately, rather than as combinations of virama + consonant), and Thai/Lao (reordrant vowel marks are encoded in visual order, rather than in phonetic order). But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the same way as the scripts of India. Thank you for the correction. I said I do not understand Khmer. My understanding was that scripts not based on ISCII used a different encoding model. Mijan
RE: Reph and Khmer encoding model
Quoting Kent Karlsson: I understand that unicode is supposed to represent the language, not the way it is written. No, Unicode is supposed to be able to represent the written form. (Of course.) Yes, I was wrong! I think I wanted to say something like, Unicode is supposed to be able to represent the written language with logically equivalent code points. (Because the argument is, what is logically equivalent to ya-phalaa) Mijan form ... Let's consider the ra+virama+ya case. For the most part the ra+virama+ya is displayed as ya+reph. This obviously seems to be an instance of ambiguous interpretation because ra+virama+ya could also represent ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have different meanings. From this you see that ja-phalaa is not equivalent to virama-ya and is better as a separate letter in Unicode. We always thought of ya-phalaa as separate anyway. 3. There are no other cases of a Vowel+Virama combination in the Unicode encoding model. Yes, there are. Khmer. I do not understand Khmer but I see that it does not use the same 'encoding model'. Please look, you will see that you were wrong to use Khmer as an example. Khmer uses the same encoding model as most other Indic scripts, except for one point: the reph is represented via a combining character (which also means that it does not come in logical order in the text representation), so the ambiguity you refer to does not exist for Khmer. Further, Khmer could have been represented in a Tibetan-like encoding model (but isn't). Further, IIRC, independent vowels can both be subscripted (before virama/coeng) and be subscripts (after virama/coeng) in Khmer. The latter is orthographically different from using dependent vowels. /kent k
Re: Caron / Hacek?
Pim Blokland scripsit: Then why does UnicodeData break them down as (e.g.) 0064 030C rather than 0064 0315? To keep the upper case and lower case characters in sync for decomposition, they always have the same combining characters. For another example, G with cedilla gets the cedilla on top when it's a capital, but it still decomposes to the ordinary combining cedilla. These are essentially font-ligaturing issues. -- John Cowan http://www.ccil.org/~cowan To say that Bilbo's breath was taken away is no description at all. There are no words left to express his staggerment, since Men changed the language that they learned of elves in the days when all the world was wonderful. --The Hobbit
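The decompositions discussed above can be checked against the copy of the Unicode Character Database that ships with Python. This is a minimal sketch; the helper name `combining_marks` is made up for illustration, and the output depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def combining_marks(ch):
    """Return the code points that follow the base letter in the
    canonical decomposition of ch (hypothetical helper)."""
    return unicodedata.decomposition(ch).split()[1:]

# D/d with caron and G/g with cedilla, the two pairs named in the thread.
for ch in ["\u010E", "\u010F", "\u0122", "\u0123"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"{unicodedata.decomposition(ch)}")

# Upper and lower case share the same combining mark, whatever the glyph looks like.
print(combining_marks("\u010E") == combining_marks("\u010F"))
print(combining_marks("\u0122") == combining_marks("\u0123"))
```

The apostrophe-like shape of the caron on d, l and t is then left to the font, which is consistent with the "font-ligaturing" remark above.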
Ya-phalaa
Mijan, Unicode has a mechanism for producing the ya-phalaa conjunct, namely by preceding the ya with virama. This works also in the unusual situation where the consonant the ya-phalaa modifies is an independent vowel. A + VIRAMA + YA + -AA (this is aa-yaphalaa) RA + VIRAMA + ZWJ + YA (this is the reph-ya) RA + VIRAMA + YA (this is the ra-yaphalaa) There are analogous examples of this use of ZWJ in Malayalam and Devanagari. -- Michael Everson * * Everson Typography * * http://www.evertype.com
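For readers who want to try the three sequences in a renderer, they can be spelled out as code points. This is only a sketch of the message as written: the labels are copied from it, later replies in the thread dispute which sequence should mean what, and the variable names are chosen here for illustration.

```python
import unicodedata

# Bengali letters and signs named in the message.
A, RA, YA = "\u0985", "\u09B0", "\u09AF"
VIRAMA, AA_SIGN = "\u09CD", "\u09BE"
ZWJ = "\u200D"

# Labels follow the message above; they are its proposal, not settled fact.
sequences = {
    "aa-yaphalaa": A + VIRAMA + YA + AA_SIGN,
    "reph-ya": RA + VIRAMA + ZWJ + YA,
    "ra-yaphalaa": RA + VIRAMA + YA,
}

for label, text in sequences.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in text)
    names = ", ".join(unicodedata.name(c) for c in text)
    print(f"{label}: {code_points} ({names})")
```

How each string actually displays depends on the font and shaping engine in use.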
FAQ entry (was: Looking for information on the UnicodeData file)
I've reformatted Pim Blokland's question as a Unicode FAQ. Q: What do the terms turned, inverted, reversed, rotated, inverse, digraph, and ligature used in the names of Unicode characters mean? A: These terms are basically typographical rather than Unicode-specific. A turned character is one that has been rotated 180 degrees around its center. A turned e winds up with the opening in the upper left portion. U+0259 LATIN SMALL LETTER SCHWA is a turned e. An inverted character has been flipped along the horizontal axis. An inverted e winds up with the opening in the upper right portion. There is no Unicode character representing an inverted e. A reversed character has been flipped along the vertical axis. A reversed e winds up with the opening in the lower left portion. U+0258 LATIN SMALL LETTER REVERSED E is a reversed e. A rotated character has been rotated 90 degrees, but one can't tell which way without looking at the glyph. U+213A ROTATED CAPITAL Q is a Q that has been rotated counterclockwise. Inverse means that the white parts of the glyph are made black, and vice versa. An inverse e looks like a normal e but is white on a black background. There is no Unicode character representing an inverse e. Digraphs and ligatures are both made by combining two glyphs. In a digraph, the glyphs remain separate but are placed close together. In a ligature, the glyphs are fused into a single glyph. -- John Cowan http://www.ccil.org/~cowan http://www.reutershealth.com A mosquito cried out in his pain, / A chemist has poisoned my brain! / The cause of his sorrow / Was para-dichloro- / Diphenyltrichloroethane. (aka DDT)
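The names cited in the FAQ entry can be looked up directly, and the same lookup gives a rough way to list other characters whose names use one of these terms. This is a sketch using Python's `unicodedata` module; the helper `names_containing` and its default range are assumptions made for illustration.

```python
import unicodedata

# The three characters cited in the FAQ answer.
for cp in (0x0259, 0x0258, 0x213A):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

def names_containing(term, start=0x0000, stop=0x3000):
    """Collect (code point, name) pairs in [start, stop) whose name
    contains term as a whole word (hypothetical helper)."""
    hits = []
    for cp in range(start, stop):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if term in name.split():
            hits.append((cp, name))
    return hits

for cp, name in names_containing("TURNED")[:5]:
    print(f"U+{cp:04X} {name}")
```

Note that U+0259 is described as a turned e although its name says SCHWA, so a name search will not find every character of a given shape.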
RE: Ya-phalaa
At 17:41 + 2003-03-05, Andy White wrote: Unicode has a mechanism for producing the ya-phalaa conjunct, namely by preceding the ya with virama. This works also in the unusual situation where the consonant the ya-phalaa modifies is an independent vowel. A + VIRAMA + YA + -AA (this is aa-yaphalaa) RA + VIRAMA + ZWJ + YA (this is the reph-ya) RA + VIRAMA + YA (this is the ra-yaphalaa) I said that I was not going to discuss this with you any further. I can now no longer resist! :-) Saying RA + VIRAMA + ZWJ + YA = reph-ya will not be acceptable. Implementing this will break all existing implementations. All current fonts and Bengali Unicode texts rely on Ra+Virama+Ya as being representative of the more common reph-ya. Moreover, RA + VIRAMA + YA cannot represent Ra-yaphalaa as Ra+Virama is relied upon as being representative of Reph. For example, in the Indic OpenType specifications, you will see that a Ra+Virama is recognised as reph before any other processing is applied. If this is the case (and one would like corroboration) then simply reverse the two. The solution is the same. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: The display of *kholam* on PCs
At 07:57 AM 3/5/2003, Dean Snyder wrote: About the only unusual orthographic phenomenon I can think of related to KHOLEM is that when it occurs after SIN it shares the same dot with SIN. Not always. I have not done a close analysis of manuscript sources, but I wouldn't be surprised to find that this practice is largely due to technical limitations in older typesetting systems and/or the conventions of particular script styles. The question was raised recently during our development of a set of fonts for biblical scholarship: I told the clients they had a choice of whether to combine the holam and sin dots or to have them separate. The clear preference was to have them separate. This was possible because, following the convention of the sephardic style on which the new font is based, the sin and shin dots do not sit at the *extreme* left and right of the shin letter, so there is a little extra space into which to insert a holam. This would be more difficult in an ashkenazic style, and particularly difficult in older typesetting systems that would not allow dynamic adjustment of holam relative to other marks. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force. - Michael Apostolis, 1467
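Whether the holam and the sin dot are drawn merged or apart is a rendering choice; in stored text they are separate combining characters. The sketch below lists the relevant code points with their canonical combining classes and shows how normalization orders them. The variable names are chosen here for illustration.

```python
import unicodedata

SHIN, SHIN_DOT, SIN_DOT, HOLAM = "\u05E9", "\u05C1", "\u05C2", "\u05B9"

for ch in (SHIN, SHIN_DOT, SIN_DOT, HOLAM):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"ccc={unicodedata.combining(ch)}")

# SIN followed by holam, typed as letter + sin dot + holam.
typed = SHIN + SIN_DOT + HOLAM
nfd = unicodedata.normalize("NFD", typed)
print([f"U+{ord(c):04X}" for c in nfd])
```

Because the two marks have different combining classes, both input orders normalize to the same sequence, so a font can position them separately or together without any change to the text.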
RE: Ya-phalaa
Michael Everson wrote: [...] RA + VIRAMA + ZWJ + YA (this is the reph-ya) RA + VIRAMA + YA (this is the ra-yaphalaa) [...] ... in the Indic OpenType specifications, you will see that a Ra+Virama is recognised as reph before any other processing is applied. [...] If this is the case (and one would like corroboration) then simply reverse the two. The solution is the same. Once a botch is implemented then others will surely follow. Replacing your original botch with yet another will make the encoding model into nothing other than a hack. Seeing as one would like corroboration, there seems to be no point in me wasting time by going into details. IMHO, TUS needs solid rules; Exceptions, hacks, patches, or workarounds should definitely be avoided wherever possible. (If you care to look back in the mailing list archives a few years, you will see that the a+Virama+Ya+aa kludge was originally proposed as a workaround due to the lack of a separate encoded letter) Andy
RE: Ya-phalaa
Andy, the ya-phalaa is a presentation form of conjoined YA, which is produced in Unicode by the sequence VIRAMA + YA. Encoding it as anything else makes very little sense at all. However it is pronounced today in Bengali, and however weird you feel about its being applied to an initial vowel, the fact is that it is still a presentation form of conjoined YA, and it should be encoded as such. Consider the fact that the Bhagavadgita is available in Sanskrit in Bengali script. This will certainly contain many, many examples of consonant clusters in -YA. These will all be encoded as VIRAMA + YA, not as some independent form of ya-phalaa. It is easy to point fingers about a mismatch that someone like me makes, but the Unicode encoding model for Indic scripts is very robust, and we do our best to apply it correctly. Your proposed combining ya-phalaa will do Bengali no service, as it will introduce multiple spellings for consonant clusters in -YA. I have already stated on this forum: For example, in Sanskrit and Bengali, we have the word pratyeka 'each, every'. This is derived from the Sanskrit root prati (expressing likeness or comparison) plus eka 'one'. In Sanskrit orthography i + e becomes ye and is so written. Now in Bengali this word also exists and in both languages what is written is PA + VIRAMA + RA + TA + VIRAMA + YA + E + KA. It would be absurd -- and wrong -- to spell the Sanskrit word one way and the Bengali word another, especially as it is the same word. IMHO, TUS needs solid rules; Exceptions, hacks, patches, or workarounds should definitely be avoided wherever possible. (If you care to look back in the mailing list archives a few years, you will see that the a+Virama+Ya+aa kludge was originally proposed as a workaround due to the lack of a separate encoded letter) It isn't a kludge. It is a consistent application of the rules. Ya-phalaa is a presentation form of YA in conjunction with a preceding consonant or -- a Bengali innovation -- an independent vowel.
In keeping this stance, Andy, I am defending the Unicode Standard's encoding principles. The Indic encoding model is constantly under attack from people who want explicit rephas, explicit half-forms, explicit ya-phalaas, and all sorts of other explicit things, which were we to encode them would make the standard very much worse than it is. To reiterate our consistency in using this model, I will give you a Malayalam example. NA + VIRAMA + MA -- NMA (a single conjunct) NA + VIRAMA + ZWNJ + MA -- NMA (with a visible virama breve above and between) NA + VIRAMA + ZWJ + MA -- NMA (with the cillaks.aram virama curl) We prefer to apply this consistency to Bengali as well. Thank you for correcting my error earlier. That kind of feedback is helpful. Beating us up because you don't like our encoding model isn't. -- Michael Everson * * Everson Typography * * http://www.evertype.com
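The Malayalam example can be written out as code points to show that the three variants differ only in the joiner. This is a sketch of the sequences as given in the message; the dictionary keys paraphrase its descriptions and `strip_joiners` is a helper invented for illustration.

```python
import unicodedata

NA, VIRAMA, MA = "\u0D28", "\u0D4D", "\u0D2E"
ZWNJ, ZWJ = "\u200C", "\u200D"

variants = {
    "single conjunct": NA + VIRAMA + MA,
    "visible virama": NA + VIRAMA + ZWNJ + MA,
    "cillaksharam form": NA + VIRAMA + ZWJ + MA,
}

def strip_joiners(text):
    """Remove ZWJ and ZWNJ, leaving only the letters and signs."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")

for label, text in variants.items():
    print(label, " ".join(f"U+{ord(c):04X}" for c in text))

# All three spell the same underlying cluster once the joiners are removed.
print({strip_joiners(text) for text in variants.values()} == {NA + VIRAMA + MA})
```

A search or collation routine that ignores the joiners would therefore treat the three spellings as the same cluster, which is the consistency the message argues for.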
Re: The display of *kholam* on PCs
Chris Jacobs wrote at 7:27 PM on Wednesday, March 5, 2003: Chris Jacobs wrote at 12:54 AM on Wednesday, March 5, 2003: But why do you call the kholam a high left dot? As far as I know it can appear high left or middle, to indicate that it should be pronounced after the consonant, or right, to pronounce it before. So the meaning of a shin with two dots above it is ambiguous, In classical Hebrew KHOLEM always represents a trailing vowel, i.e. it is always pronounced after the consonant over which it is written. [In fact I can't think of ANY vowel sign in classical Hebrew which represents a pronunciation that precedes the consonant to which it is associated, ignoring, for obvious reasons, written/read (kethib/qere) orthographies, where the vowels indicate what is to be read in spite of the consonants that are written.] And so the graphemic sequence SHIN KHOLEM is never ambiguous in classical Hebrew. (I don't know about modern Israeli Hebrew.) When holem precedes ?, the point is placed on the upper right of the letter, as with ?? (yō'mǎr). When it follows the ?, the point is placed on the upper left, as in ? ('ōbhēdh). When holem precedes ??, the points coincide, as with ?? (mōšěl). When holem follows ??, the points again coincide as with ?? (sōt?ēn). The letter ??? will be šō to commence a syllable, e.g., ?? (šōmǎ'), and ōs in other places. [ R.K. Harrison, Teach Yourself Biblical Hebrew ] The case of (written) Yo'MaR is not an exception. The pronunciation is yomar, the aleph not being pronounced; and therefore the KHOLEM is written after the consonant which directly precedes it in pronunciation. In the examples 'oBeD, MoSHeL, and SoTeN the KHOLEM, as expected, follows in pronunciation the letter with which it is associated. I can't make out the transcription The letter ??? will be šō to commence a syllable, e.g., ?? (šōmǎ'), and ōs in other places.
and I don't have Harrison's grammar at work to check the reading; but it sounds like an explanation of how SHIN + KHOLEM are written, which has already been discussed. In the Bagster Polyglot Bible, Hebrew-English Old Testament, translation Everard van der Hooght, Genesis 1.3 weyyomer elohiem And God said the holem is clearly above the aleph, not above the yod. Same response given for YoMaR above. I see in fact _another_ example of a holem to the right, which Harrison did not mention: the holem in elohiem is above the he, not above the lamed. Due to innate complexity there is variation in Hebrew pointing in manuscripts and printed editions, even leaving aside for the moment discussion of the various Hebrew pointing traditions themselves. But, although KHOLEM following LAMED is indeed orthographically a somewhat special case (due to the fact that LAMED is the only Hebrew character to extend above the scribal line and the extension is precisely from the upper left corner of the glyph where you want to place the KHOLEM), I have nevertheless always seen it written between the LAMED and the following glyph but closer to the LAMED. This is certainly how it is taught and printed these days. I don't have my Bagster here at work but I would suspect if you looked closely, the location of the KHOLEM would be as I have suggested. If not I suspect this is idiosyncratic to works printed on that press. [I did however misspeak technically when I said after the consonant OVER which it is written. The KHOLEM pronounced after LAMED is indeed written OVER the scribal line, but is written directly AFTER the LAMED.] About the only unusual orthographic phenomenon I can think of related to KHOLEM is that when it occurs after SIN it shares the same dot with SIN. And if those dots were above different letters there were no reason why they should share. I must be missing your point here; this seems to support what I was saying. 
But I'm surprised that no one has provided the one possible counterexample to my statement about no vowel preceding its consonant (an example I completely forgot about when writing my former post) - furtive pathach (as in the second a-vowel in SaMeaKH). Depending on your linguistic persuasion you might argue that the PATAKH here is a vowel glide, both written and pronounced, which is merely extending a non-a-vowel before guttural consonants in certain phonemic contexts. Or you might want to posit that it is the only example of a syllable in classical Hebrew beginning with a vowel - or an unwritten consonant. Probably more than we need to know about the originally posted problem, but I have a feeling that readers of this list enjoy, like I do, discussion of these orthographic quirks of the world's writing systems. Respectfully, Dean A. Snyder
[OT] The project is done
Hello! My keymap is done, and is working well. I just wanted to thank everyone who helped me during the construction of all the scripts and tidbits that made it work. Thanks a lot! -Dave Oftedal -- Sonna ojamasan ni ha batsu-geemu namatako pantsu juppun!
RE: Ya-phalaa
Michael, I do not wish to get into yet another long discussion (argument) but I must reply to one point. Your proposed combining ya-phalaa will do Bengali no service, as it will introduce multiple spellings for consonant clusters in -YA. Um, actually if you look, you will not find any place where I have proposed a combining ya-phalaa. I have so far avoided any mention of such a thing due to the reasons you give above. (I think you will find that it was Mijan that mentioned that.) Andy
Re: Malayalam Cillaksharams (was Ya-phalaa)
At 21:14 + 2003-03-05, Andy White wrote: I am replying to this portion of the reply as I feel it is a very important revelation. We weren't hiding it. This is part of the improvements to Unicode that have been made for 4.0. One of the tasks I was given was to improve the block descriptions of the Indic scripts if I could. Most have been improved rather a lot considering the time constraints we have had. In each case we endeavoured to address some of the problem areas. We are still editing. To reiterate our consistency in using this model, I will give you a Malayalam example. NA + VIRAMA + MA -- NMA (a single conjunct) NA + VIRAMA + ZWNJ + MA -- NMA (with a visible virama breve above and between) NA + VIRAMA + ZWJ + MA -- NMA (with the cillaks.aram virama curl) [...] Michael Everson -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: [OT] The project is done
On Wed, 5 Mar 2003, David Oftedal wrote: Hello! My keymap is done, and is working well. I just wanted to thank everyone who helped me during the construction of all the scripts and tidbits that made it work. I'm curious what keymap and for what language/script that is? Probably I ignored the earlier posts regarding this. Is this a keymap that is generally available for people to use? Thanks a lot! -Dave Oftedal -- Sonna ojamasan ni ha batsu-geemu namatako pantsu juppun!
RE: Ya-phalaa
Moreover, RA + VIRAMA + YA cannot represent Ra-yaphalaa as Ra+Virama is relied upon as being representative of Reph. For example, in the Indic OpenType specifications, you will see that a Ra+Virama is recognised as reph before any other processing is applied. If this is the case (and one would like corroboration) then simply reverse the two. The solution is the same. RA + VIRAMA is a pre-base substitution and pre-base stuff gets processed first. RA + ZWNJ + VIRAMA + YA might be the way to go in order to disambiguate REPH + YA from RA + YA-PHALAA. Whatever method is chosen, it will be invisible to the user. The way text is stored on computers has nothing to do with the way text is handwritten, typed, and printed or displayed. Computer characters consist of strings of ones and zeros. The binary string which is stored by a computer to represent the LATIN CAPITAL LETTER A doesn't look anything like the letter. The important matter is that each letter needs to have a unique binary string which can be stored electronically. Lengths of such strings vary. Input methods and display need to match users' expectations, but the underlying binary string encodings do not. The users never see this. Best regards, James Kass
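The point about binary strings of varying length can be made concrete by printing the bits stored for LATIN CAPITAL LETTER A in a few encoding forms. This is a minimal sketch; the helper `bits` is a name made up here, and the big-endian variants are chosen only to avoid a byte-order mark.

```python
def bits(text, encoding):
    """Return the encoded bytes of text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))

# The same character, stored as binary strings of different lengths.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:10} {bits('A', encoding)}")

# The proposed disambiguating sequence RA + ZWNJ + VIRAMA + YA in UTF-8.
print(bits("\u09B0\u200C\u09CD\u09AF", "utf-8"))
```

None of these bit patterns resembles the glyph; what is displayed is decided later by the font and the rendering engine.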
length of text by different languages
I remember there was a study showing that although UTF-8 encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually use LESS characters in writing to communicate information than alphabet-based languages. Can anyone point me to such research? Martin, do you have a paper about that? I would like to find out the average ratio between English, German, French, Japanese, Chinese, and Korean in terms of the number of characters, and in terms of the bytes needed to encode them in UTF-8. If such research has not been done, maybe one way to figure out the result is to take the translated Bibles for these languages from the Sword project, strip out the XML tags to leave the pure text, and measure the size. Since all the Bible translations communicate the same information and the volume is large enough, that could be a good way to find out the result. Of course, the markup needs to be taken out to reduce the noise.
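The measurement proposed above is easy to sketch: strip the markup, then count characters and UTF-8 bytes per language. The two sample strings below are toy placeholders rather than real corpus data, the tag name `verse` is invented, and the regular expression is a crude stand-in for proper XML parsing.

```python
import re

def strip_markup(xml_text):
    """Remove anything that looks like an XML tag (crude, for illustration)."""
    return re.sub(r"<[^>]+>", "", xml_text)

def measure(text):
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

# Toy samples only; a real comparison would load full parallel translations.
samples = {
    "English": "<verse>In the beginning</verse>",
    "Japanese": "<verse>初めに</verse>",
}

for language, xml_text in samples.items():
    chars, nbytes = measure(strip_markup(xml_text))
    print(f"{language}: {chars} characters, {nbytes} UTF-8 bytes, "
          f"{nbytes / chars:.2f} bytes per character")
```

With whole translations in place of the samples, dividing one language's totals by another's would give the ratios asked about, in characters and in bytes.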
RE: Ya-phalaa
I once wrote: My thoughts were to put a ZWNJ after the Ra to indicate that is not to form a Reph e.g. Ra+ZWNJ+Virama+Ya = Ra+Jophola Then I remembered that in some font designs, secondary forms such as jophola can form a conjunct ligature with the preceding consonant. I think that a ZWNJ would imply that Ra and Ya should not ligate. James Kass said: Exactly. This would seem to work without breaking anything existing and would not mean extending the semantics of ZWNJ. Have you since changed your mind about this? No! This is an example of stating something that can be read in two ways - unfortunately you took an unintended meaning :-( Re-iterating in reverse should get the point across, I hope: I think that a ZWNJ would imply that Ra and Ya should not join together. (ZWNonJoiner) But I remembered that in some font designs Ra and Ya *do* join together (they make a ligature.) Therefore Ra+ZWNJ+Virama+Ya cannot represent Ra+Yaphalaa when they form a ligature. Andy
RE: Ya-phalaa
Andy White wrote, No! This is an example of stating something that can be read in two ways - Hmmm, kind of like RA+VIRAMA+YA in current implementations? unfortunately you took an unintended meaning :-( Actually, I did get the intended meaning. Unfortunately, though, I didn't get it until after my reply was sent. (smile) I think that a ZWNJ would imply that Ra and Ya should not join together. (ZWNonJoiner) But I remembered that in some font designs Ra and Ya *do* join together (they make a ligature.) Therefore Ra+ZWNJ+Virama+Ya cannot represent Ra+Yaphalaa when they form a ligature. So, I've had a half hour to consider how to respond to your anticipated response. (smile) If a font designer makes a special ligature form of RA+JOPHOLA, then the easy solution would be to put a look-up in the font's GSUB table: RA + ZWNJ + VIRAMA + YA --- my special ligature form The hard part of this, as you know, is getting something like this to actually work. But, as you also know, the people who are working on Unicode font engines, like Paul Nelson of Microsoft, are very diligent in following up on these special cases. Remember all of our talk about the KHANDA TA and note that the current experimental version of Uniscribe now seems to be properly substituting that form. Best regards, James Kass
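The look-up idea can be illustrated with a toy model of ligature substitution. This is not OpenType code and does not reflect how any real shaping engine processes Bengali; the glyph name `ra_yaphalaa.liga`, the `uniXXXX` fallback names, and the function `shape` are all assumptions made for the sketch.

```python
RA, YA, VIRAMA, ZWNJ = "\u09B0", "\u09AF", "\u09CD", "\u200C"

# Toy substitution table: character sequence -> hypothetical ligature glyph.
LIGATURES = {
    (RA, ZWNJ, VIRAMA, YA): "ra_yaphalaa.liga",
}

def shape(text):
    """Map text to glyph names, replacing any listed sequence with its ligature."""
    glyphs = []
    i = 0
    while i < len(text):
        for sequence, glyph in LIGATURES.items():
            if tuple(text[i:i + len(sequence)]) == sequence:
                glyphs.append(glyph)
                i += len(sequence)
                break
        else:
            # No ligature starts here: emit a default per-character glyph name.
            glyphs.append(f"uni{ord(text[i]):04X}")
            i += 1
    return glyphs

print(shape(RA + ZWNJ + VIRAMA + YA))
print(shape(RA + VIRAMA + YA))
```

In this model the ZWNJ is just another element of the matched sequence, which is the behaviour the proposal relies on and the one questioned in the replies.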
Re: Looking for information on the UnicodeData file
At 04:57 PM 3/5/03 +0100, Pim Blokland wrote: I apologize if this question has been asked before, but I'm relatively new at this. My question is: where can I find formal definitions of the terms used in the Character Name field of the UnicodeData.txt file? Most specifically, precise explanations of designations like turned, inverse, inverted, reversed, rotated etc. Also the difference between digraph and ligature, etc. Although I've searched the FAQ files and the rest of the unicode.org site, I haven't been able to find this info as yet. This site is huge! So can anyone provide me with an URL? Thanks. No such information exists. These are descriptive terms that have been applied somewhat consistently, but not strictly. Officially, character names are (somewhat) arbitrary, but unique identifiers of characters. They are neither always a description of the appearance of a character, nor do they always match the street name for the corresponding elements of the writing systems. A./
Re: Looking for information on the UnicodeData file
By the way, the FAQ was updated today, thanks to people on this list. Rick My question is: where can I find formal definitions of the terms used in the Character Name field of the UnicodeData.txt file? Most
RE: Ya-phalaa
Jameskass wrote: If a font designer makes a special ligature form of RA+JOPHOLA, then the easy solution would be to put a look-up in the font's GSUB table: RA + ZWNJ + VIRAMA + YA --- my special ligature form Now that simplicity makes me smile :-) I would be surprised if anyone (even diligent Paul Nelson of Microsoft) would accept that a sequence containing a non-joiner should be allowed to form a ligature - I could be wrong - I await further responses to see. Andy
RE: length of text by different languages
[EMAIL PROTECTED] wrote: I remember there was a study showing that although UTF-8 encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually use LESS characters in writing to communicate information than alphabet-based languages. Can anyone point me to such research? I don't know of exactly what you want, but I vaguely remember a paper given at a Unicode conference long ago that compared various translations of the charter (or some such) of the Voice of America in a couple or three encodings. Hmm, let's see, could be this: http://www.unicode.org/iuc/iuc9/Friday2.html#b3 Reuters Compression Scheme for Unicode (RCSU) Misha Wolf No paper online, alas. I remember that Chinese was a clear winner in terms of # of characters. In fact, I kind of remember that Chinese was so much denser that it still won after RCSU (now SCSU) compression, which would mean that a Han character contains more than twice as much info on average as a Latin letter as used in (say) English. This is all on pretty shaky ground, distant memories. Perhaps Misha still has the figures (if that's in fact the right paper). -- François