Re: Bengali: variants of same conjunct
Michael Kaplan wrote:

> Thus far it is something that has been implemented in the fonts, rather than anywhere else; for example, there are several ligatures in Tamil that will display one way with the Latha font and the other way with Monotype Tamil Arial (the way set out in Unicode 3.0 is done in the latter). Thus since people who write the language sent both, cut

Do you mean that Tamil writers *purposely* use both the "ancient" and the "modern" forms in the same document? What is the intent?

I can see a similar (but far less acute) problem with Latin lowercase a, which can have two forms (and similarly for the g or the ae ligature). For a, one can at the extreme limit use U+0251 for the alternate, but outside IPA I do not see any use for this distinction. For g or æ, I do not see any way to specify that one wants the rounded (script, italic) form for the left part, or the print-like, upright form. OTOH, I do not see anyone having a problem with that. In fact, I myself don't mix them (except for IPA), even if depending on context I may use one or another form when writing. And I believe this is entirely a rendering problem that is (far) outside Unicode's scope.

Antoine
RE: Case mapping errors?
(This message is sent in UTF-8. Flames regarding that fact will be deleted without response.)

No, those case mappings are not in error. Nor are their canonical mappings in error. (The MICRO SIGN would have had a canonical mapping to Greek mu, if it had not been included in such much-used repertoires as Latin-1.) For the PROSGEGRAMMENI it's my understanding that it is customary (in e.g. dictionaries) to capitalise it the way it is done in Unicode. (But I don't know classical Greek.)

The MICRO, OHM, KELVIN, and ANGSTROM (ÅNGSTRÖM, really) SIGNs are included in Unicode for compatibility reasons only. You should not use them; use instead the characters that they canonically (or 'near canonically' in the case of MICRO SIGN) decompose to. Note that there are many (SI or other) unit names that are *not* included as separate characters, like the symbols for Watt, Volt, etc. Nor is there any need to include them. Those symbols are just letters reused as unit symbols.

The case mappings for these signs derive from the characters that they (near) canonically map to. It's true that you should never case change a unit symbol or unit prefix symbol, but that goes for W, V, m, M, etc. too, even though those can only be represented by "LETTER" characters.

As far as I know, the inclusion of the MICRO and OHM signs derives from their inclusion in repertoires that otherwise contain only Latin letters (and punctuation); apparently someone found these Greek letters important enough (for use in writing unit designations) to include them, with names reflecting why they were included. This does not remove the fact that they are really just ordinary Greek letters. For the Kelvin and Ångström signs I can only speculate as to why they were included in a Korean encoding that was a source for Unicode.
My theory is that the Kelvin sign started out as a DEGREE KELVIN, in analogy with the DEGREE CELSIUS and DEGREE FAHRENHEIT signs (which have a (small) justification as ligatures, especially in CJK typography), until someone pointed out that it's not called (nor written) "degree Kelvin" but just "Kelvin". My theory about the Ångström sign's original inclusion in that Korean encoding is that someone might have thought that the A with a ring was not just a letter, but some special invented symbol (an easy mistake to make if you only know that unit as "angstrom"). It's not a specially invented symbol; it's just the first letter of Mr Ångström's name, just as for Watt, Volt, Kelvin, ...

The case mappings are correct, but you should never apply any case mapping to unit symbols that are letters. Getting software to "understand" what is a unit symbol (without special markup) and what is not might be tricky when the unit symbols are written with letters (as all SI units, except for the degree symbol, and many other units are)... And no, re-including all letters (or letter combinations) as "signs" for each and every reuse that letters have been put to (e.g. unit signs) is not an appropriate solution.

Please, never use those "SIGN"s, except when mapping those letters to character repertoires which do not contain the proper Greek letters, but do contain those "SIGN"s. Nor should you use any of the other "squared" unit characters, except when you absolutely have to get the "squared" typographic effect (ugly in my eyes) in CJK typography from plain text. Note still that there are many (composite) (SI) unit designations that do not have any "squared" character associated with them. The "squared" unit characters are a rather random collection, best forgotten.

Kind regards
/kent k

-Original Message-
From: John O'Conner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 22, 2000 12:15 AM
To: Unicode List
Subject: Case mapping errors?
There are 5 characters that are giving me a little discomfort because of their case mappings:

* U+00B5 MICRO SIGN
* U+1FBE GREEK PROSGEGRAMMENI
* U+2126 OHM SIGN
* U+212A KELVIN SIGN
* U+212B ANGSTROM SIGN

Each of these has case mappings... and I really don't understand why. It appears that none of these has any "round-trip" capability to map back from another case. I suppose this can be argued for a lot of mappings. The most difficult cases are 2126, 212A, and 212B. These characters are "letter-like" in their glyph appearance, but it seems that their actual semantics are not. It seems like someone may have looked at KELVIN SIGN, for example, decided it looked like a Latin-1 'K', and gave it the same lowercase mapping. Still, would you really expect to lowercase a KELVIN SIGN to a small 'k'? I can't imagine... but I may not be as imaginative as some. I have the same argument for OHM SIGN and ANGSTROM SIGN. Although they have case mappings, are they expected by most people? If I were using the OHM, ANGSTROM, or KELVIN SIGN in my work, I would be very surprised if a case operation changed them... maybe I would
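The decompositions and case mappings discussed in this thread can be checked directly against the Unicode character database. Here is a minimal Python sketch (using the standard `unicodedata` module; the results reflect the Unicode data bundled with a modern Python, not the 2000-era tables, but these particular mappings are stable):

```python
import unicodedata

# OHM, KELVIN, and ANGSTROM SIGN carry canonical decompositions, so NFC
# normalization maps each one to the ordinary letter it duplicates:
assert unicodedata.normalize('NFC', '\u2126') == '\u03A9'  # -> capital omega
assert unicodedata.normalize('NFC', '\u212A') == 'K'       # -> plain K
assert unicodedata.normalize('NFC', '\u212B') == '\u00C5'  # -> A with ring

# MICRO SIGN is only "near canonical": it survives NFC, and it takes NFKC
# (compatibility) normalization to reach GREEK SMALL LETTER MU:
assert unicodedata.normalize('NFC', '\u00B5') == '\u00B5'
assert unicodedata.normalize('NFKC', '\u00B5') == '\u03BC'

# The case mappings follow the letters the signs duplicate:
assert '\u00B5'.upper() == '\u039C'   # MICRO SIGN -> GREEK CAPITAL MU
assert '\u1FBE'.upper() == '\u0399'   # PROSGEGRAMMENI -> GREEK CAPITAL IOTA
assert '\u2126'.lower() == '\u03C9'   # OHM SIGN -> small omega
assert '\u212A'.lower() == 'k'        # KELVIN SIGN -> plain small k
assert '\u212B'.lower() == '\u00E5'   # ANGSTROM SIGN -> a with ring

# And, as noted in the question, none of them round-trips back to a "SIGN":
assert '\u03C9'.upper() == '\u03A9'   # not U+2126
assert 'k'.upper() == 'K'             # not U+212A
```

In other words, the signs behave in every respect like the ordinary letters they decompose to, which is exactly the point made in the replies.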
Chinese characters in Java Applet
Hello,

I am trying to display Chinese characters, stored in Unicode format in an Oracle database, through a Java applet in the browser. The applet uses JDBC calls and the thin driver. Oracle resides on a Sun Solaris server, but the applet is not showing the characters correctly. My browser has Chinese fonts. Do I need something else on the client side? What additional things are needed to accomplish Chinese character display in the applet?

Thanks and Rgds,
Parvinder
Re: UTF-8N?
John Cowan wrote:

> Now suppose we have a character sequence beginning with U+FEFF U+0020. This would be encoded as follows:
>
>   US-ASCII: (not possible)
>   UTF-16:   0xFE 0xFF 0xFE 0xFF 0x00 0x20 ...
>   UTF-16:   0xFF 0xFE 0xFF 0xFE 0x20 0x00 ...
>   UTF-16BE: 0xFE 0xFF 0x00 0x20 ...
>   UTF-16LE: 0xFF 0xFE 0x20 0x00 ...
>   UTF-8N:   0xEF 0xBB 0xBF 0x20 ...
>   UTF-8B:   0xEF 0xBB 0xBF 0xEF 0xBB 0xBF 0x20 ...

There is something I must have missed. It was my understanding that U+FEFF, when received as the first character, should be seen as a BOM and not as a character, and handled accordingly. So I expected:

  US-ASCII: 0x20
  UTF-16:   0xFE 0xFF 0x00 0x20 ...
  UTF-16:   0xFF 0xFE 0x20 0x00 ...
  UTF-16BE: 0xFE 0xFF 0x00 0x20 ...
  UTF-16LE: 0xFF 0xFE 0x20 0x00 ...
  UTF-8N:   0xEF 0xBB 0xBF 0x20 ...
  UTF-8B:   0xEF 0xBB 0xBF 0x20 ...

Antoine
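The doubled-BOM behaviour in John's table is easy to reproduce with standard codecs. A Python sketch (here `utf-16` stands in for a BOM-writing UTF-16 codec and `utf-8-sig` plays the role of the hypothetical UTF-8B, which prepends a signature of its own):

```python
import codecs

text = '\uFEFF\u0020'   # the sequence U+FEFF U+0020 from the example

# Codecs with no BOM convention encode U+FEFF as an ordinary character:
assert text.encode('utf-16-be') == b'\xFE\xFF\x00\x20'
assert text.encode('utf-16-le') == b'\xFF\xFE\x20\x00'
assert text.encode('utf-8') == b'\xEF\xBB\xBF\x20'

# The generic 'utf-16' codec prepends its own BOM (in native byte order),
# so the explicit U+FEFF becomes a second, doubled BOM -- John's point:
assert text.encode('utf-16').startswith(codecs.BOM)
assert len(text.encode('utf-16')) == 6    # BOM + U+FEFF + U+0020

# 'utf-8-sig' likewise prepends EF BB BF, doubling the signature:
assert text.encode('utf-8-sig') == b'\xEF\xBB\xBF' * 2 + b'\x20'

# On the decoding side, Antoine's expectation holds: a leading BOM is
# consumed as a signature, not delivered as a character.
assert b'\xEF\xBB\xBF\x20'.decode('utf-8-sig') == ' '
assert b'\xFF\xFE\x20\x00'.decode('utf-16') == ' '
```

So both readings coexist in practice: the encoder treats an explicit U+FEFF as data (doubling the mark), while a signature-aware decoder strips exactly one leading BOM.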
RE: Bengali: variants of same conjunct
> Thus since people who write the language sent both, cut
>
> Do you mean that Tamil writers *purposely* use both the "ancient" and the "modern" forms in the same document? What is the intent?

Yes, that is what I am saying. If you go to several of the Tamil resource sites on the web, you can see both of them used, often in the same documents. This is VERY easy to do with the hack fonts, significantly more difficult if you are using Unicode-enabled fonts.

> And I believe this is entirely a rendering problem that is (far) outside Unicode's scope.

I do not see how, if BOTH forms are in use and one form is not renderable in a font that is Unicode compliant, this would NOT be considered a Unicode issue. It is crucial that language as used should be possible to render with Unicode, should it not?

The ligatures you mention do not really fall into the same category as the Tamil case, since all of them can be rendered using the 3.0 (or even the 2.0!) standard.

I do know that the Tamil Nadu government has specific issues with the Unicode standard; is this not one of them? Or do they prefer only the usage outlined in the standard, in order to encourage people to use it? And would this then be a case of the standard being more involved in politics than might be good?

Michael
Re: UTF-8N?
On 06/21/2000 03:09:43 PM [EMAIL PROTECTED] wrote:

> Appropriate or not, users (you know, those people who don't read the documentation that the programmers don't write) will use text editors to split files. They will then concatenate the files using a non-Unicode aware tool. And they will complain that the checksums mismatch.

I can't argue against that. I think the suggestion that BOM and ZWNBSP be de-unified, which I have heard before, may make the best sense.

Peter Constable
Re: Case mapping errors?
These characters are coded purely for compatibility. Unicode does not distinguish letters by the abbreviations that they happen to be used in. There is no difference in semantics between the "g" in "go" vs. the "g" in "12g", nor between the "Å" in "Århus" vs. the "Å" in "15Å", nor -- for that matter -- the "U" in "Underwood" vs. the "U" in "UTF-8".

Mark

John O'Conner wrote:

> There are 5 characters that are giving me a little discomfort because of their case mappings:
>
> * U+00B5 MICRO SIGN
> * U+1FBE GREEK PROSGEGRAMMENI
> * U+2126 OHM SIGN
> * U+212A KELVIN SIGN
> * U+212B ANGSTROM SIGN
>
> Each of these has case mappings... and I really don't understand why. It appears that none of these has any "round-trip" capability to map back from another case. I suppose this can be argued for a lot of mappings. The most difficult cases are 2126, 212A, and 212B. These characters are "letter-like" in their glyph appearance, but it seems that their actual semantics are not. It seems like someone may have looked at KELVIN SIGN, for example, decided it looked like a Latin-1 'K', and gave it the same lowercase mapping. Still, would you really expect to lowercase a KELVIN SIGN to a small 'k'? I can't imagine... but I may not be as imaginative as some. I have the same argument for OHM SIGN and ANGSTROM SIGN. Although they have case mappings, are they expected by most people? If I were using the OHM, ANGSTROM, or KELVIN SIGN in my work, I would be very surprised if a case operation changed them... maybe I would be disappointed or frustrated even. Are these bugs in the spec? Or do I just need to think about them a little differently?
>
> Best regards,
> John O'Conner
Re: UTF-8N?
[EMAIL PROTECTED] wrote:

> ... I think the suggestion that BOM and ZWNBSP be de-unified, which I have heard before, may make the best sense.

*If* that's the solution, it should be done yesterday. The longer it takes, the more implementations (and data) there will be that need to be changed.

- Chris
Re: Chinese characters in Java Applet
On Thu, Jun 22, 2000 at 02:20:39 -0800, Parvinder Singh (EHPT) wrote:

> I am trying to display Chinese characters, stored in Unicode format in an Oracle database, through a Java applet in the browser. The applet uses JDBC calls and the thin driver. Oracle resides on a Sun Solaris server, but the applet is not showing the characters correctly. My browser has Chinese fonts. Do I need something else on the client side? What additional things are needed to accomplish Chinese character display in the applet?

Yes, you need to tell the client-side AWT which platform fonts to use. I posted sample font.properties entries for Win32 just a few days ago; Solaris is not very different. If you missed that post of mine, just drop me a note and I'll forward it to you.

SY, Uwe
--
[EMAIL PROTECTED]           | Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
RE: How to distinguish UTF-8 from Latin-* ?
At 12:12 PM 06/20/2000 -0800, Kenneth Whistler wrote:

Bob Rosenberg wrote:

>> This was my concern: there is no way to distinguish UTF-8 from Latin-1 in the case of upper ASCII characters here.
>
> Yes there is - it's called a "Sanity Check". You parse the file looking for High-ASCII. If you find none, you are US-ASCII (or ISO-8859-1). Once you find one, you use the UTF-8 suffix method to see how long the sequence should be IF it is UTF-8. Look at the next x characters to see if they have the correct suffix. If not, count one Bad-UTF-8; if so, count one Good-UTF-8. Once you roll off the end of the sequence, resume scanning for another High-ASCII byte and do the check again. After finding 12 sequences that start with High-ASCII (or bopping off the end of the file), check your GOOD/BAD counts. All BAD means ISO-8859-1. All GOOD means UTF-8.

Well, not necessarily. Granted, the distribution of lead bytes and successor bytes in UTF-8, when interpreted as ISO 8859-1, mostly results in gibberish that is unlikely to appear in real text. The first byte of a two-byte UTF-8 sequence is essentially an accented capital letter in 8859-1 (0xC0..0xDF). And the successor bytes are either C1 controls or come from the set of miscellaneous symbols, currency signs, punctuation, etc., that are rather unlikely to occur directly following an uppercase accented Latin letter. But if I invented a hoity-toity company name with extra accents for "class", such as

    L·DÏ·DÀ® Productions, Inc.

and sent this to you in ISO 8859-1, as I am currently doing, your sanity check will fail in this case and identify this file as UTF-8, with 3 characters misinterpreted (i.e., "L" bullet "D" Greek-letter-eta "D" "." Productions, Inc.). Of course, a further check for irregular UTF-8 sequences would discover that 0xC0 0xAE == U+002E is not shortest-form UTF-8, and might, therefore, not actually be UTF-8, but even that cannot really be relied on.
True, you can FAKE an incorrect evaluation by plugging a trick string into an otherwise low-ASCII file/message. My comment was aimed at normal (not faked) files. I agree that I missed the extra sanity check of looking for shortest form, but if I remember the rules correctly, there is no requirement that the shortest form be emitted - only a strong suggestion to do so (with a stronger suggestion to accept it [i.e.: "Be liberal with what you accept and conservative with what you create"]). I doubt that a real ISO-8859-1 file could be mistaken for a UTF-8 one without it being specially constructed to trick the sanity check.

Note that the 12-sequence "universe" is just an attempt to check for false positives and could be adjusted for circumstances. Mixed results (with most being BAD) mean ISO-8859-1 (the GOODs are "noise"). Mostly GOOD with a few BAD means either malformed UTF-8 or ISO-8859-1 (with the bad luck of containing 2-byte sequences that LOOK LIKE UTF-8).

> Even entirely GOOD can have that bad luck, as this email itself demonstrates.

Since this is a special message that was designed to spoof, not a real message, I do not regard it as bad luck. If you can supply a set of normal text that would give a false reading, I'd be much more willing to say that my claim of just doing a sanity check was overly simplistic.

--Ken
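The "sanity check" described in this exchange is straightforward to sketch in code. The function name, the 12-sequence cutoff, and the exact scoring are taken from the discussion but are otherwise arbitrary choices of mine; this is a heuristic sketch, not a robust charset detector:

```python
def looks_like_utf8(data: bytes, max_checks: int = 12) -> bool:
    """Scan for high bytes; each must start a well-formed UTF-8 sequence
    with the right number of continuation bytes.  Returns True only if
    some multi-byte sequences were found and all were good.  Pure ASCII
    returns False (the question is undecidable there), and, as Ken's
    example shows, contrived Latin-1 text can still fool the check."""
    good = bad = checks = 0
    i = 0
    while i < len(data) and checks < max_checks:
        b = data[i]
        if b < 0x80:                       # plain ASCII: skip
            i += 1
            continue
        checks += 1
        if 0xC2 <= b <= 0xDF:              # lead byte, 2-byte sequence
            n = 1                          # (0xC0/0xC1 excluded: always overlong)
        elif 0xE0 <= b <= 0xEF:            # 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:            # 4-byte sequence
            n = 3
        else:                              # stray continuation / invalid lead
            bad += 1
            i += 1
            continue
        tail = data[i + 1:i + 1 + n]
        if len(tail) == n and all(0x80 <= t <= 0xBF for t in tail):
            good += 1
            i += 1 + n
        else:
            bad += 1
            i += 1
    return bad == 0 and good > 0
```

For example, `'Ångström'.encode('utf-8')` passes, while the same text encoded as Latin-1 fails, because the byte after 0xC5 is an ordinary ASCII letter rather than a continuation byte.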
RE: How to distinguish UTF-8 from Latin-* ?
-Original Message-
From: Robert A. Rosenberg [mailto:[EMAIL PROTECTED]]
...

[on overlong UTF-8 sequences, a few lines down:]

> faked) files. I agree that I missed the extra sanity check of looking for shortest form, but if I remember the rules correctly, there is no requirement that the shortest form be emitted - only a strong suggestion to do so (with a stronger suggestion to accept it [i.e.: "Be liberal with what you accept and conservative with what you create"]).

Well, there is a security aspect to this: sometimes texts need to be scanned to determine whether they are "harmless" or may trigger some undesirable interpretation (as interpreted program code, like shell script, for instance). A hacker may try to hide characters that trigger the undesired, and potentially dangerous, interpretation by using overlong UTF-8 sequences. If the security scanner does not "decode" overlong UTF-8 sequences, but the interpreter accepts them as if nothing was wrong, things you would not like to happen might happen. So overlong UTF-8 sequences should be regarded as errors, and not as a coding for any character at all.

Yes, you may regard systems that have "escapes" into "execute this" mode at all as ill-designed. But they are around.

Kind regards
/kent k
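Kent's attack scenario is exactly why strict decoders reject overlong forms outright. A quick Python illustration, using the overlong encoding of '.' from Ken's earlier message (Python's UTF-8 decoder is strict by default):

```python
# 0xC0 0xAE is an overlong two-byte encoding of '.' (U+002E).  A lenient
# decoder that accepted it would let a '.' slip past any scanner that
# only searched the raw bytes for 0x2E -- so strict decoders refuse it.
overlong_dot = b'\xC0\xAE'
try:
    overlong_dot.decode('utf-8')
    raise AssertionError('overlong sequence was accepted')
except UnicodeDecodeError as exc:
    # 0xC0 can never start a valid sequence in strict UTF-8
    print('rejected:', exc.reason)
```

The same reasoning later made its way into the UTF-8 specification itself: decoders are required to treat non-shortest-form sequences as errors.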
Re: UTF-8N?
"Ayers, Mike" wrote:

> Am I reading this wrong? Here's what I get: I hand you a UTF-16 document. This document is:
>
>   FE FF 00 48 00 65 00 6C 00 6C 00 6F
>
> ...so it says "Hello". Then I say, "Oh, by the way, that's big-endian." *POOF* The content of the document has changed, and there is now a ZERO WIDTH NO-BREAK SPACE at the beginning. Smells pretty skunky...

No, what you have said is that this document is in "UTF-16BE" encoding. That's the name for an encoding that is known a priori to be BE, and does not permit a BOM. It is not the name for an encoding that has a BOM but just happens to be BE. Since you have changed the encoding, the content has naturally changed too, just as if you had declared an 8859-1 document to be 8859-2.

> BTW, what is a ZWNBSP anyway? From here it seems like a non-character. Is there an actual use for it?

Yes. It indicates that a line break may not be introduced at this point. It is similar to the NO-BREAK SPACE (U+00A0), which you may be familiar with under its HTML name of &nbsp;, except that it doesn't produce any actual whitespace. ZWNBSP is useful in languages that don't use whitespace, and in strings like "M.T.A." where a line breaker might be tempted to break after a period. Its opposite number is ZWSP (U+200B), which likewise doesn't generate any actual whitespace, but indicates that line breaking *is* permitted here.

-- 
Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)
Re: UTF-8N?
Kenneth Whistler wrote:

> Now we are pushing this through the long, bureaucratic process of getting it accepted into 10646-1, so that we maintain synchronicity with a joint publication of it as a *standard* character.

So a fair statement of what you hope to achieve is: U+2060 will be the zero-width non-breaking space, or zero-width word joiner depending on how you look at it, and U+FEFF will be a byte order mark, which MAY (but SHOULD NOT) be used with the same semantics as U+2060.

-- 
Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)
RE: UTF-8 BOM Nonsense
I agree, Gary. Windows 2000 Notepad, however, does not agree, and writes one. Since Notepad in prior versions of Windows was in fact the de facto standard HTML editor <g>, clearly it is a program to be reckoned with. People should be aware of the fact that there are going to be MANY files out there that are UTF-8 and do have a BOM. I do not believe that this will require it to be added to a standard, and it is a non-standard usage, but life is about dealing with things as they are (and this is how they are!).

Michael

--
From: Gary L. Wade [SMTP:[EMAIL PROTECTED]]
Sent: Thursday, June 22, 2000 9:08 AM
To: Unicode List
Subject: UTF-8 BOM Nonsense

Please! After hundreds of e-mails on this topic, let it die! The BOM is only useful with UTF-16 or UCS-4 characters. There is no such thing as byte ordering when each character is a byte or a multibyte sequence with a well-documented ordering denoting how to interpret it! For further reference, turn to page 20 in the Unicode 3.0 book, and let us get back to more important things, such as how to represent the price of tea in China! ;-)

-- 
Gary L. Wade
Product Development Consultant
DesiSoft Systems             | Voice: 214-642-6883
9619 E. Valley Ranch Parkway | Fax: 972-506-7478
Suite 2125                   | E-Mail: [EMAIL PROTECTED]
Irving, TX 75063             |
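For anyone who has to cope with those Notepad-generated files, tolerating and stripping a leading UTF-8 signature is straightforward. A sketch (the helper name is mine; `codecs.BOM_UTF8` is the standard constant for the EF BB BF bytes):

```python
import codecs

def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present.  The bytes carry no byte-order
    information in UTF-8, but writers such as Windows 2000 Notepad emit
    them as a signature, so tolerant readers should simply skip them."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When decoding rather than working on raw bytes, the `utf-8-sig` codec does the same thing: it strips one leading signature if present and otherwise behaves exactly like `utf-8`.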
Java, SQL, Unicode and Databases
I want to write an application in Java that will store information in a database using Unicode. Ideally the application will run with any database that supports Unicode. One would presume that the JDBC driver would take care of any differences between databases, so my application could be independent of the database. (OK, I know it is a naive view.) However, I am hearing that databases from different vendors require the use of different datatypes, or limit you to using certain datatypes, if you want to store Unicode. Changing datatypes would, I presume, make a significant difference in my programming of the application...

So, I want to make a list of the changes I need to make to my Java/SQL application in the event I want to support each of the major databases (Oracle 8i, MS SQL Server 7, etc.) with respect to Unicode data storage. (I am sure there are other differences programming to different databases, independent of Unicode data, but those issues are understood.)

So, if you can help me by identifying specific changes you would make to query or update a major vendor's database with respect to Unicode support, I would be very appreciative. If I get a good list, I'll post it back here. I am most interested in Oracle and MS SQL Server, but will collect info on any database. As an example, I am hearing that some databases would require varchar, others nchar, for Unicode data.

tex
-- 
Tex Texin                    Director, International Products
Progress Software Corp.      +1-781-280-4271
14 Oak Park                  +1-781-280-4655 (Fax)
Bedford, MA 01730 USA        [EMAIL PROTECTED]

http://www.progress.com       The #1 Embedded Database
http://www.SonicMQ.com        JMS Compliant Messaging - Best Middleware Award
http://www.aspconnections.com Leading provider in the ASP marketplace

Progress Globalization Program (New URL):
http://www.progress.com/partners/globalization.htm

Come to the Panel on Open Source Approaches to Unicode Libraries at the Sept. Unicode Conference: http://www.unicode.org/iuc/iuc17
English as she is spoke
I got some amusing results when I tried out the AltaVista translation service on segments of the new language descriptions in http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Original (English):

What is Unicode? Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Hand-translated into German, on that page:

Was ist Unicode? Unicode gibt jedem Zeichen seine eigene Nummer, platformunabhängig, programmunabhängig, sprachunabhängig. Grundsätzlich arbeiten Computer nur mit Zahlen. Buchstaben und andere Zeichen werden daher Zahlen zugeordnet, um sie zu speichern. Vor der Erfindung von Unicode gab es hunderte unterschiedlicher Kodierungssysteme. Keines dieser Kodierungssyteme umfasste je genug Zeichen: so braucht die Europäische Union allein mehrere Kodierungssysteme, um damit den Bedarf für die Sprachen aller Mitgliedsländer abzudecken. Nicht einmal für eine einzelne Sprache wie Englisch oder Deutsch gab es ein Kodierungssystem das wirklich alle Buchstaben, Interpunktionszeichen und alle gebräuchlichen technischen Zeichen umfasste. Diese Kodierungssysteme sind untereinander unverträglich, denn unterschiedliche Kodierungen können dieselbe Zahl für verschiedene Zeichen benutzen, oder verschiedene Zahlen für dasselbe Zeichen. Jeder Rechner (vor allem Server) muß viele verschiedene Kodierungssysteme unterstützen; und wenn Text zwischen verschiedenen Kodierungssystemen oder Rechnersystemen ausgetauscht wird, läuft dieser Text Gefahr, verstümmelt zu werden.

AltaVista, German to English:

What is university code? University code gives its own number, platformunabhängig, to each character programmunabhängig, sprachunabhängig. Grundsätzlich operate computers only with numbers. Letters and other characters are assigned numbers in order to store it. Before the invention of university code there were hundred different coding systems. None this Kodierungssyteme covered ever enough characters: thus the Europäi union needs alone several coding systems, in order to cover with it the requirement für the languages of all Mitgliedsländer. Not even für an individual language such as English or German gave it a coding system that really all letters, punctuation characters and all gebräuchlichen technical characters covered. These coding systems are among themselves unverträglich, because different coding können the same number für different characters use, or different numbers für the same character. Each computer (above all server) muß many different coding systems unterstützen; and if text between different coding systems or computer systems is exchanged, this text danger läuft to be verstümmelt.

AltaVista, English to German:

Was ist Unicode? Unicode stellt eine eindeutige Zahl für jedes Zeichen, egal was die Plattform, egal was das Programm, egal was die Sprache zur Verfügung. Grundlegend beschäftigen Computer gerade Zahlen. Sie speichern Zeichen und andere Zeichen, indem sie eine Zahl für jede zuweisen. Bevor Unicode erfunden wurde, gab es Hunderte der unterschiedlichen verschlüsselsysteme für das Zuweisen dieser Zahlen. Keine kodierung konnte genügende Zeichen enthalten: z.B. benötigt der europäische Anschluß alleine einige unterschiedliche Encodings, alle seine Sprachen zu umfassen. Sogar für eine einzelne Sprache wie Englisch war keine kodierung für alle Zeichen, Interpunktion und technischen Symbole in allgemeinem Gebrauch ausreichend. Diese verschlüsselsysteme widersprechen auch miteinander. Das heißt, können zwei Encodings die gleiche Zahl für zwei unterschiedliche Zeichen verwenden, oder verwenden Sie unterschiedliche Zahlen für das gleiche Zeichen. Irgendwelche gegebenen Notwendigkeiten des Computers (besonders Servers), viele unterschiedliche Encodings zu unterstützen; dennoch, wann immer Daten zwischen
Re: UTF-8N?
On 06/21/2000 06:33:57 PM [EMAIL PROTECTED] wrote:

>> The standard doesn't ever discuss the BOM in the context of UTF-8,
>
> See section 13.6 (page 324).

Sure enough. Well, there you go: the confusion is officially sanctioned!

Peter Constable
Re: Bengali: variants of same conjunct
Michael Kaplan wrote:

> > Thus since people who write the language sent both, cut
> >
> > Do you mean that Tamil writers *purposely* use both the "ancient" and the "modern" forms in the same document? What is the intent?
>
> Yes, that is what I am saying.

Okay, I did not know (and I did not notice any example thereof; but I do not read Tamil either ;-)). But what is the semantic intent, then? In other words, what may the use of the "elephant-trunk" ai vs the "normal" one mean? What may the use of the rounded naa vs the "normal", two-part one mean? Are we talking about that, by the way? And are there any other differences?

[The different forms for Latin a, g or æ]

> > And I believe this is entirely a rendering problem that is (far) outside Unicode's scope.
>
> I do not see how, if BOTH forms are in use and one form is not renderable in a font that is Unicode compliant, this would NOT be considered a Unicode issue.

Because there is no semantic difference between them. Similarly, if you use a font like Poetica, there are a vast number of different glyphs for . Does anyone consider encoding this in Unicode?

> It is crucial that language as used should be possible to render with Unicode, should it not?

I disagree. For example, when I want to insist on one point, I use several techniques. When I speak, I speak louder and a bit slower; when I write a note, I use a bolder font; on the Internet, I use asterisks. All of these are part of the language, and as such are to be kept with the text. But I do not believe they have to be encoded in Unicode: this would simply lead too far in a multi-language world. Usage of glyphic variations is in my mind even less significant, so should also be dropped.

> The ligatures you mention do not really fall into the same category as the Tamil case, since all of them can be rendered using the 3.0 (or even the 2.0!) standard.

Please explain to me how you render the script form of æ using a standard upright font like Helvetica (not the expert variation). Or else the two-bowl form of g with Courier? Or did I miss your point?

> I do know that the Tamil Nadu government has specific issues with the Unicode standard, is this not one of the issues?

Perhaps; I do not know. In fact, I cannot figure out what issues the TN government really has.

> Or do they prefer only the usage outlined in the standard, in order to encourage people to use it?

Please do not forget that while Tamil Nadu is the principal place where Tamil is spoken, it is not the only one, as Tamil is spoken all around the Indian Ocean. When I speak about French usage, I can only give testimonies. The various French official agencies in charge of the language have a bit more power, but it is far from things like "thou shalt use this rendering form"... (for example, if a bill were passed to eradicate œ or ÿ in French, usage would survive for years, and Unicode would have to continue to support them, not to mention the other French-speaking countries that might easily choose _not_ to apply the bill themselves).

Antoine