RE: Arabic Script converting Between different Code Pages
If you are on a Microsoft platform and have code page support for the Arabic code pages, then a simple MultiByteToWideChar call will take care of it. Here are the code page numbers to use:

Arabic (ASMO 708): 708
Arabic (DOS): 720
Arabic (ISO): 28596
Arabic (Mac): 10004
Arabic (Windows): 1256

The Mac one seems to require Windows 2000, although perhaps MLang would allow the conversion using its libraries if you have it installed (it comes with IE4/IE5, etc.).

Michael

--
From: Magda Danish (Unicode) [SMTP:[EMAIL PROTECTED]]
Sent: Tuesday, June 27, 2000 9:44 AM
To: Unicode List
Subject: FW: Arabic Script converting Between different Code Pages

-----Original Message-----
From: Akil Fahd [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 27, 2000 8:47 AM
To: [EMAIL PROTECTED]
Subject: Arabic Script converting Between different Code Pages

Is there a standard method for converting between the various Arabic code pages (ISO 8859-6, Arabic Mac, ASMO 708, ISO 8859-1, Windows-1256) and Unicode 3.0?

Akil Fahd

Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com
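The pivot-through-Unicode approach described above can be sketched without the Win32 API at all. This is a minimal illustration in Python, using its built-in `iso-8859-6` and `cp1256` codecs to stand in for two of the code pages listed; it is not the Windows call itself, just the same conversion idea:

```python
# Converting between Arabic code pages by pivoting through Unicode,
# which is conceptually what MultiByteToWideChar followed by
# WideCharToMultiByte does on Windows.
def convert(data: bytes, src: str, dst: str) -> bytes:
    # Decode the legacy bytes to Unicode, then encode to the target page.
    return data.decode(src).encode(dst)

# ARABIC LETTER ALEF (U+0627) happens to be 0xC7 in both
# ISO 8859-6 and Windows-1256.
alef = b"\xc7"
assert alef.decode("iso-8859-6") == "\u0627"
assert convert(alef, "iso-8859-6", "cp1256") == b"\xc7"
```

Characters outside the shared letter range will not round-trip between every pair of these code pages, which is exactly why pivoting through Unicode rather than mapping page-to-page directly is the sane approach.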
RE: Java, SQL, Unicode and Databases
Microsoft is very COM-based for its actual data access methods, and COM uses BSTRs, which are BOM-less UTF-16. Because of that, the actual storage format of any database ends up irrelevant, since it will be converted to UTF-16 anyway. Given that this is what the data layers do, performance is certainly better if there does not have to be an extra call to the Windows MultiByteToWideChar to convert UTF-8 to UTF-16. So from a Windows perspective, not only is it no trouble, it is also the best possible solution!

In any case, I know plenty of web people who *do* encode their strings in SQL Server databases as UTF-8 for web applications, since UTF-8 is their preference. They are willing to take the hit of "converting themselves" because when data is being read it is faster to go through no conversions at all.

Michael

--
From: [EMAIL PROTECTED] [SMTP:[EMAIL PROTECTED]]
Sent: Friday, June 23, 2000 7:55 AM
To: Unicode List
Cc: Unicode List; [EMAIL PROTECTED]
Subject: Re: Java, SQL, Unicode and Databases

I think that this is also true for DB2 using UTF-8 as the database encoding. From an application perspective, MS SQL Server is the one that gives us the most trouble, because it doesn't support UTF-8 as a database encoding for char, etc.

Joe

Kenneth Whistler [EMAIL PROTECTED] on 06/22/2000 06:42:20 PM
To: "Unicode List" [EMAIL PROTECTED]
cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe Ross/Tivoli Systems)
Subject: Re: Java, SQL, Unicode and Databases

Jianping responded:

> Tex, Oracle doesn't have any special requirement for the datatype in the JDBC driver if you use UTF8 as the database character set. In this case, all the text datatypes in JDBC will support Unicode data.

The same thing is, of course, true for Sybase databases using UTF-8 as the database character set, accessing them through a JDBC driver. But I think Tex's question is aimed at the much murkier area of what the various database vendors' strategies are for dealing with UTF-16 Unicode as a datatype. In that area, the answers for what a cross-platform application vendor needs to do and for how JDBC drivers might abstract differences in database implementations are still unclear.

--Ken
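The conversion hop the thread is arguing about can be sketched in a few lines of Python; here `utf-16-le` stands in for the BOM-less UTF-16 that a BSTR holds, and the byte strings stand in for database column contents (the column layouts are illustrative, not any vendor's actual storage format):

```python
# A string stored as UTF-8 must be decoded and re-encoded before it
# can land in a BOM-less UTF-16 buffer (what a BSTR contains); data
# already stored as UTF-16 can be handed over as-is.
s = "\u0627\u0644\u0646\u0635"  # some Arabic text

stored_utf8 = s.encode("utf-8")       # UTF-8 column (e.g. Oracle UTF8)
stored_utf16 = s.encode("utf-16-le")  # UTF-16 column (e.g. NCHAR/NVARCHAR)

# The extra conversion a UTF-8 column forces on every fetch:
bstr_bytes = stored_utf8.decode("utf-8").encode("utf-16-le")
assert bstr_bytes == stored_utf16
```

The point Michael is making is that the consumer sees `stored_utf16` either way; the only question is whether the data layer had to pay for the decode/re-encode step to get there.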
RE: UTF-8 BOM Nonsense
Yes, I do feel this way, actually. :-)

The standard is quite clear in its language; one does not have to be a semanticist to understand that:

1) XML *is* considered to be UTF-8 if there is no BOM, and UTF-16 if there is.
2) The encoding tag was added in recognition of the fact that #1 will not be enough for some people.
3) An XML parser *must* be able to do #1, but #2 is not required.

So I would not discourage the tag, ever. But I would also never produce XML that was not UTF-8 or UTF-16. :-)

Michael

--
From: Robert A. Rosenberg [SMTP:[EMAIL PROTECTED]]
Sent: Friday, June 23, 2000 11:34 AM
To: Michael Kaplan (Trigeminal Inc.)
Cc: Unicode List
Subject: RE: UTF-8 BOM Nonsense

At 11:31 AM 06/22/2000 -0800, Michael Kaplan (Trigeminal Inc.) wrote:

> I do not believe that this will require it to be added to a standard, and this is a non-standard usage, but life is about dealing with things as they are (and this is how they are!).

I assume that you also feel that the charset parm on a MIME Email Header (or HTML/XML header) is not needed and thus should be discouraged. The use of the BOM character at the start of a TEXT file serves the same purpose as the charset tag: it says "I am in UTF-8 format" (so you do not try to treat it as ISO-8859-x, CP1252, or some other encoding).
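Rule #1 above amounts to a tiny piece of byte-sniffing. A minimal sketch in Python, covering only the BOM-versus-no-BOM default described here (a real parser would also honor the encoding declaration and other detection patterns):

```python
# Default encoding detection for an XML entity, per the rule in the
# post: UTF-16 if the stream starts with a BOM, UTF-8 otherwise.
def sniff_xml_encoding(head: bytes) -> str:
    # U+FEFF encoded in UTF-16, little- or big-endian
    if head.startswith(b"\xff\xfe") or head.startswith(b"\xfe\xff"):
        return "utf-16"
    return "utf-8"

assert sniff_xml_encoding(b"\xff\xfe<\x00?\x00") == "utf-16"
assert sniff_xml_encoding(b"<?xml version='1.0'?>") == "utf-8"
```

This is also why a document in any other encoding needs the explicit encoding declaration: with no BOM and no tag, the parser is entitled to assume UTF-8.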
RE: Java, SQL, Unicode and Databases
The datatype *does* matter, in the sense that you would use UTF-16 data fields (NTEXT, NCHAR, and NVARCHAR) and access them with your favorite data access method, which will convert as needed to whatever format it uses. You will never know or care what the underlying engine stores. The web site approach will not work for you, since you would have to do the extra conversions to do the data mining, so you would probably go with plan "A".

My general point is that OLE DB to an Oracle UTF-8 field and to a SQL Server UTF-16 field both return the same type of data: UTF-16. So COM in this case is hiding the differences.

Michael

--
From: [EMAIL PROTECTED] [SMTP:[EMAIL PROTECTED]]
Sent: Friday, June 23, 2000 2:27 PM
To: Michael Kaplan (Trigeminal Inc.)
Cc: Unicode List; [EMAIL PROTECTED]
Subject: RE: Java, SQL, Unicode and Databases

Michael, are you saying that the data type (char or nchar) doesn't matter? Are you saying that if we just use UTF-16 or wchar_t interfaces to access the data, all will be fine and we will be able to store multilingual data even in fields defined as char? Maybe things aren't as bad as I feared.

With respect to the web applications you describe, do they store the UTF-8 as binary data? This wouldn't work for us, since we want other data mining applications to be able to access the same data.

Thanks,
Joe

"Michael Kaplan (Trigeminal Inc.)" [EMAIL PROTECTED] on 06/23/2000 10:41:39 AM
To: Unicode List [EMAIL PROTECTED], Joe Ross/Tivoli Systems@Tivoli Systems
cc: Hossein Kushki@IBMCA
Subject: RE: Java, SQL, Unicode and Databases

Microsoft is very COM-based for its actual data access methods, and COM uses BSTRs, which are BOM-less UTF-16. Because of that, the actual storage format of any database ends up irrelevant, since it will be converted to UTF-16 anyway. Given that this is what the data layers do, performance is certainly better if there does not have to be an extra call to the Windows MultiByteToWideChar to convert UTF-8 to UTF-16. So from a Windows perspective, not only is it no trouble, it is also the best possible solution!

In any case, I know plenty of web people who *do* encode their strings in SQL Server databases as UTF-8 for web applications, since UTF-8 is their preference. They are willing to take the hit of "converting themselves" because when data is being read it is faster to go through no conversions at all.

Michael

--
From: [EMAIL PROTECTED] [SMTP:[EMAIL PROTECTED]]
Sent: Friday, June 23, 2000 7:55 AM
To: Unicode List
Cc: Unicode List; [EMAIL PROTECTED]
Subject: Re: Java, SQL, Unicode and Databases

I think that this is also true for DB2 using UTF-8 as the database encoding. From an application perspective, MS SQL Server is the one that gives us the most trouble, because it doesn't support UTF-8 as a database encoding for char, etc.

Joe

Kenneth Whistler [EMAIL PROTECTED] on 06/22/2000 06:42:20 PM
To: "Unicode List" [EMAIL PROTECTED]
cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe Ross/Tivoli Systems)
Subject: Re: Java, SQL, Unicode and Databases

Jianping responded:

> Tex, Oracle doesn't have any special requirement for the datatype in the JDBC driver if you use UTF8 as the database character set. In this case, all the text datatypes in JDBC will support Unicode data.

The same thing is, of course, true for Sybase databases using UTF-8 as the database character set, accessing them through a JDBC driver. But I think Tex's question is aimed at the much murkier area of what the various database vendors' strategies are for dealing with UTF-16 Unicode as a datatype. In that area, the answers for what a cross-platform application vendor needs to do and for how JDBC drivers might abstract differences in database implementations are still unclear.

--Ken
RE: Bengali: variants of same conjunct
> > Thus, since people who write the language want both, [cut]
>
> Do you mean that Tamil writers *purposely* use both the "ancient" and the "modern" forms in the same document? What is the intent?

Yes, that is what I am saying. If you go to several of the Tamil resource sites on the web, you can see both of them used, often in the same documents. This is VERY easy to do with the hack fonts, and significantly more difficult if you are using Unicode-enabled fonts.

> And I believe this is entirely a rendering problem that is (far) outside Unicode's scope.

I do not see how, if BOTH forms are in use and one form is not renderable with a font that is Unicode compliant, this would NOT be considered a Unicode issue. It is crucial that the language as actually used be renderable with Unicode, is it not?

The ligatures you mention do not really fall into the same category as the Tamil case, since all of them can be rendered using the 3.0 (or even the 2.0!) standard.

I do know that the Tamil Nadu government has specific issues with the Unicode standard; is this not one of them? Or do they prefer only the usage outlined in the standard, in order to encourage people to use it? And would this then be a case of the standard being more involved in politics than might be good?

Michael
RE: UTF-8 BOM Nonsense
I agree, Gary. Windows 2000 Notepad, however, does not agree, and writes one. Since Notepad in prior versions of Windows was in fact the de facto standard HTML editor (g), clearly it is a program to be reckoned with. People should be aware of the fact that there are going to be MANY files out there that are UTF-8 and do have a BOM.

I do not believe that this will require it to be added to a standard, and this is a non-standard usage, but life is about dealing with things as they are (and this is how they are!).

Michael

--
From: Gary L. Wade [SMTP:[EMAIL PROTECTED]]
Sent: Thursday, June 22, 2000 9:08 AM
To: Unicode List
Subject: UTF-8 BOM Nonsense

Please! After hundreds of e-mails on this topic, let it die! The BOM is only useful with UTF-16 or UCS-4 characters. There is no such thing as byte ordering when each character is a byte or a multibyte sequence with a well-documented ordering denoting how to interpret it! For further reference, turn to page 20 in the Unicode 3.0 book, and let us get back to more important things, such as how to represent the price of tea in China! ;-)

--
Gary L. Wade
Product Development Consultant
DesiSoft Systems
9619 E. Valley Ranch Parkway, Suite 2125
Irving, TX 75063
Voice: 214-642-6883 | Fax: 972-506-7478 | E-Mail: [EMAIL PROTECTED]
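Code that consumes such Notepad-written files has to tolerate a leading U+FEFF. A small sketch in Python, whose `utf-8-sig` codec strips the BOM if one is present and passes BOM-less input through unchanged:

```python
# Bytes as Windows 2000 Notepad writes them: UTF-8 BOM, then the text.
notepad_bytes = b"\xef\xbb\xbfhello"

# Plain utf-8 keeps the BOM as a leading U+FEFF character...
assert notepad_bytes.decode("utf-8") == "\ufeffhello"

# ...while utf-8-sig strips it, and is harmless when no BOM exists.
assert notepad_bytes.decode("utf-8-sig") == "hello"
assert b"hello".decode("utf-8-sig") == "hello"
```

That stray U+FEFF is exactly what breaks naive consumers (string comparisons, XML/HTML prologs) when they assume UTF-8 files never carry a BOM.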
RE: Bengali: variants of same conjunct
Thus far it is something that has been implemented in the fonts, rather than anywhere else. For example, there are several ligatures in Tamil that will display one way with the Latha font and the other way with Monotype Tamil Arial (the latter does it the way set out in Unicode 3.0). Thus, since people who write the language want both, they have a not entirely satisfying workaround: specify different fonts for different parts of the document.

Michael

--
From: Christopher John Fynn [SMTP:[EMAIL PROTECTED]]
Sent: Wednesday, June 21, 2000 11:17 AM
To: Unicode List
Cc: Arijit Upadhyay
Subject: Re: Bengali: variants of same conjunct

This is something which I think needs to be *specified* in the standard. This is an instance where it should not be left up to implementers to invent their own character combinations to force one kind of ligature or another.

- Chris

----- Original Message -----
From: Arijit Upadhyay [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Wednesday, June 21, 2000 5:37 PM
Subject: Re: Bengali: variants of same conjunct

Thanks everybody, now I am in possession of the replies and I find that you have found it just right. The gif Abdul has put up is the exact thing I am referring to: http://www.btinternet.com/~abdulmalik/nyu.gif

Now, if I cannot implement the available normal options (and Abdul has offered a number of options), what in your opinion would be the best possible method to implement this type of variant in the present scenario? To be honest, I am basically a font/web designer and am just learning about the Unicode enabling of Indic fonts.

Thanks again,
Arijit
RE: Linguistic precedence [was: (TC304.2313) AND/OR:
Well, "Gre" does not appear between "Deu" and "Esp" in any European language, but "Gre" does appear between "Ger" and "Spa", so I am assuming English names were being used here?

Michael

--
From: Robert A. Rosenberg [SMTP:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2000 1:27 PM
To: Unicode List
Cc: Unicode List
Subject: RE: Linguistic precedence [was: (TC304.2313) AND/OR:

At 07:53 AM 06/15/2000 -0800, Michael Kaplan (Trigeminal Inc.) wrote:

> Eventually someone will have a language name that does not fit, or a language like German will insist on sorting sooner, under Deutsch rather than under German, etc. (which I personally think makes more sense than making a locale take someone's translation of their language name, FWIW).

Since it was stated that Greek was displayed between German and Spanish, I'd assume that German was Deutsch, since Spanish is Espanol (not sure if that "n" is "n" or "ñ", or if my spelling is correct).
RE: Linguistic precedence [was: (TC304.2313) AND/OR: antediluvian
> On the cover of my French driver's license, it says "Driving license" in 10 languages (all the EU languages at the time it was printed). The titles are ordered alphabetically by the name of the language in the language itself. The Portuguese don't seem to mind. (Fair enough, this only works because all but one of the EU languages use the Latin script, and Greek has a standard transliteration. It comes between German and Spanish.)

> Sounds great...but *which* alphabetic ordering are we talking about? :-)

Actually, in the case of the 10 EU languages being referred to, I do not think there would be any dissension as to the order, would there be? Admittedly, if Lithuania were in the EU and there were names that started with a "Y" there as well, there would be problems with people who did not understand the order and thought that the "Y" names were jumping the queue; but AFAIK there is no real difference except for Greek, and no one disagrees with its placement.

Michael
RE: Linguistic precedence [was: (TC304.2313) AND/OR: antediluvian
> I admit to nitpicking, because in this particular case, the language names, we may be just lucky, so that there are no collation conflicts.

I believe this is an accurate statement... we ARE lucky, so far.

> But believing that there is a collation order that works across all the European languages is a very hopeless fallacy.

I agree. Eventually someone will have a language name that does not fit, or a language like German will insist on sorting sooner, under Deutsch rather than under German, etc. (which I personally think makes more sense than making a locale take someone's translation of their language name, FWIW). People can always find something to fight about; sometimes we help them by not frustrating them and making them look quite so hard. :-)

> (Has somebody written a comprehensive collection of all these collation problems?)

Well, the only ones I regularly deal with are the Traditional Spanish sort and the Lithuanian sort; there are undoubtedly others. The accent/grave issues are easier, and few people would really object to sorts that place the characters with diacriticals right after the base characters. So it is really only DIFFERENT sorts that make people upset.

> And, in future, Lithuania may be a member of the EU. (Where *do* they sort 'Y', by the way?)

They sort "Y" after "I". The second i18n bug in a software product that I ever had to fix was related to this issue. :-)

Michael
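The Lithuanian rule mentioned here can be sketched with a custom sort key; the alphabet string below reflects the conventional Lithuanian letter order (stated here as an assumption, not taken from the thread), and the example shows why a naive code-point sort gets it wrong:

```python
# Lithuanian alphabet order: note that y sorts right after i/į,
# not in its Latin code-point position near the end.
LT_ALPHABET = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"
LT_RANK = {ch: i for i, ch in enumerate(LT_ALPHABET)}

def lt_key(word: str):
    # Map each letter to its rank in the Lithuanian alphabet.
    return [LT_RANK[ch] for ch in word.lower()]

words = ["jau", "yla", "ilgas"]
assert sorted(words, key=lt_key) == ["ilgas", "yla", "jau"]
assert sorted(words) == ["ilgas", "jau", "yla"]  # naive code-point sort differs
```

This is the shape of most of the collation fights in the thread: the characters are fine, but the *order* relation is locale-specific, so any "one sort fits all European languages" scheme eventually surprises someone.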