RE: UTF-8 and UTF-16 issues
Yes, we've done this a lot. Converting to UTF-8 is usually a very easy transition. Switching to UTF-16, by contrast, means that you have to start representing text using a different data type (not char), and that usually requires a lot of rework. How much probably depends on how much legacy text processing code you have.

You will need a few UTF-8 aware versions of string handling functions. For example, you won't be able to use the OS mblen function, because your locale encoding probably won't be UTF-8. Of course you'll also need conversion functions to convert between UTF-8 and the current OS/locale encoding.

Joe

"Jones, Bob" <[EMAIL PROTECTED]> on 06/28/2000 04:56:16 PM

To: "Unicode List" <[EMAIL PROTECTED]>
cc: (bcc: Joe Ross/Tivoli Systems)
Subject: RE: UTF-8 and UTF-16 issues

Has anyone out there taken a cross platform non-Unicode enabled legacy application and converted it to run UTF-8 instead of UTF-16? I've read Markus Kuhn's UTF-8/Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html and while it was helpful, it only addresses Unix. I also have to consider Windows and the AS/400.

With Windows, I assume you would have to build everything with _UNICODE defined, but leave your strings as char * or char arrays, and convert from UTF-16 to UTF-8 at every input point. Or is there some other way within Windows to tell the OS to give you data in Unicode, especially in a chosen encoding scheme?

If you have done this with a legacy app, what problems did you run into? Would you do it that way again, or would you go ahead and bite the bullet and modify all your code to be able to handle UTF-16?

It seems to me that the big advantage of processing in UTF-16 instead of UTF-8 is that sizes are consistent for a given number of characters, but the big advantage of UTF-8 is that normal string manipulation still works and less code needs to be modified. What other pros and cons for the different encoding schemes are there?

Also, what kind of tricks are used to deal with database column sizes? Currently, our applications run on SQL Server, Oracle, and DB2/400, and they all handle Unicode differently. Right now we have the same database column layout no matter which database is used. I suspect that may have to change, i.e. CHAR(10) becomes NCHAR(10) on SQL Server, CHAR(30) on Oracle, and CHAR(20) on DB2/400.

Thanks,
Bob Jones
[EMAIL PROTECTED]

-----Original Message-----
From: Edward Cherlin [mailto:[EMAIL PROTECTED]]
Sent: Sunday, June 25, 2000 7:01 PM
To: Unicode List
Subject: Re: UTF-8 and UTF-16 issues

At 2:48 PM -0800 6/19/00, Markus Scherer wrote:
>"OLeary, Sean (NJ)" wrote:
> > UTF-16 is the 16-bit encoding of Unicode that includes the use of
> > surrogates. This is essentially a fixed width encoding.
>
>certainly not. utf-16, of course, is variable-width: 1 or 2 16-bit
>units per character. certainly the iuc discussion did not spread
>this under "utf-16" but possibly as "ucs-2".

[snip]

The essential distinction that Sean refers to is not that all characters are encoded in the same length, but that all coding elements are of the same length. This is in contrast not with ISO 10646, but with "double-byte" encodings of CJK text, where escape sequences are used to switch between runs of 8-bit and 16-bit codes.

The point of the distinction is that in double-byte encodings the only way to tell the length of the current character is by parsing from the beginning of the file. In Unicode, the current 16-bit value is explicitly a 16-bit character code (assigned, unassigned, or Private Use), an upper surrogate code, a lower surrogate code, or not a character code, without reference to what has gone before in the file.

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland
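
To make two points from this message concrete, here is a minimal sketch in Python (chosen for brevity; the thread itself concerns C code). It shows the kind of UTF-8 aware stand-in for mblen that Joe mentions, and the context-free classification of a single UTF-16 code unit that Edward Cherlin describes. The function names are hypothetical and appear nowhere in the thread. The UTF-8 function looks only at the lead byte and does not validate the continuation bytes, so it illustrates the idea and is not a complete validator.

```python
def utf8_char_len(data: bytes, i: int = 0) -> int:
    """Byte length of the UTF-8 sequence starting at data[i].

    Decided from the lead byte alone; continuation bytes are NOT checked.
    Raises ValueError where C's mblen would return -1.
    """
    b = data[i]
    if b < 0x80:                 # ASCII
        return 1
    if 0xC2 <= b <= 0xDF:        # lead byte of a 2-byte sequence
        return 2
    if 0xE0 <= b <= 0xEF:        # lead byte of a 3-byte sequence
        return 3
    if 0xF0 <= b <= 0xF4:        # lead byte of a 4-byte sequence
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF never occur
    raise ValueError(f"byte 0x{b:02X} cannot start a UTF-8 sequence")


def utf16_unit_kind(unit: int) -> str:
    """Classify one 16-bit code unit without looking at its neighbours."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"  # "upper surrogate" in the message above
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"   # "lower surrogate" in the message above
    return "BMP code unit"


def count_utf8_chars(data: bytes) -> int:
    """Count characters by stepping from lead byte to lead byte."""
    i = n = 0
    while i < len(data):
        i += utf8_char_len(data, i)
        n += 1
    return n
```

Because each lead byte or code unit identifies itself, a scan can start at any position in the buffer, which is the contrast with escape-sequence based double-byte encodings drawn above.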
RE: Java, SQL, Unicode and Databases
Michael, are you saying that the data type (char or nchar) doesn't matter? Are you saying that if we just use UTF-16 or wchar_t interfaces to access the data, all will be fine and we will be able to store multilingual data even in fields defined as char? Maybe things aren't as bad as I feared.

With respect to the web applications you describe, do they store the UTF-8 as binary data? This wouldn't work for us, since we want other data mining applications to be able to access the same data.

Thanks,
Joe

"Michael Kaplan (Trigeminal Inc.)" <[EMAIL PROTECTED]> on 06/23/2000 10:41:39 AM

To: Unicode List <[EMAIL PROTECTED]>, Joe Ross/Tivoli Systems@Tivoli Systems
cc: Hossein Kushki@IBMCA
Subject: RE: Java, SQL, Unicode and Databases

Microsoft is very COM-based for its actual data access methods, and COM uses BSTRs that are BOM-less UTF-16. Because of that, the actual storage format of any database ends up irrelevant, since it will be converted to UTF-16 anyway. Given that this is what the data layers do, performance is certainly better if there does not have to be an extra call to the Windows MultiByteToWideChar to convert UTF-8 to UTF-16.

So from a Windows perspective, not only is it no trouble, but it is also the best possible solution!

In any case, I know plenty of web people who *do* encode their strings in SQL Server databases as UTF-8 for web applications, since UTF-8 is their preference. They are willing to take the hit of "converting themselves" because when data is being read it is faster to go through no conversions at all.

Michael

> ----------
> From: [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
> Sent: Friday, June 23, 2000 7:55 AM
> To: Unicode List
> Cc: Unicode List; [EMAIL PROTECTED]
> Subject: Re: Java, SQL, Unicode and Databases
>
> I think that this is also true for DB2 using UTF-8 as the database encoding.
> From an application perspective, MS SQL Server is the one that gives us the
> most trouble, because it doesn't support UTF-8 as a database encoding for
> char, etc.
> Joe
>
> Kenneth Whistler <[EMAIL PROTECTED]> on 06/22/2000 06:42:20 PM
>
> To: "Unicode List" <[EMAIL PROTECTED]>
> cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe
> Ross/Tivoli Systems)
> Subject: Re: Java, SQL, Unicode and Databases
>
> Jianping responded:
> >
> > Tex,
> >
> > Oracle doesn't have special requirement for datatype in JDBC driver if
> > you use UTF8 as database character set. In this case, all the text
> > datatype in JDBC will support Unicode data.
> >
>
> The same thing is, of course, true for Sybase databases using UTF-8
> as the database character set, accessing them through a JDBC driver.
>
> But I think Tex's question is aimed at the much murkier area
> of what the various database vendors' strategies are for dealing
> with UTF-16 Unicode as a datatype. In that area, the answers for
> what a cross-platform application vendor needs to do and for how
> JDBC drivers might abstract differences in database implementations
> are still unclear.
>
> --Ken
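
The conversion step Michael refers to (UTF-8 stored by the application, BOM-less UTF-16 expected by the data layer) can be sketched as follows. This is an illustration using Python's built-in codecs, not the Windows MultiByteToWideChar call itself; the function names are hypothetical, and little-endian byte order is an assumption made here because that is what Windows uses.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 little-endian with no byte order mark."""
    # The "utf-16-le" codec never writes a BOM; plain "utf-16" would add one.
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode BOM-less UTF-16 little-endian bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    text = "Gr\u00fc\u00dfe \u4e16\u754c"      # 8 characters, all in the BMP
    stored = text.encode("utf-8")               # what a UTF-8 application would store
    wide = utf8_to_utf16le(stored)              # what a UTF-16 interface would see
    print(len(text), len(stored), len(wide))    # character count, UTF-8 bytes, UTF-16 bytes
    assert utf16le_to_utf8(wide) == stored      # the round trip loses nothing
```

Every read or write of UTF-8 data through a UTF-16 interface pays for one such pass over the string, which is the "hit" the web developers in the message above choose to accept.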
Re: Java, SQL, Unicode and Databases
Yes, version 7. It requires us to use a different data type (nchar) if we want to store multilingual text as UTF-16. We want our applications to be database vendor independent so that customers can use any database under the covers. If all databases supported UTF-8 as an encoding for char, we could support multilingual data in the same way for all vendors. As it is, we have to use a different schema for MS SQL Server than we do for the others.

Joe

"Tex Texin" <[EMAIL PROTECTED]> on 06/23/2000 11:50:06 AM

To: Joe Ross/Tivoli Systems@Tivoli Systems
cc: Unicode List <[EMAIL PROTECTED]>, Hossein Kushki@IBMCA, Vladimir Dvorkin <[EMAIL PROTECTED]>, Steven Watt <[EMAIL PROTECTED]>
Subject: Re: Java, SQL, Unicode and Databases

Joe,

Can you expand on this a bit more? Privately if you prefer. Do you mean version 7 of MS SQL Server? I assume if it doesn't have UTF-8, it uses UTF-16. How does this being the storage encoding, become problematic?

tex

[EMAIL PROTECTED] wrote:
>
> I think that this is also true for DB2 using UTF-8 as the database encoding.
> From an application perspective, MS SQL Server is the one that gives us the most
> trouble, because it doesn't support UTF-8 as a database encoding for char, etc.
> Joe
>
> Kenneth Whistler <[EMAIL PROTECTED]> on 06/22/2000 06:42:20 PM
>
> To: "Unicode List" <[EMAIL PROTECTED]>
> cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe Ross/Tivoli
> Systems)
> Subject: Re: Java, SQL, Unicode and Databases
>
> Jianping responded:
> >
> > Tex,
> >
> > Oracle doesn't have special requirement for datatype in JDBC driver if you use
> > UTF8 as database character set. In this case, all the text datatype in JDBC
> > will support Unicode data.
> >
>
> The same thing is, of course, true for Sybase databases using UTF-8
> as the database character set, accessing them through a JDBC driver.
>
> But I think Tex's question is aimed at the much murkier area
> of what the various database vendors' strategies are for dealing
> with UTF-16 Unicode as a datatype. In that area, the answers for
> what a cross-platform application vendor needs to do and for how
> JDBC drivers might abstract differences in database implementations
> are still unclear.
>
> --Ken

--
Tex Texin
Director, International Products
Progress Software Corp.
14 Oak Park
Bedford, MA 01730 USA
+1-781-280-4271
+1-781-280-4655 (Fax)
[EMAIL PROTECTED]

http://www.progress.com The #1 Embedded Database
http://www.SonicMQ.com JMS Compliant Messaging - Best Middleware Award
http://www.aspconnections.com Leading provider in the ASP marketplace

Progress Globalization Program (New URL)
http://www.progress.com/partners/globalization.htm

Come to the Panel on Open Source Approaches to Unicode Libraries at the Sept. Unicode Conference
http://www.unicode.org/iuc/iuc17
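
The per-vendor schema difference Joe describes can be sketched as a small mapping from a logical column length to a column type. This is a hypothetical illustration only: the vendor keys and the function name are invented here, the sizes follow the guess Bob Jones makes in the "UTF-8 and UTF-16 issues" thread (CHAR(10) becoming NCHAR(10) on SQL Server and CHAR(30) on a UTF-8 database), and it assumes that CHAR lengths are counted in bytes and that only BMP characters are stored. Check each vendor's documentation before relying on any of these assumptions.

```python
# A BMP character needs at most 3 bytes in UTF-8.
# Supplementary characters need 4, which this sketch deliberately ignores.
MAX_UTF8_BYTES_PER_BMP_CHAR = 3


def column_type(vendor: str, n_chars: int) -> str:
    """Return a column type able to hold n_chars BMP characters.

    The vendor keys are illustrative labels, not product identifiers.
    """
    if vendor == "sqlserver":
        # UTF-16 storage through the separate national character type
        return f"NCHAR({n_chars})"
    if vendor == "utf8_char_db":
        # char columns in a database whose character set is UTF-8,
        # assuming the declared length is a byte count
        return f"CHAR({n_chars * MAX_UTF8_BYTES_PER_BMP_CHAR})"
    raise ValueError(f"unknown vendor key: {vendor!r}")


if __name__ == "__main__":
    for vendor in ("sqlserver", "utf8_char_db"):
        print(vendor, column_type(vendor, 10))
```

Keeping the mapping in one place confines the vendor-specific part of the schema to a single function, though the application still ends up with a different column definition per database, which is the problem Joe raises.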
Re: Java, SQL, Unicode and Databases
I think that this is also true for DB2 using UTF-8 as the database encoding. From an application perspective, MS SQL Server is the one that gives us the most trouble, because it doesn't support UTF-8 as a database encoding for char, etc.

Joe

Kenneth Whistler <[EMAIL PROTECTED]> on 06/22/2000 06:42:20 PM

To: "Unicode List" <[EMAIL PROTECTED]>
cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe Ross/Tivoli Systems)
Subject: Re: Java, SQL, Unicode and Databases

Jianping responded:
>
> Tex,
>
> Oracle doesn't have special requirement for datatype in JDBC driver if you use
> UTF8 as database character set. In this case, all the text datatype in JDBC
> will support Unicode data.
>

The same thing is, of course, true for Sybase databases using UTF-8 as the database character set, accessing them through a JDBC driver.

But I think Tex's question is aimed at the much murkier area of what the various database vendors' strategies are for dealing with UTF-16 Unicode as a datatype. In that area, the answers for what a cross-platform application vendor needs to do and for how JDBC drivers might abstract differences in database implementations are still unclear.

--Ken