Hi Philippe,

UTF-16 is (kind of) UCS-2...
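
To make the "kind of" concrete: the two forms agree for every character in the Basic Multilingual Plane and differ only for code points above U+FFFF, which UTF-16 stores as a surrogate pair. The following is a minimal sketch, not part of the original message. It assumes Java 5 or later (for codePointCount and Character.toChars), and the class and method names are made up for illustration.

```java
public class Utf16VsUcs2 {
    // Number of 16-bit code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+8A9E lies inside the BMP: one code unit in both UCS-2 and UTF-16.
        String bmp = "\u8A9E";
        // U+20000 lies outside the BMP: UTF-16 needs two code units for it.
        String supplementary = new String(Character.toChars(0x20000));

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));                     // 1 1
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // 2 1
        // The two code units are the high and low surrogate.
        System.out.println(Integer.toHexString(supplementary.charAt(0)) + " "
                + Integer.toHexString(supplementary.charAt(1)));                        // d840 dc00
    }
}
```

For the two characters discussed in this thread (U+8A9E and U+5439) the distinction does not matter, since both are BMP characters.
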

What's your system code page? System.out.println uses your system code page to display characters; it does an implicit conversion. To check your code, try this:

    char[] c = myUCSString.toCharArray();
    for (int x = 0; x < c.length; x++) {
        System.out.print(Integer.toHexString((int) c[x]) + " ");
    }

This will show you the actual hex values of the characters (as a string).

I should note that UTF-8 isn't a valid character set for SQL Server. You need to use the nvarchar/nchar data type for your database to store Unicode. You can't choose UTF-8 as the code page for SQL Server 2000. Storing UTF-8 in your SQL Server is a recipe for problems (especially since you MUST not use code page 1252 to lie to the database).

I have more information on encodings in databases in this whitepaper (from Unicode Conference 19):

http://www.inter-locale.com/IUC19.pdf

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Philippe de Rochambeau
> Sent: Thursday, September 12, 2002 1:09 PM
> To: Addison Phillips [wM]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>
>
> Hello,
>
> > String ucs2 = new String(byt, "UTF-8"); // turn them into a real
> > UCS-2 string
>
> Isn't UCS-2, UTF-16?
>
> > byte[] byt = myString.getBytes("ISO8859_1"); // get the original
> > UTF-8 bytes back
> > String ucs2 = new String(byt, "UTF-8"); // turn them into a real
> > UCS-2 string
>
> If I do the above, I get the question marks back, whether I display
> the data this way
>
> out.println( rfLibelle );
>
> or that way
>
> out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );
>
> I think that there is something wrong with either JRun 3.1, Windows 2000
> or SQL Server 2000 (or a combination of them).
>
> I don't have any problems with Tomcat 4 + PostgreSQL on MacOSX.
>
> Best regards,
>
> Philippe de Rochambeau
>
> On Thursday, 12 Sep 2002, at 18:33 Europe/Paris, Addison Phillips [wM]
> wrote:
>
> > For some reason I don't see the original email, so I'm going to
> > guess based on Marco's response below.
> >
> > The code below is nearly correct, assuming that the starting point was
> > that each UTF-8 byte was converted into a single java.lang.Character
> > object in the String. That is, if the String contained the sequence
> > U+00E8 U+00AA U+009E..., the code would be:
> >
> > byte[] byt = myString.getBytes("ISO8859_1"); // get the original
> > UTF-8 bytes back
> > String ucs2 = new String(byt, "UTF-8"); // turn them into a real
> > UCS-2 string
> >
> > It is very important to name the encoding in the String constructor,
> > otherwise the String constructor assumes the JVM's file.encoding
> > (most of the time).
> >
> > There is an annoying bug/feature in some JVMs on real Asian Windows
> > (including 2K and XP) in which the file.encoding is ignored in favor
> > of the actual System Active code page (SYS_ACP) and setting
> > -Dfile.encoding="someEncoding" doesn't work to change the String
> > constructor's default behavior. You have to be careful to always name
> > the encoding, not just rely on the system to provide it.
> >
> > If your original byte[] is in a real CJK encoding, then you need to
> > name that encoding instead of UTF-8 above (and you can do that by
> > getting the file.encoding system property if you are running on the
> > same platform), like so:
> >
> > byte[] byt = myString.getBytes("ISO8859_1");
> > String ucs2 = new String(byt, System.getProperty("file.encoding"));
> >
> > If the original byte[] is actually correctly formed and you want to
> > get UTF-8, Marco's code is correct:
> >
> > byte[] utf8bytes = myString.getBytes("UTF-8");
> >
> > Note that I have omitted try/catch blocks for clarity, but the
> > compiler will insist on them...
> >
> > Hope that helps.
> >
> > Best Regards,
> >
> > Addison
> >
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> >> Behalf Of Marco Cimarosti
> >> Sent: Thursday, September 12, 2002 4:51 AM
> >> To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
> >> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> >> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
> >>
> >>
> >> Philippe de Rochambeau wrote:
> >>> On the other hand, if I store the previous "go" character plus an
> >>> unusual CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9
> >>> in UTF-8) in the DB and retrieve the data, JRun 3.1 will only
> >>> display the first character in my form's textarea, plus a few
> >>> invisible characters, and the database will contain the following
> >>> hex values:
> >>>
> >>> E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
> >>>
> >>> As you can see, "go" is still there, but the following character
> >>> (E5 3F B9) is not \u5439 (E5 90 B9).
> >>> I cannot figure out how to fix this problem.
> >>>
> >>> Any help with this problem would be much appreciated.
> >>
> >> I see what the problem is. As usual, it's all the fault of Bill
> >> Gate$. :-)
> >>
> >> If you interpret <E5, 90, B9> according to Windows-1252, you see
> >> that E5 is "å", B9 is "¹", but 90 is an unassigned slot! Unassigned
> >> characters are normally turned into question marks, and "?"'s code
> >> is (guess what) 3F...
> >>
> >> <E8, AA, 9E> works only by chance, because all three bytes are
> >> valid Windows-1252 characters: "è", "ª", and "ž", respectively.
> >>
> >> I guess that the problem starts when you try to fool the system into
> >> thinking that the text is ISO 8859-1:
> >>
> >> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
> >> String tempUtf16 = new String( byt );
> >>
> >> But, sorry. I can't help with a fix, because I don't know the Java
> >> APIs well enough.
> >>
> >> Can't you do something like <.getBytes("UTF-8")>? Or, even better,
> >> doesn't (newQfLibelleArray[i]) have a method to return a <String>
> >> object directly?
> >>
> >> _ Marco
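
The round trip discussed in this thread (getBytes("ISO8859_1") followed by new String(byt, "UTF-8")) can be tried in isolation with a small sketch like the one below. It is an illustration under stated assumptions, not code from any of the posters. It assumes Java 7 or later for java.nio.charset.StandardCharsets, and it assumes the mangled String holds exactly one char per original UTF-8 byte, with byte 90 still intact as U+0090. The class name Utf8RoundTrip and the methods hex and repair are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Hex dump of each UTF-16 code unit, in the spirit of the loop at the top of the thread.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(Integer.toHexString(c)).append(' ');
        }
        return sb.toString().trim();
    }

    // Assumes every char in 'mangled' is in the range U+0000..U+00FF and
    // stands for one original UTF-8 byte. ISO-8859-1 maps those chars back
    // to the same byte values, so the UTF-8 decode sees the original bytes.
    static String repair(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The six UTF-8 bytes E8 AA 9E E5 90 B9, each stored as one char.
        String mangled = "\u00E8\u00AA\u009E\u00E5\u0090\u00B9";
        System.out.println(hex(mangled));          // e8 aa 9e e5 90 b9
        System.out.println(hex(repair(mangled)));  // 8a9e 5439
    }
}
```

The repair only works while the original byte values survive. Once byte 90 has been replaced by 3F, as in the database dump quoted above, the information is gone and no re-decoding brings U+5439 back. Passing a Charset object instead of an encoding name also removes the need for the try/catch blocks mentioned in the thread, since these overloads do not throw UnsupportedEncodingException.
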