Hello,

> String ucs2 = new String(byt, "UTF-8"); // turn them into a real
> UCS-2 string

Isn't UCS-2 actually UTF-16?

> byte[] byt = myString.getBytes("ISO8859_1"); // get the original
> UTF-8 bytes back
> String ucs2 = new String(byt, "UTF-8"); // turn them into a real
> UCS-2 string

If I do the above, I get the question marks back, whether I display the
data this way

out.println( rfLibelle );

or that way

out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );

I think there is something wrong with either JRun 3.1, Windows 2000 or
SQL Server 2000 (or a combination of them). I don't have any problems
with Tomcat 4 + PostgreSQL on MacOSX.

Best regards,

Philippe de Rochambeau

On Thursday, 12 Sep 2002, at 18:33 Europe/Paris, Addison Phillips [wM] wrote:

> For some reason I don't see the original email, so I'm going to
> guess based on Marco's response below.
>
> The code below is nearly correct, assuming that the starting point was
> that each UTF-8 byte was converted into a single java.lang.Character
> object in the String. That is, if the String contained the sequence
> U+00E8 U+00AA U+009E..., the code would be:
>
> byte[] byt = myString.getBytes("ISO8859_1"); // get the original
> UTF-8 bytes back
> String ucs2 = new String(byt, "UTF-8"); // turn them into a real
> UCS-2 string
>
> It is very important to name the encoding in the String constructor,
> otherwise the constructor assumes the JVM's file.encoding (most of
> the time).
>
> There is an annoying bug/feature in some JVMs on real Asian Windows
> (including 2K and XP) in which the file.encoding is ignored in favor
> of the actual System Active code page (SYS_ACP), and setting
> -Dfile.encoding="someEncoding" doesn't change the String
> constructor's default behavior. You have to be careful to always name
> the encoding, not just rely on the system to provide it.
>
> If your original byte[] is in a real CJK encoding, then you need to
> name that encoding instead of UTF-8 above (and you can do that by
> getting the file.encoding system property if you are running on the
> same platform), like so:
>
> byte[] byt = myString.getBytes("ISO8859_1");
> String ucs2 = new String(byt, System.getProperty("file.encoding"));
>
> If the original byte[] is actually correctly formed and you want to
> get UTF-8, Marco's code is correct:
>
> byte[] utf8bytes = myString.getBytes("UTF-8");
>
> Note that I have omitted try/catch blocks for clarity, but the
> compiler will insist on them...
>
> Hope that helps.
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Director, Globalization Architecture
> webMethods, Inc.
> 432 Lakeside Drive
> Sunnyvale, California, USA
> +1 408.962.5487 (phone)
> +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture.
> It is not a feature.
>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
>> Behalf Of Marco Cimarosti
>> Sent: Thursday, September 12, 2002 4:51 AM
>> To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
>> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
>> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>>
>>
>> Philippe de Rochambeau wrote:
>>> On the other hand, if I store the previous "go" character plus an
>>> unusual CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in
>>> UTF-8) in the DB and retrieve the data, JRun 3.1 will only display
>>> the first character in my form's textarea, plus a few invisible
>>> characters, and the database will contain the following hex values:
>>>
>>> E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
>>>
>>> As you can see, "go" is still there, but the following character
>>> (E5 3F B9) is not \u5439 (E5 90 B9). I cannot figure out how to fix
>>> this problem.
>>>
>>> Any help with this problem would be much appreciated.
>>
>> I see what the problem is. As usual, it's all the fault of Bill
>> Gate$. :-)
>>
>> If you interpret <E5, 90, B9> according to Windows-1252, you see that
>> E5 is "å", B9 is "¹", but 90 is an unassigned slot! Unassigned
>> characters are normally turned into question marks, and "?"'s code is
>> (guess what) 3F...
>>
>> <E8, AA, 9E> works only by chance, because all three bytes are valid
>> Windows-1252 characters: "è", "ª", and "ž", respectively.
>>
>> I guess that the problem starts when you try to fool the system into
>> thinking that the text is ISO 8859-1:
>>
>> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
>> String tempUtf16 = new String( byt );
>>
>> But, sorry. I can't help with a fix, because I don't know the Java
>> APIs well enough.
>>
>> Can't you do something like <.getBytes("UTF-8")>? Or, even better,
>> doesn't (newQfLibelleArray[i]) have a method to return a <String>
>> object directly?
>>
>> _ Marco
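
For anyone who wants to reproduce the effect discussed in this thread, the
following is a minimal Java sketch, not code from any of the posters. It
assumes a modern JDK (java.nio.charset.StandardCharsets, Java 7 or later)
and a runtime that provides the windows-1252 charset; the class name
MojibakeDemo and its methods are made up for illustration. It shows the
ISO-8859-1 / UTF-8 repair that Addison describes, and compares a decode
and re-encode of <E5, 90, B9> through ISO-8859-1 and through Windows-1252.
In Java's Windows-1252 table, byte 90 should be undefined, so it is
expected to come back as 3F, as in the database dump.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Undo the "one char per UTF-8 byte" damage: ISO-8859-1 maps the chars
    // U+0000..U+00FF one-to-one onto bytes, so this recovers the raw bytes,
    // which are then decoded as the UTF-8 they really are.
    public static String repair(String oneCharPerByte) {
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode bytes with the given charset, then encode them again.
    public static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    // Format bytes as space-separated upper-case hex, e.g. "E5 90 B9".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters from the original report: U+8A9E and U+5439.
        byte[] utf8 = "\u8A9E\u5439".getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes:           " + hex(utf8));

        // Lossless: every byte value has an ISO-8859-1 character.
        System.out.println("via ISO-8859-1:        "
                + hex(roundTrip(utf8, StandardCharsets.ISO_8859_1)));

        // Lossy if byte 90 has no Windows-1252 character (assumption about
        // the runtime's charset table, see the note above).
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println("via windows-1252:      "
                + hex(roundTrip(utf8, cp1252)));

        // Addison's repair applied to a string holding one char per byte.
        String damaged = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("repair matches source: "
                + repair(damaged).equals("\u8A9E\u5439"));
    }
}
```

If the second and third lines of output differ, the data has passed
through a Windows-1252 conversion somewhere (driver, database column or
default platform encoding) rather than through ISO-8859-1, which is the
situation Marco describes.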