Re: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

Philippe de Rochambeau Thu, 12 Sep 2002 13:28:30 -0700

Hello,

> String ucs2 = new String(byt, "UTF-8");  // turn them into a real 
> UCS-2 string


Isn't UCS-2 the same as UTF-16?

> byte[] byt = myString.getBytes("ISO8859_1");  // get the original 
> UTF-8 bytes back
> String ucs2 = new String(byt, "UTF-8");  // turn them into a real 
> UCS-2 string

If I do the above, I get the question marks back, whether I display 
the data this way

out.println( rfLibelle );       
                
or that way

out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );
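
For reference, here is a minimal sketch of the round trip being discussed.
It assumes that each char of the mangled string holds exactly one raw UTF-8
byte. The class and method names are made up for illustration, and
StandardCharsets is a newer API than the JVMs in this thread provide.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    /** Reinterprets a String whose chars each carry one raw UTF-8 byte (0x00-0xFF). */
    static String repair(String oneCharPerByte) {
        // ISO-8859-1 maps chars U+0000..U+00FF to bytes 0x00..0xFF one-to-one,
        // so this recovers the original byte sequence without loss.
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes as what they really are: UTF-8.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The bytes E8 AA 9E E5 90 B9 from this thread, stored one byte per char.
        String mangled = "\u00E8\u00AA\u009E\u00E5\u0090\u00B9";
        String repaired = repair(mangled);
        // Expect the two ideographs U+8A9E and U+5439.
        System.out.println(repaired.equals("\u8A9E\u5439")); // prints true
    }
}
```

Note that getBytes() with no argument uses the platform default encoding
rather than ISO-8859-1, so the second println above is not equivalent to
this sketch.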

I think that there is something wrong with either JRun 3.1, Windows 2000 or 
SQL Server 2000 (or a combination of them).

I don't have any problems with Tomcat 4 + PostgreSQL on MacOSX.

Best regards,

Philippe de Rochambeau

On Thursday, 12 Sep 2002, at 18:33 Europe/Paris, Addison Phillips [wM] 
wrote:

> For some reason I don't see the original email, so I'm going to 
> guess based on Marco's response below.
>
> The code below is nearly correct, assuming that the starting point was 
> that each UTF-8 byte was converted into a single Java char 
> object in the String. That is, if the String contained the sequence 
> U+00E8 U+00AA U+009E..., the code would be:
>
> byte[] byt = myString.getBytes("ISO8859_1");  // get the original 
> UTF-8 bytes back
> String ucs2 = new String(byt, "UTF-8");  // turn them into a real 
> UCS-2 string
>
> It is very important to name the encoding in the string constructor, 
> otherwise the String constructor assumes the JVM's file.encoding, 
> most of the time.
>
> There is an annoying bug/feature in some JVMs on real Asian Windows 
> (including 2K and XP) in which the file.encoding is ignored in favor 
> of the actual System Active code page (SYS_ACP) and setting 
> -Dfile.encoding="someEncoding" doesn't work to change the String 
> constructor's default behavior. You have to be careful to always name 
> the encoding, not just rely on the system to provide it.
>
> If your original byte[] is in a real CJK encoding, then you need to 
> name that encoding instead of UTF-8 above (and you can do that by 
> getting the file.encoding system property if you are running on the 
> same platform), like so:
>
> byte[] byt = myString.getBytes("ISO8859_1");
> String ucs2 = new String(byt, System.getProperty("file.encoding"));
>
> If the original byte[] is actually correctly formed and you want to 
> get UTF-8, Marco's code is correct:
>
> byte[] utf8bytes = myString.getBytes("UTF-8");
>
> Note that I have omitted try/catch blocks for clarity, but the 
> compiler will insist on them...
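
As a sketch of what the omitted try/catch might look like: the String-name
overloads of getBytes and the String constructor declare the checked
java.io.UnsupportedEncodingException. The class name and the hex helper
below are hypothetical, added only to make the example self-contained.

```java
import java.io.UnsupportedEncodingException;

class NamedEncodingDemo {
    /** Encodes text as UTF-8 via the String-name API, handling the checked exception. */
    static byte[] toUtf8(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this is not expected.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    /** Formats bytes as space-separated upper-case hex, e.g. "E8 AA 9E". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+8A9E followed by U+5439, the two ideographs from this thread.
        byte[] utf8bytes = toUtf8("\u8A9E\u5439");
        System.out.println(hex(utf8bytes)); // prints E8 AA 9E E5 90 B9
    }
}
```
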
>
> Hope that helps.
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Director, Globalization Architecture
> webMethods, Inc.
> 432 Lakeside Drive
> Sunnyvale, California, USA
> +1 408.962.5487 (phone)
> +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture.
> It is not a feature.
>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
>> Behalf Of Marco Cimarosti
>> Sent: Thursday, September 12, 2002 4:51 AM
>> To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
>> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
>> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>>
>>
>> Philippe de Rochambeau wrote:
>>> On the other hand, if I store the previous "go" character
>>> plus an unusual
>>> CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8)
>>> in the DB and retrieve the data, JRun 3.1 will only display the first
>>> character in my form's textarea, plus a few invisible
>>> characters, and the
>>> database will contain the following hex values:
>>>
>>> E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
>>>
>>> As you can see, "go" is still there, but the following
>>> character (E5 3F B9)
>>> is not \u5439 (E5 90 B9). I cannot figure out how to fix this 
>>> problem.
>>>
>>> Any help with this problem would be much appreciated.
>>
>> I see what the problem is. As usual, it's all the fault of Bill 
>> Gate$. :-)
>>
>> If you interpret <E5, 90, B9> according to Windows-1252, you see
>> that E5 is
>> "å", B9 is "¹", but 90 is an unassigned slot! Unassigned characters 
>> are
>> normally turned into question marks, and "?"'s code is (guess
>> what) 3F...
>>
>> <E8, AA, 9E> this works only by chance, because all three bytes are 
>> valid
>> Windows-1252 characters: "è", "ª", and "ž", respectively.
>>
>> I guess that the problem starts when you try to fool the system into
>> thinking that the text is ISO 8859-1:
>>
>>      byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
>>      String tempUtf16 = new String( byt );
>>
>> But, sorry. I can't help with a fix, because I don't know the Java APIs 
>> well
>> enough.
>>
>> Can't you do something like <.getBytes("UTF-8")>? Or, even better, 
>> doesn't
>> (newQfLibelleArray[i]) have a method to return a <String> object 
>> directly?
>>
>> _ Marco
>>
>>
>>
>>
>
>
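
Marco's Windows-1252 explanation quoted above can be checked with a small
sketch. It assumes that some layer decodes the stored UTF-8 bytes as
Windows-1252 and then encodes them back. The class and helper names are
hypothetical, and the result relies on Java's windows-1252 charset treating
byte 0x90 as unmapped, which is worth verifying on the JVM in question.

```java
import java.nio.charset.Charset;

class Cp1252LossDemo {
    /** Decodes bytes as windows-1252 and encodes the resulting text back again. */
    static byte[] throughCp1252(byte[] original) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Bytes with no windows-1252 assignment decode to U+FFFD here...
        String asText = new String(original, cp1252);
        // ...and U+FFFD cannot be encoded, so it comes back as '?' (0x3F).
        return asText.getBytes(cp1252);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] go = {(byte) 0xE8, (byte) 0xAA, (byte) 0x9E};    // UTF-8 for U+8A9E
        byte[] other = {(byte) 0xE5, (byte) 0x90, (byte) 0xB9}; // UTF-8 for U+5439
        System.out.println(hex(throughCp1252(go)));    // prints E8 AA 9E
        System.out.println(hex(throughCp1252(other))); // prints E5 3F B9
    }
}
```

The second line reproduces the E5 3F B9 sequence seen in the database, while
the first survives only because all three of its bytes happen to be assigned
in Windows-1252.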
