Re: [VOTE] Re: 2.0 release - deprecate some methods?

secoskem Thu, 26 Jun 2003 14:39:26 -0700

I just finished internationalizing a website, both input and output.  I 
ran into exactly the same issues and, as Andre says, you pretty much need 
to check every byte[] -> String and String -> byte[] conversion.

Then do the conversions he's given.  I personally liked the more terse:

        byte[] outbytes =
            new String(inbytes, inputEncoding).getBytes(outputEncoding);
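
For anyone who wants to try the one-liner, here is a minimal runnable sketch. The class name Transcode and the helper transcode() are hypothetical, and the sample charsets (ISO-8859-1 in, UTF-8 out) are only an assumption for illustration. It uses Charset.forName() so that no checked exception has to be declared, which needs the Charset overloads of String found in Java 6 and later.

```java
import java.nio.charset.Charset;

class Transcode {
    // Hypothetical helper: decode inbytes using inputEncoding,
    // then re-encode the resulting String using outputEncoding.
    static byte[] transcode(byte[] inbytes, String inputEncoding, String outputEncoding) {
        Charset in = Charset.forName(inputEncoding);
        Charset out = Charset.forName(outputEncoding);
        return new String(inbytes, in).getBytes(out);
    }

    public static void main(String[] args) {
        // 0xE9 is e-acute in ISO-8859-1; UTF-8 encodes it as two bytes.
        byte[] latin1 = { (byte) 0xE9 };
        byte[] utf8 = transcode(latin1, "ISO-8859-1", "UTF-8");
        System.out.println("input bytes: " + latin1.length + ", output bytes: " + utf8.length);
    }
}
```

The point is that both charsets are named explicitly, so the result does not depend on the machine the code runs on.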

- Matt Secoske






André-John Mas <[EMAIL PROTECTED]>
06/26/2003 03:49 PM
Please respond to "Commons HttpClient Project"

 
        To:     Commons HttpClient Project <[EMAIL PROTECTED]>
        cc: 
        Subject:        Re: [VOTE] Re: 2.0 release - deprecate some methods?


This doesn't look correct. If you really want to convert
from one charset to another, then you would have to do something
such as:

    String myString = new String(bytes,bytesCharset);
    byte[] bytes2 = myString.getBytes(destCharset);

Until you have the bytes, you don't have the final output, since
strings will be affected by the platform's native encoding if
you aren't careful. Otherwise, if your destination is an OutputStream,
then let the OutputStreamWriter do the work for you:

    String myString = new String(bytes,bytesCharset);
    OutputStreamWriter out = new
        OutputStreamWriter(outStream, destCharset);
    out.write(myString);
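
A self-contained sketch of the writer approach, in case it helps. The class WriterTranscode and the helper copy() are hypothetical names, and a ByteArrayOutputStream stands in for the real destination stream. Note the flush(): OutputStreamWriter buffers encoded output, so nothing is guaranteed to reach the underlying stream until it is flushed or closed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class WriterTranscode {
    // Hypothetical helper: decode bytes as bytesCharset and write them
    // to outStream encoded as destCharset.
    static void copy(byte[] bytes, String bytesCharset,
                     OutputStream outStream, String destCharset) {
        String myString = new String(bytes, Charset.forName(bytesCharset));
        try {
            Writer out = new OutputStreamWriter(outStream, Charset.forName(destCharset));
            out.write(myString);
            out.flush(); // push buffered output to outStream; the stream stays open
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // 0xE9 is e-acute in ISO-8859-1.
        copy(new byte[] { (byte) 0xE9 }, "ISO-8859-1", sink, "UTF-8");
        System.out.println("bytes written: " + sink.size());
    }
}
```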

I have just had to write a project that is fully UTF-8 compliant
and it taught me a lot about what Java does. Without any encoding
specified, the string conversions default to the platform's native
encoding, which is not always what you want. I had to go everywhere
and make sure the right conversions were being performed.
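
To illustrate the default-encoding point, here is a small sketch (DefaultCharsetDemo and its two helpers are hypothetical names). It checks that the no-argument getBytes() gives the same bytes as encoding with the JVM's default charset, and contrasts that with an explicit UTF-8 conversion, which is the same on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {
    // True when the no-argument getBytes() matches encoding with
    // the JVM default charset, i.e. the result is platform dependent.
    static boolean usesPlatformDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    // Explicit encoding: independent of the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an e-acute
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() follows default: " + usesPlatformDefault(s));
        System.out.println("explicit UTF-8 length: " + utf8(s).length);
    }
}
```

The first line of output varies from machine to machine, which is exactly the problem with leaving the encoding unspecified.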

regards

Andre

Laura Werner wrote:

> Adrian Sutton wrote:
> 
>> The flaw in the toUsingCharset method is two-fold:
>> Firstly, Strings in Java are *always* stored internally as UTF-8
> 
> 
> 
> I agree with the rest of your analysis of this, but I thought I should 
> point out that Java Strings and "char"s are stored in UTF-16 rather than 
> UTF-8.  A "char" is an unsigned, two-byte value that can hold all the 
> characters from UCS-2.
> 
> As far as toUsingCharset goes, I agree that it looks broken.  The code 
> basically does:
> 
>            return new String(target.getBytes(fromCharset), toCharset);
> 
> It's taking "target", which is a UTF-16 string, encoding it into a byte 
> array in "fromCharset", and then decoding those bytes back into UTF-16 
> using "toCharset".  So it's pretending the bytes in the array have two 
> different meanings, one when it writes them and one when it reads them 
> immediately afterward.  I can't see how this could be correct.
> 
> -- Laura
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: 
> [EMAIL PROTECTED]
> For additional commands, e-mail: 
> [EMAIL PROTECTED]
> 
> 


-- 
André-John Mas
Software Developer / Développeur Informatique
Newtrade Technologies
63 de Brésoles, Suite 100, Montreal, Quebec, Canada H2Y 1V7
mailto:[EMAIL PROTECTED]
tel +1 514 286-8187 x3017
fax +1 514 221-3287

----------------------------------------------------------------------
If you have received this message in error, please notify the sender
immediately and delete the original without making a copy, disclosing
its contents or taking any action based thereon.

Si vous avez reçu ce message par erreur, veuillez en aviser
immédiatement le signataire et effacer l'original, sans en tirer de
copie, en dévoiler le contenu ni prendre quelque mesure fondée sur
celui-ci.





