Charset Transformation - (previously) Re: [VOTE] Re: 2.0 release

Sung-Gu Fri, 11 Jul 2003 22:52:29 -0700

You got me!  ;)

Sung-Gu



----- Original Message ----- 
From: "André-John Mas" <[EMAIL PROTECTED]>
To: "Commons HttpClient Project" <[EMAIL PROTECTED]>
Sent: Friday, June 27, 2003 12:23 AM
Subject: Re: [VOTE] Re: 2.0 release - deprecate some methods?


> This doesn't look correct. If you really want to convert
> from one charset to another, then you would have to do something
> such as:
> 
>     String myString = new String(bytes,bytesCharset);
>     byte[] bytes2 = myString.getBytes(destCharset);
> 
> Until you have the bytes, you don't have the final output, since
> strings will be affected by the platform's native encoding if
> you aren't careful. Otherwise, if your destination is an OutputStream,
> then let the OutputStreamWriter do the work for you:
> 
>     String myString = new String(bytes, bytesCharset);
>     OutputStreamWriter out =
>         new OutputStreamWriter(outStream, destCharset);
>     out.write(myString);
>     out.flush();  // push the encoded bytes through to outStream
> 
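The two fragments above can be put together into something runnable. This is only a minimal sketch: the class and method names (CharsetTranscode, transcode, transcodeViaWriter) are made up for illustration, and it assumes Java 7 or later for StandardCharsets and try-with-resources, neither of which existed when this thread was written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HttpClient.
public class CharsetTranscode {

    // Decode the bytes with the charset they were written in, then
    // re-encode the resulting String in the destination charset.
    public static byte[] transcode(byte[] bytes, Charset from, Charset to) {
        String decoded = new String(bytes, from);
        return decoded.getBytes(to);
    }

    // Same conversion, but an OutputStreamWriter does the encoding
    // as the text is written to the stream.
    public static byte[] transcodeViaWriter(byte[] bytes, Charset from, Charset to)
            throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, to)) {
            out.write(new String(bytes, from));
        } // closing the writer flushes the encoded bytes into sink
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = { (byte) 0xE9 }; // e-acute in ISO-8859-1
        byte[] utf8 = transcode(latin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2 (bytes 0xC3 0xA9)
    }
}
```

Both methods name the charset on every conversion, so neither depends on the platform default.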
> I have just had to write a project that is fully UTF-8 compliant
> and it taught me a lot about what Java does. Without any encoding
> specified, the string conversion defaults to the platform native
> format, which is not always what you want. I had to go everywhere
> and make sure the right conversions were being performed.
> 
> regards
> 
> Andre
> 
> Laura Werner wrote:
> 
> > Adrian Sutton wrote:
> > 
> >> The flaw in the toUsingCharset method is two-fold:
> >> Firstly, Strings in Java are *always* stored internally as UTF-8
> > 
> > 
> > 
> > I agree with the rest of your analysis of this, but I thought I should 
> > point out that Java Strings and "char"s are stored in UTF-16 rather than 
> > UTF-8.  A "char" is an unsigned, two-byte value that can hold all the 
> > characters from UCS-2.
> > 
> > As far as toUsingCharset goes, I agree that it looks broken.  The code 
> > basically does:
> > 
> >            return new String(target.getBytes(fromCharset), toCharset);
> > 
> > It's taking "target", which is a UTF-16 string, encoding it into a byte 
> > array in "fromCharset", and then decoding those bytes back into UTF-16 
> > using "toCharset".  So it's pretending the bytes in the array have two 
> > different meanings, one when it writes them and one when it reads them 
> > immediately afterward.  I can't see how this could be correct.
> > 
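A small experiment makes the problem described above concrete. This is a sketch only: the method below is a stand-in with the same shape as the quoted expression, not the actual HttpClient source, and the class name ToUsingCharsetDemo is made up for illustration.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; toUsingCharset here only mirrors the quoted one-liner.
public class ToUsingCharsetDemo {

    // Encode the string with one charset, then decode the same bytes
    // as if they had been written in a different charset.
    public static String toUsingCharset(String target, String fromCharset, String toCharset)
            throws UnsupportedEncodingException {
        return new String(target.getBytes(fromCharset), toCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "caf\u00e9"; // "cafe" with an e-acute, 4 chars
        String result = toUsingCharset(original, "UTF-8", "ISO-8859-1");

        // The e-acute became two UTF-8 bytes (0xC3 0xA9), and each byte
        // was then read back as a separate ISO-8859-1 character.
        System.out.println(original.equals(result)); // prints false
        System.out.println(result.length());         // prints 5
    }
}
```

The result is still an ordinary Java String, but its contents no longer match the input, which is the "two different meanings" problem in action.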
> > -- Laura
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: 
> > [EMAIL PROTECTED]
> > For additional commands, e-mail: 
> > [EMAIL PROTECTED]
> > 
> > 
> 
> 
> -- 
> André-John Mas
> Software Developer / Développeur Informatique
> Newtrade Technologies
> 63 de Brésoles, Suite 100, Montreal, Quebec, Canada H2Y 1V7
> mailto:[EMAIL PROTECTED]
> tel +1 514 286-8187 x3017
> fax +1 514 221-3287
> 
