Re: Unicode character transformation through XSLT

Markus Scherer Wed, 12 Mar 2003 15:29:29 -0800

In general, wrap your input in an InputStreamReader (or a similar Reader) constructed with an explicit encoding name such as "UTF8". The reader converts the UTF-8 bytes into the 16-bit Unicode (UTF-16) that Java uses internally for char and String.
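
As a rough sketch of that suggestion: the class and method names below are made up for illustration, and the input is an in-memory byte array rather than your database column, so treat it as an outline under those assumptions rather than a drop-in fix.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of the original code.
final class Utf8ReaderSketch {

    // Decodes a UTF-8 byte stream into a String by going through a Reader.
    static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, "UTF-8")) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };
        String text = readUtf8(new ByteArrayInputStream(utf8));
        System.out.println(text.length());                       // 1
        System.out.println(Integer.toHexString(text.charAt(0))); // e9
    }
}
```

In your case the InputStream would presumably be the one returned by getBinaryStream("VALUE").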

Always use XYZReader classes for text input and XYZWriter classes for text output; the InputStream and OutputStream classes are meant for binary data.
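
For the output direction, a minimal sketch along the same lines (again with hypothetical names, and writing to an in-memory buffer purely so the resulting bytes can be inspected):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper for illustration; not part of the original code.
final class Utf8WriterSketch {

    // Encodes a String as UTF-8 bytes by going through a Writer.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, "UTF-8")) {
            writer.write(text);
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and so flushed) by this point.
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : writeUtf8("\u00e9")) {
            System.out.printf("%02x ", b & 0xff); // c3 a9
        }
        System.out.println();
    }
}
```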

java.sun.com has tutorials on Internationalization etc. that I recommend.
See also http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

Your code takes UTF-8 bytes, sign-extends them to 16 bits, casts the result to unsigned 16-bit values, and then treats these mangled values as if they were UTF-16 code units.

Let's take this line by line to see what happens:

Jain, Pankaj (MED, TCS) wrote:

Here is my code..

while(rsResult.next())
{
/*Get the file contents from the value column*/
ipStream = rsResult.getBinaryStream("VALUE");

This is the source of the problem. You read the input as binary instead of as UTF-8 text.

strBuf = new StringBuffer();
while((chunk = ipStream.read())!=-1)
{
byte byChunk = new Integer(chunk).byteValue();

Now you get one byte at a time. In Java, byte is a signed type, so 0x80..0xff are actually negative values: 0x80=-128 .. 0xff=-1.
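
A small sketch of just that narrowing step; the class and method names are hypothetical, and the plain (byte) cast is used here because it yields the same value as new Integer(chunk).byteValue():

```java
// Hypothetical demonstration class; not part of the original code.
final class SignedByteSketch {

    // Narrows the int returned by InputStream.read() to a signed byte.
    static byte toByte(int chunk) {
        return (byte) chunk;
    }

    public static void main(String[] args) {
        System.out.println(toByte(0x7f)); // 127
        System.out.println(toByte(0x80)); // -128
        System.out.println(toByte(0xff)); // -1
    }
}
```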

strBuf.append((char) byChunk);

This sign-extends the byte and then casts it to an unsigned 16-bit value (a Java char is 16 bits wide). For example, 0x80 is -128 as a signed byte; after sign extension and the cast to char it becomes 0xff80. You then append this mangled value to a StringBuffer, which treats it as a UTF-16 code unit.

}
prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
}
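
To make the effect of the loop concrete, here is a small sketch that applies the same per-byte cast to the two UTF-8 bytes of U+00E9 and compares the result with a proper decode. The class and method names are hypothetical, and new String(bytes, charset) is used for the comparison only because the sample is already a byte array; for a stream, a Reader is the better fit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; not part of the original code.
final class CharCastSketch {

    // Reproduces the quoted loop: each byte is cast straight to char.
    static String castEachByte(byte[] utf8) {
        StringBuilder text = new StringBuilder();
        for (byte b : utf8) {
            // Sign extension turns 0x80..0xff into 0xff80..0xffff.
            text.append((char) b);
        }
        return text.toString();
    }

    // Decodes the same bytes as UTF-8 text instead.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 }; // UTF-8 for U+00E9
        String broken = castEachByte(utf8);
        System.out.println(Integer.toHexString(broken.charAt(0)));       // ffc3
        System.out.println(Integer.toHexString(broken.charAt(1)));       // ffa9
        System.out.println(Integer.toHexString(decode(utf8).charAt(0))); // e9
    }
}
```

The cast produces two bogus code units where the decode produces the single character that was stored.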

markus
