Re: Unicode character transformation through XSLT
Nooo - Java's old UTF functions do not process UTF-8! They are there for String serialization, a Java-internal format. Use the Java Reader/Writer classes instead of these old ones!

See the Java tutorials on Internationalization:
http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html
http://java.sun.com/docs/books/tutorial/i18n/text/index.html
http://java.sun.com/docs/books/tutorial/i18n/index.html

See the descriptions of the readUTF() functions (highlighting with ***):

http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF(java.io.DataInput)
Reads from the stream in a representation of a Unicode character string encoded in ***Java modified UTF-8*** format; this string of characters is then returned as a String. The details of the ***modified UTF-8*** representation are exactly the same as for the readUTF method of DataInput.

http://java.sun.com/j2se/1.4/docs/api/java/io/DataInput.html#readUTF()

Java's *modified* UTF-8 in its UTF functions resembles CESU-8, and writes U+ with two bytes instead of one, as far as I remember.

markus

Yung-Fong Tang wrote:
what is rsResult? Blob? you probably need to use BufferedInputStream and DataInputStream to pipe the InputStream and use readChar or readUTF in the InputStream interface instead. See http://www.webdeveloper.com/java/java_jj_read_write.html and http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF() for more info.

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
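The difference between standard UTF-8 and the format written by the writeUTF/readUTF functions can be checked with a small sketch like the one below. It assumes Java 7 or later (for StandardCharsets and try-with-resources), and the class and method names are made up for illustration. The character written "with two bytes instead of one" is presumably U+0000, which is what the sketch tests.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    /** Bytes written by DataOutputStream.writeUTF, minus its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Samples: U+0000, U+2013 (EN DASH), U+1D11E (a supplementary character).
        String[] samples = {"\0", "\u2013", "\uD834\uDD1E"};
        for (String s : samples) {
            System.out.println("standard: " + hex(s.getBytes(StandardCharsets.UTF_8))
                    + "   modified: " + hex(modifiedUtf8(s)));
        }
    }
}
```

For U+2013 both forms are E2 80 93. They differ for U+0000 (00 versus C0 80) and for supplementary characters, which the modified form writes as two 3-byte surrogate encodings (ED A0 B4 ED B4 9E instead of F0 9D 84 9E), much like CESU-8.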
Re: Unicode character transformation through XSLT
Jain, Pankaj (MED, TCS) wrote:
I modified my program as per your suggestion (modified to byChunk&127),

Sorry, I was much too hasty with my reply. First of all, I should have written byChunk&255. And secondly, solutions like the one Markus proposes are much better thought out. My apologies.

Pim Blokland
Re: Unicode character transformation through XSLT
I have not touched Java for years (probably 5 years) ... so, I could be wrong.

Jain, Pankaj (MED, TCS) wrote:
Hi ftang/james.. thanks for the detailed explanation. Now I know the root cause of my error. I have the following string in the database, stored as a Long, in which the special character (?) is equivalent to ndash (-):

    E8C ? 6 to 10

I am using the following code to write the string from the database to a property file, and in the property file I am getting the following string:

    value= E8C \uFFE2\uFF80\uFF93 6 to 10

As \uFFE2\uFF80\uFF93 is not equivalent to ndash, I am not able to figure out why it ends up in the property file. Do we need to specify an encoding such as utf-8 in my Java program? Please let me know where the problem is. Here is my code..

    while(rsResult.next()) {
        /*Get the file contents from the value column*/
        ipStream = rsResult.getBinaryStream("VALUE");

what is rsResult? Blob? you probably need to use BufferedInputStream and DataInputStream to pipe the InputStream and use readChar or readUTF in the InputStream interface instead. See http://www.webdeveloper.com/java/java_jj_read_write.html and http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF() for more info.

        strBuf = new StringBuffer();
        while((chunk = ipStream.read())!=-1) {
            byte byChunk = new Integer(chunk).byteValue();
            strBuf.append((char) byChunk);
        }

Here is your problem: you read it in byte by byte. Each byte of the UTF-8 sequence is read as a Byte and turned into a Java Char on its own, instead of the whole sequence being decoded into one Char.

        prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
    }
    /*Write to o/p stream*/
    //opFile = new FileOutputStream(strFileName+".properties");
    opFile = new FileOutputStream(strFileName);
    /*Store the Properties files*/
    prop.store(opFile, "Resource Bundle created from Database View "+vctView.get(i));

Thanks
-Pankaj
Re: Unicode character transformation through XSLT
Pim Blokland wrote:
As I understand it, char is a signed 16 bits type in Java; any of the others may be unsigned. Hence the problem.

Char is *unsigned*; all the other integer types (byte, short, int, long) are always signed.

--
May the hair on your toes never fall out!
        --Thorin Oakenshield (to Bilbo)
John Cowan [EMAIL PROTECTED]
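A quick way to confirm the signedness of the Java primitive types is to print their ranges. This is only a sketch; the class name SignednessDemo and its helper method are made up for illustration.

```java
class SignednessDemo {
    /** Converts a byte to char the way a plain (char) cast does and returns the code unit. */
    static int castByteToChar(byte b) {
        // The byte is sign-extended to int first, then narrowed to an unsigned 16-bit char.
        return (char) b;
    }

    public static void main(String[] args) {
        System.out.println("byte:  " + Byte.MIN_VALUE + " .. " + Byte.MAX_VALUE);
        System.out.println("short: " + Short.MIN_VALUE + " .. " + Short.MAX_VALUE);
        System.out.println("char:  " + (int) Character.MIN_VALUE + " .. " + (int) Character.MAX_VALUE);
        System.out.printf("(char)(byte)0x80 = 0x%04X%n", castByteToChar((byte) 0x80));
    }
}
```

The char range comes out as 0 .. 65535, while byte and short have negative minimums. The last line prints 0xFF80, the same kind of value that shows up in the corrupted property file.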
Re: Unicode character transformation through XSLT
Generally, try instantiating an InputStreamReader or similar from your input, with an explicit encoding=UTF8. That will perform the conversion from UTF-8 to the internal 16-bit Unicode that Java processes. Always use XYZReader classes for text input and XYZWriter classes for text output.

java.sun.com has tutorials on Internationalization etc. that I recommend. See also http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

Your code takes UTF-8 byte values, mis-casts them to signed then unsigned 16-bit values and re-interprets these mistreated UTF-8 byte values as if they were 16-bit UTF-16 code units. Let's take this line by line to see what happens:

Jain, Pankaj (MED, TCS) wrote:
Here is my code..

    while(rsResult.next()) {
        /*Get the file contents from the value column*/
        ipStream = rsResult.getBinaryStream("VALUE");

This is the source of the problem. You read the input as binary instead of as UTF-8 text.

        strBuf = new StringBuffer();
        while((chunk = ipStream.read())!=-1) {
            byte byChunk = new Integer(chunk).byteValue();

Now you get one byte at a time. In Java, byte is a signed type, so 0x80..0xff are actually negative values: 0x80=-128 .. 0xff=-1.

            strBuf.append((char) byChunk);

This widens the signed integer value to 16 bits and then casts it to an unsigned 16-bit unit (Java char is 16 bits wide). 0x80 became negative (-128), was widened to 16 bits and cast to unsigned, which is 0xff80. You append this mistreated value to a StringBuffer which reinterprets it as a UTF-16 code unit.

        }
        prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
    }

markus
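The Reader-based approach recommended above might look roughly like the following sketch. It assumes Java 7 or later (StandardCharsets; on 1.4 the charset name "UTF-8" would be passed as a string instead), and it uses a ByteArrayInputStream as a stand-in for the stream returned by rsResult.getBinaryStream("VALUE"). The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8StreamReading {
    /** Decodes the whole stream as UTF-8 text instead of casting single bytes to char. */
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the database value: "E8C <en dash> 6 to 10" stored as UTF-8 bytes.
        byte[] stored = {'E', '8', 'C', ' ', (byte) 0xE2, (byte) 0x80, (byte) 0x93,
                         ' ', '6', ' ', 't', 'o', ' ', '1', '0'};
        String value = readUtf8(new ByteArrayInputStream(stored));
        System.out.println(value.length());                          // 13 chars from 15 bytes
        System.out.println(Integer.toHexString(value.charAt(4)));    // 2013
    }
}
```

The three bytes E2 80 93 come out as the single char U+2013, so the value passed to prop.setProperty would contain a real en dash rather than three sign-extended code units.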
Re: Unicode character transformation through XSLT
Kenneth Whistler wrote:
Unicode character (\uFFE2\uFF80\uFF93) ... What you are actually looking for is the UTF-8 sequence: 0xE2 0x80 0x93

The 8-bit UTF-8 bytes E2 80 93 (all with the most significant bit set) get *sign-extended* to 16 bits, producing FFE2 FF80 FF93.

It should suffice in a UTF-8 string literal to rewrite this as \xE2\x80\x93. Otherwise, find out where the 16-bit-widening/sign-extension occurs.

markus
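The sign extension described here can be reproduced, and undone, with a few lines of Java. This is a sketch under the assumption that the damaged string holds exactly one char per original UTF-8 byte; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class SignExtensionDemo {
    /** Reproduces the bug: each UTF-8 byte is cast to char on its own (sign-extended). */
    static String castEachByte(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Masking avoids the sign extension, but the bytes are still not decoded as UTF-8. */
    static String maskEachByte(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    /** Undoes the damage: keep the low 8 bits of each char, then decode as UTF-8. */
    static String repair(String damaged) {
        byte[] bytes = new byte[damaged.length()];
        for (int i = 0; i < damaged.length(); i++) {
            bytes[i] = (byte) damaged.charAt(i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] enDashUtf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        String damaged = castEachByte(enDashUtf8);
        for (char c : damaged.toCharArray()) {
            System.out.printf("%04X ", (int) c); // FFE2 FF80 FF93
        }
        System.out.println();
        System.out.printf("%04X%n", (int) repair(damaged).charAt(0)); // 2013
    }
}
```

Masking with 0xFF yields U+00E2 U+0080 U+0093 rather than FFE2 FF80 FF93, which removes the sign extension but still leaves three chars where one en dash was meant. Decoding the bytes as UTF-8 is the step that actually produces U+2013.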
RE: Unicode character transformation through XSLT
James, thanks, it's working for me now. But I still have a doubt about why \uFFE2\uFF80\uFF93 is giving ndash in HTML. If you have any information on this, then please let me know.

Thanks
-Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT

.
Pankaj Jain wrote,
My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93) from resource bundle property file which is equivalent to ndash(-) and its

U+2013 is the ndash (–). It is represented in UTF-8 by three hex bytes: E2 80 93.

But, \uFFE2 is fullwidth not sign, \uFF80 is half width katakana letter ta and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with \u2013 ?

Best regards,
James Kass
.
Re: Unicode character transformation through XSLT
Jain, Pankaj (MED, TCS) wrote:
But still I have a doubt that why \uFFE2\uFF80\uFF93 is giving ndash in html.

In html? No way! Html can't interpret series of hex bytes. Try &ndash; or &#8211;.

Pim Blokland
Re: Unicode character transformation through XSLT
Because the following processing got applied to your Unicode data:

1. The \u escapes are converted to Unicode: \uFFE2\uFF80\uFF93 becomes three Unicode characters, U+FFE2, U+FF80, U+FF93. This is OK.
2. A "throw away the high 8 bits" step got applied, so it became the 3 bytes E2 80 93.
3. Some code then treats these bytes as UTF-8 and tries to convert them to UCS2 again, so
   E2 = 1110 0010 and the rightmost 4 bits, 0010, are used for UCS2
   80 = 1000 0000 and the rightmost 6 bits, 00 0000, are used for UCS2
   93 = 1001 0011 and the rightmost 6 bits, 01 0011, are used for UCS2
   [0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
   U+2013 is EN DASH.

So... in your code there is something very, very bad which will corrupt your data. Steps 2 and 3 are very bad. You probably need to find out where they are and remove that code.

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Probably your Java code has one or two of the bugs listed in my paper.

Jain, Pankaj (MED, TCS) wrote:
James, thanks, it's working for me now. But I still have a doubt about why \uFFE2\uFF80\uFF93 is giving ndash in HTML. If you have any information on this, then please let me know.

Thanks
-Pankaj
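The bit arithmetic in steps 2 and 3 can be written out as a short Java sketch. It hard-codes the 3-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) and is meant only to illustrate the calculation, not to serve as a general decoder; the class and method names are made up.

```java
class Utf8ThreeByteDecode {
    /** Step 2: throw away the high 8 bits of a 16-bit code unit. */
    static int lowByte(int codeUnit) {
        return codeUnit & 0xFF;
    }

    /** Step 3: combine the payload bits of a 3-byte UTF-8 sequence into one code point. */
    static int decodeThreeBytes(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12)   // lead byte: rightmost 4 bits
             | ((b2 & 0x3F) << 6)    // first trail byte: rightmost 6 bits
             | (b3 & 0x3F);          // second trail byte: rightmost 6 bits
    }

    public static void main(String[] args) {
        int b1 = lowByte(0xFFE2); // 0xE2
        int b2 = lowByte(0xFF80); // 0x80
        int b3 = lowByte(0xFF93); // 0x93
        System.out.printf("%02X %02X %02X -> U+%04X%n", b1, b2, b3, decodeThreeBytes(b1, b2, b3));
    }
}
```

With the values from this thread the program prints E2 80 93 -> U+2013, matching the hand calculation above.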
Unicode character transformation through XSLT
Hi

My problem is that I am getting the Unicode character (\uFFE2\uFF80\uFF93) from a resource bundle property file, which is equivalent to ndash (-). It works fine in HTML and XML, but during transformation through XSLT it cannot be interpreted, and hence I am getting ??? instead of ndash.

But if I pass ndash from the resource bundle and declare <!DOCTYPE xsl:stylesheet [<!ENTITY ndash "&#8211;">]> in the XML, then I am able to see the proper output. In XML I am using UTF-8 encoding.

So let me know how I can use the Unicode character (\uFFE2\uFF80\uFF93) in XSL to resolve my issue, because I will get only the Unicode character from the property file. Please help in this area and let me know how to implement the above.

Thanks

Regards,
Pankaj Jain.
GE Medical System, Waukesha, WI-53186
Contact no- 1 (262) 547 0363
Re: Unicode character transformation through XSLT
.
Pankaj Jain wrote,
My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93) from resource bundle property file which is equivalent to ndash(-) and its

U+2013 is the ndash (–). It is represented in UTF-8 by three hex bytes: E2 80 93.

But, \uFFE2 is fullwidth not sign, \uFF80 is half width katakana letter ta and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with \u2013 ?

Best regards,
James Kass
.
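To see what the suggested \u2013 replacement does in a property file, a sketch along these lines may help. It assumes Java 7 or later and uses java.util.Properties with an in-memory reader and stream; the key name "value" is taken from the property line quoted in this thread, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEscapeDemo {
    /** Loads one key=value line and returns the value stored under "value". */
    static String loadValue(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("value");
    }

    /** Stores a single property and reports whether the written file contains the given text. */
    static boolean storedFormContains(String value, String expected) {
        Properties p = new Properties();
        p.setProperty("value", value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            p.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1).contains(expected);
    }

    public static void main(String[] args) {
        String enDash = String.valueOf((char) 0x2013);
        // In the file, the escape is a backslash, the letter u, and four hex digits.
        String line = "value=E8C \\u2013 6 to 10";
        System.out.println(loadValue(line).equals("E8C " + enDash + " 6 to 10"));
        System.out.println(storedFormContains("E8C " + enDash + " 6 to 10", line));
    }
}
```

Both checks print true: Properties reads the escape as the single character U+2013 and writes that character back as the same escape, so a correctly decoded en dash ends up as one escape in the file rather than three.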
Re: Unicode character transformation through XSLT
Well, I can't diagnose exactly what is going wrong, but

Unicode character (\uFFE2\uFF80\uFF93)

is a sequence of a full-width not sign, followed by a half-width katakana ta and a half-width katakana mo.

What you are actually looking for is the UTF-8 sequence:

0xE2 0x80 0x93

which is the UTF-8 equivalent of U+2013 EN DASH (and of <!ENTITY ndash "&#8211;">).

It appears that something in the way you (or the code you are using) get Unicode characters from the resource bundle is incorrectly converting 0xE2 --> 0xFFE2, and so on.

--Ken

Hi

My problem is that I am getting the Unicode character (\uFFE2\uFF80\uFF93) from a resource bundle property file, which is equivalent to ndash (-). It works fine in HTML and XML, but during transformation through XSLT it cannot be interpreted, and hence I am getting ??? instead of ndash.

But if I pass ndash from the resource bundle and declare <!DOCTYPE xsl:stylesheet [<!ENTITY ndash "&#8211;">]> in the XML, then I am able to see the proper output. In XML I am using UTF-8 encoding.

So let me know how I can use the Unicode character (\uFFE2\uFF80\uFF93) in XSL to resolve my issue, because I will get only the Unicode character from the property file.