This happens because the following sequence of conversions got applied to your Unicode data:

1. The \u escapes are converted to Unicode:
\uFFE2\uFF80\uFF93
becomes three Unicode characters:
U+FFE2, U+FF80, U+FF93
This step is OK.
2. Some code that throws away the high 8 bits of each character got applied, so
the data became 3 bytes:
E2 80 93

3. Some other code then treats those bytes as UTF-8 and tries to convert them to UCS-2 again, so

E2 = 1110 0010 and the rightmost 4 bits, 0010, are used for UCS-2
80 = 1000 0000 and the rightmost 6 bits, 00 0000, are used for UCS-2
93 = 1001 0011 and the rightmost 6 bits, 01 0011, are used for UCS-2

[0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 0x2013
U+2013 is EN DASH
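
The three steps above can be reproduced in a few lines. This is only a sketch in Python (your code is Java, and the actual calls that do steps 2 and 3 are not known); the variable names are made up for illustration.

```python
# Step 1: the \u escapes decode to three Unicode characters.
text = "\uFFE2\uFF80\uFF93"

# Step 2 (the assumed bug): keep only the low 8 bits of each character.
low_bytes = bytes(ord(ch) & 0xFF for ch in text)

# Step 3 (the assumed second bug): decode those bytes as UTF-8.
decoded = low_bytes.decode("utf-8")

# The same result by hand: 4 payload bits from the lead byte,
# 6 payload bits from each of the two continuation bytes.
by_hand = ((0xE2 & 0x0F) << 12) | ((0x80 & 0x3F) << 6) | (0x93 & 0x3F)

print(" ".join(f"{b:02X}" for b in low_bytes))  # E2 80 93
print(f"U+{ord(decoded):04X}")                  # U+2013
print(f"U+{by_hand:04X}")                       # U+2013
```

Skipping steps 2 and 3 leaves the three original characters untouched, which is why they have to be found and removed rather than worked around.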

So somewhere in your code there is something very bad that corrupts your data.
Steps 2 and 3 are the bad ones. You probably need to find out where they happen and remove that code.

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Your Java code probably has one or two of the bugs listed in my paper.

Jain, Pankaj (MED, TCS) wrote:
James,
Thanks, it's working for me now.
But I still don't understand why \uFFE2\uFF80\uFF93 gives an ndash in
HTML.
If you have any information on this, then please let me know.

Thanks
-Pankaj

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT


Pankaj Jain wrote,

  
My problem is that I am getting the Unicode characters (\uFFE2\uFF80\uFF93)
from a resource bundle property file, which is equivalent to ndash (-), and
its
    

U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But \uFFE2 is the fullwidth not sign,
\uFF80 is the halfwidth katakana letter ta,
and \uFF93 is the halfwidth katakana letter mo.
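
These names, and the UTF-8 bytes for U+2013, can be checked against the Unicode character database. A small sketch using Python's standard unicodedata module (Python is used here only for illustration, it is not part of the original setup):

```python
import unicodedata

# Official Unicode names of the three characters from the property file
# and of the character that was actually wanted.
for ch in "\uFFE2\uFF80\uFF93\u2013":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# UTF-8 encoding of U+2013, shown as hex bytes.
en_dash_utf8 = "\u2013".encode("utf-8")
print(" ".join(f"{b:02X}" for b in en_dash_utf8))  # E2 80 93
```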

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass