Re: Unicode character transformation through XSLT

2003-03-14 Thread Markus Scherer
Nooo - Java's old UTF functions do not process UTF-8! They are there for String serialization, a 
Java-internal format.
Use the Java Reader/Writer classes instead of these old ones!

See the Java tutorials on Internationalization:
http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html
http://java.sun.com/docs/books/tutorial/i18n/text/index.html
http://java.sun.com/docs/books/tutorial/i18n/index.html
See the descriptions of readUTF() functions (highlighting with ***):

http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF(java.io.DataInput)

Reads from the stream in a representation of a Unicode character string encoded in ***Java modified 
UTF-8*** format; this string of characters is then returned as a String. The details of the 
***modified UTF-8*** representation are exactly the same as for the readUTF  method of DataInput.

http://java.sun.com/j2se/1.4/docs/api/java/io/DataInput.html#readUTF()

Java's *modified* UTF-8 in its UTF functions resembles CESU-8, and writes U+0000 with two bytes 
instead of one, as far as I remember.

markus
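
A minimal sketch of the difference described above, assuming a Java 7 or later runtime (StandardCharsets and try-with-resources did not exist in J2SE 1.4). The class and method names are made up for illustration. It compares the bytes that DataOutputStream.writeUTF emits with the bytes a real UTF-8 encoder emits:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    /** Bytes written by DataOutputStream.writeUTF, including its 2-byte length prefix. */
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        // Modified UTF-8: length prefix 00 02, then U+0000 as the two bytes C0 80.
        System.out.println("writeUTF:       " + hex(writeUtf(nul)));
        // Standard UTF-8: U+0000 is the single byte 00.
        System.out.println("standard UTF-8: " + hex(nul.getBytes(StandardCharsets.UTF_8)));

        // A supplementary character is written as two 3-byte surrogates by writeUTF,
        // but as one 4-byte sequence by standard UTF-8.
        String supplementary = new String(Character.toChars(0x10400));
        System.out.println("writeUTF:       " + hex(writeUtf(supplementary)));
        System.out.println("standard UTF-8: " + hex(supplementary.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For text that contains neither U+0000 nor supplementary characters, the two encodings differ only by the length prefix, which is why the mismatch is easy to overlook.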

Yung-Fong Tang wrote:
 what is rsResult? Blob?
 you probably need to use BufferedInputStream and DataInputStream to pipe the InputStream
 and use readChar or readUTF on the DataInputStream instead.
 See http://www.webdeveloper.com/java/java_jj_read_write.html and
 http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF()
 for more info.
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



Re: Unicode character transformation through XSLT

2003-03-13 Thread Pim Blokland
Jain, Pankaj (MED, TCS) schreef:

 I modified my program as per your suggestion(modified to
byChunk127) ,

Sorry, I was much too hasty with my reply. First of all, I should
have written byChunk255. And secondly, solutions like the one
Markus proposes are much better thought out.
My apologies.

Pim Blokland





Re: Unicode character transformation through XSLT

2003-03-13 Thread Yung-Fong Tang




I have not touched Java for years (probably 5 years) ... so I could be wrong.


Jain, Pankaj (MED, TCS) wrote:
 
 
  
  
   
  
 
   
  Hi ftang/james,

  Thanks for the detailed explanation. I now understand the root of my problem.

  The following string is stored in the database as a Long, in which the
  special character (?) is equivalent to ndash (-):

  E8C ? 6 to 10

  I am using the following code to write the string from the database to a
  property file, and in the property file I get the following string:

  value= E8C \uFFE2\uFF80\uFF93 6 to 10

  As \uFFE2\uFF80\uFF93 is not equivalent to ndash, I cannot figure out
  why it appears in the property file.

  Do we need to specify an encoding such as UTF-8 in my Java program?

  Please let me know where the problem is.

  Here is my code:

  while(rsResult.next())
  {
  /*Get the file contents from the value column*/
  ipStream = rsResult.getBinaryStream("VALUE");
  

what is rsResult? Blob?
you probably need to use BufferedInputStream and DataInputStream to pipe the InputStream
and use readChar or readUTF on the DataInputStream instead.
See http://www.webdeveloper.com/java/java_jj_read_write.html and 
http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF()
for more info.



   
  strBuf = new StringBuffer();
  while((chunk = ipStream.read())!=-1)
  {
  byte byChunk = new Integer(chunk).byteValue();
  strBuf.append((char) byChunk);
  }
  

Here is your problem: you read the data byte by byte. Each byte of the UTF-8
sequence is read as a separate byte and turned into a char on its own, instead
of the whole sequence being decoded into one Java char.


   
  prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
  }
  /*Write to o/p stream*/
  //opFile = new FileOutputStream(strFileName+".properties");
  opFile = new FileOutputStream(strFileName);
  /*Store the Properties files*/
  prop.store(opFile, "Resource Bundle created from Database View "+vctView.get(i));
 


   
  
  
  Thanks
  -Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 11, 2003 6:09 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT



Because the following code got applied to your Unicode data:

1. Convert \u escapes to Unicode:
\uFFE2\uFF80\uFF93
becomes three Unicode characters:
U+FFE2, U+FF80, U+FF93
This is ok.
2. Some "throw away the high 8 bits" code got applied to your data, so
it became 3 bytes:
E2 80 93

3. And some code treats it as UTF-8 and tries to convert it to UCS2 again,
so

E2 = 1110 0010 and the right most 4 bits 0010 will be used for UCS2
80 = 1000 0000 and the right most 6 bits 00 0000 will be used for UCS2
93 = 1001 0011 and the right most 6 bits 01 0011 will be used for UCS2

[0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
U+2013 is EN DASH

So... in your code there is something very, very bad which will corrupt
your data.
Steps 2 and 3 are very bad. You probably need to find out where they are
and remove that code.

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Probably your Java code has one or two of the bugs listed in my paper.


Jain, Pankaj (MED, TCS) wrote:

  James,
thanks, it's working for me now.
But I still wonder why \uFFE2\uFF80\uFF93 is giving ndash in
html.
If you have any information on this, then please let me know.

Thanks
-Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT


.
Pankaj Jain wrote,

  
 
  
My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
from resource bundle property file which is equivalent to ndash(-) and
its 

  
  
U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But, \uFFE2 is fullwidth not sign
\uFF80 is half width katakana letter ta
and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass
.

  


  






Re: Unicode character transformation through XSLT

2003-03-12 Thread John Cowan
Pim Blokland scripsit:

 As I understand it, char is a signed 16 bits type in Java; any of
 the others may be unsigned. Hence the problem. 

Char is *unsigned*, all the others are always signed.

-- 
May the hair on your toes never fall out! John Cowan
--Thorin Oakenshield (to Bilbo) [EMAIL PROTECTED]



Re: Unicode character transformation through XSLT

2003-03-12 Thread Markus Scherer
Generally, try instantiating an InputStreamReader or similar from your input, with an explicit 
encoding=UTF8. That will perform the conversion from UTF-8 to the internal 16-bit Unicode that 
Java processes.

Always use XYZReader classes for text input and XYZWriter classes for text output.
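
A minimal sketch of that advice, assuming Java 7 or later for StandardCharsets. The helper name readUtf8 is hypothetical, and a ByteArrayInputStream stands in for the stream that rsResult.getBinaryStream("VALUE") returns in the original code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8StreamReaderDemo {
    /** Decodes the whole stream as UTF-8 text instead of casting single bytes to char. */
    static String readUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "E8C ", then the UTF-8 bytes E2 80 93 for U+2013, then " 6".
        byte[] stored = {'E', '8', 'C', ' ', (byte) 0xE2, (byte) 0x80, (byte) 0x93, ' ', '6'};
        String text = readUtf8(new ByteArrayInputStream(stored));
        // The three bytes decode to one char, U+2013.
        System.out.println(Integer.toHexString(text.charAt(4)));
    }
}
```

The Reader does the multi-byte decoding, so the application code never handles individual UTF-8 bytes.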

java.sun.com has tutorials on Internationalization etc. that I recommend.
See also http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
Your code takes UTF-8 byte values, mis-casts them to signed then unsigned 16-bit values and 
re-interprets these mistreated UTF-8 byte values as if they were 16-bit UTF-16 code units.

Let's take this line by line to see what happens:

Jain, Pankaj (MED, TCS) wrote:
 Here is my code..

 while(rsResult.next())
 {
 /*Get the file contents from the value column*/
 ipStream = rsResult.getBinaryStream("VALUE");

This is the source of the problem. You read the input as binary instead of as UTF-8 text.

 strBuf = new StringBuffer();
 while((chunk = ipStream.read())!=-1)
 {
 byte byChunk = new Integer(chunk).byteValue();

Now you get one byte at a time. In Java, byte is a signed type, so 0x80..0xff are actually negative 
values: 0x80=-128 .. 0xff=-1.

 strBuf.append((char) byChunk);

This widens the signed integer value to 16 bits and then casts it to an unsigned 16-bit unit (Java 
char is 16 bits wide). 0x80 became negative (-128), was widened to 16 bits and cast to unsigned, 
which is 0xff80. You append this mistreated value to a StringBuffer which reinterprets it as a 
UTF-16 code unit.

 }
 prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
 }
markus




Re: Unicode character transformation through XSLT

2003-03-11 Thread Markus Scherer
Kenneth Whistler wrote:
Unicode character (\uFFE2\uFF80\uFF93)
 ...
What you are actually looking for is the UTF-8 sequence:

0xE2 0x80 0x93
The 8-bit UTF-8 bytes E2 80 93 (all with the most significant bit set) get *sign-extended* to 16 
bits, producing FFE2 FF80 FF93. It should suffice in a UTF-8 string literal to rewrite this as 
\xE2\x80\x93. Otherwise, find out where the 16-bit-widening/sign-extension occurs.
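
The sign extension can be reproduced in a few lines. This is only an illustrative sketch with made-up names; it shows the cast that produces FFE2 FF80 FF93 and the masked cast that keeps the raw byte values:

```java
class SignExtensionDemo {
    /** Reproduces the bug: the signed byte is sign-extended before it becomes a char. */
    static char castSigned(byte b) {
        return (char) b;
    }

    /** Masks to the range 0..255 first, so the char holds the raw byte value. */
    static char castUnsigned(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        for (byte b : utf8) {
            System.out.println(String.format("%02X: (char) b = %04X, (char) (b & 0xFF) = %04X",
                    b & 0xFF, (int) castSigned(b), (int) castUnsigned(b)));
        }
    }
}
```

Masking only avoids the sign extension. The three chars 00E2 0080 0093 are still undecoded UTF-8 bytes, so the text must still be run through a UTF-8 decoder to obtain U+2013.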

markus




RE: Unicode character transformation through XSLT

2003-03-11 Thread Jain, Pankaj (MED, TCS)
James,
thanks, it's working for me now.
But I still wonder why \uFFE2\uFF80\uFF93 is giving ndash in
html.
If you have any information on this, then please let me know.

Thanks
-Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT


.
Pankaj Jain wrote,

 My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
 from resource bundle property file which is equivalent to ndash(-) and
 its 

U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But, \uFFE2 is fullwidth not sign
\uFF80 is half width katakana letter ta
and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass
.



Re: Unicode character transformation through XSLT

2003-03-11 Thread Pim Blokland
Jain, Pankaj (MED, TCS) schreef:

 But still I have a doubt that why \uFFE2\uFF80\uFF93 is giving
ndash in
 html.

In html? No way! Html can't interpret series of hex bytes. Try
&ndash; or &#8211;.

Pim Blokland
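
As a rough sketch of how such a numeric character reference could be generated on the Java side (the class and method names are hypothetical, and the helper deliberately does not escape markup characters such as & or <):

```java
class NumericCharacterReference {
    /** Formats one code point as a decimal numeric character reference. */
    static String toCharRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Replaces every non-ASCII code point with its numeric character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append(toCharRef(cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2013 is 8211 in decimal.
        System.out.println(escapeNonAscii("E8C \u2013 6 to 10"));
    }
}
```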





Re: Unicode character transformation through XSLT

2003-03-11 Thread Yung-Fong Tang





Because the following code got applied to your Unicode data:

1. Convert \u escapes to Unicode:
\uFFE2\uFF80\uFF93
becomes three Unicode characters:
U+FFE2, U+FF80, U+FF93
This is ok.
2. Some "throw away the high 8 bits" code got applied to your data, so
it became 3 bytes:
E2 80 93

3. And some code treats it as UTF-8 and tries to convert it to UCS2 again, so

E2 = 1110 0010 and the right most 4 bits 0010 will be used for UCS2
80 = 1000 0000 and the right most 6 bits 00 0000 will be used for UCS2
93 = 1001 0011 and the right most 6 bits 01 0011 will be used for UCS2

[0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
U+2013 is EN DASH
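
The bit arithmetic in steps 2 and 3 can be traced with a short sketch. The names are invented for illustration, and the decoder handles only the three-byte UTF-8 form discussed here:

```java
class ThreeStepTrace {
    /** Step 2: keep only the low 8 bits of each 16-bit char. */
    static int[] dropHighBits(char[] chars) {
        int[] bytes = new int[chars.length];
        for (int i = 0; i < chars.length; i++) {
            bytes[i] = chars[i] & 0xFF;
        }
        return bytes;
    }

    /**
     * Step 3: decode one three-byte UTF-8 sequence. The lead byte 1110xxxx
     * contributes 4 bits and each trail byte 10xxxxxx contributes 6 bits.
     */
    static int decodeThreeByte(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        char[] fromPropertyFile = {'\uFFE2', '\uFF80', '\uFF93'};
        int[] bytes = dropHighBits(fromPropertyFile);
        int codePoint = decodeThreeByte(bytes[0], bytes[1], bytes[2]);
        System.out.println(String.format("%02X %02X %02X -> U+%04X",
                bytes[0], bytes[1], bytes[2], codePoint));
    }
}
```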

So... in your code there is something very, very bad which will corrupt your
data.
Steps 2 and 3 are very bad. You probably need to find out where they are and
remove that code.

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Probably your Java code has one or two of the bugs listed in my paper.

Jain, Pankaj (MED, TCS) wrote:

  James,
thanks, it's working for me now.
But I still wonder why \uFFE2\uFF80\uFF93 is giving ndash in
html.
If you have any information on this, then please let me know.

Thanks
-Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT


.
Pankaj Jain wrote,

  
  
My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
from resource bundle property file which is equivalent to ndash(-) and
its 

  
  
U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But, \uFFE2 is fullwidth not sign
\uFF80 is half width katakana letter ta
and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass
.

  






Unicode character transformation through XSLT

2003-03-10 Thread Jain, Pankaj (MED, TCS)





Hi,

My problem is that I am getting the Unicode character (\uFFE2\uFF80\uFF93) from a
resource bundle property file, which is equivalent to ndash (-). It works fine in
HTML and XML, but during transformation through XSLT it is not interpreted, and
hence I am getting ??? instead of ndash.
But if I pass ndash from the resource bundle and declare
<!DOCTYPE xsl:stylesheet [<!ENTITY ndash "&#8211;">]> in the XML, then I am able
to see proper output.
In XML I am using UTF-8 encoding.
So please let me know how I can use the Unicode character (\uFFE2\uFF80\uFF93) in
XSL to resolve my issue, because I will get only the Unicode character from the
property file.

Please help in this area and let me know how to implement the above.

Thanks & Regards,
Pankaj Jain.
GE Medical System,
Waukesha, WI-53186

Contact no- 1 (262) 547 0363



Re: Unicode character transformation through XSLT

2003-03-10 Thread jameskass
.
Pankaj Jain wrote,

 My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
 from resource bundle property file which is equivalent to ndash(-) and
 its 

U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But, \uFFE2 is fullwidth not sign
\uFF80 is half width katakana letter ta
and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?
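
A small sketch of what that replacement looks like from the Java side, assuming Java 6 or later for Properties.load(Reader). The key name "value" is taken from the property line quoted in this thread; the class and helper names are invented:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    /** Parses property-file text and returns the value stored under the given key. */
    static String loadValue(String fileContent, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The doubled backslashes put a literal backslash-u escape into the file text.
        String good = loadValue("value=E8C \\u2013 6 to 10", "value");
        String bad = loadValue("value=E8C \\uFFE2\\uFF80\\uFF93 6 to 10", "value");

        // One char (U+2013) in the first case, three unrelated chars in the second.
        System.out.println(Integer.toHexString(good.charAt(4)) + ", length " + good.length());
        System.out.println(Integer.toHexString(bad.charAt(4)) + ", length " + bad.length());
    }
}
```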

Best regards,

James Kass
.



Re: Unicode character transformation through XSLT

2003-03-10 Thread Kenneth Whistler
Well, I can't diagnose exactly what is going wrong, but

Unicode character (\uFFE2\uFF80\uFF93)

is a sequence of a full-width not sign, followed by a
half-width katakana ta and a half-width katakana mo.
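
The character names can be checked against the runtime's own Unicode data. This is a minimal sketch assuming Java 7 or later, where Character.getName is available; the class and method names are illustrative only:

```java
class CharacterNames {
    /** Formats a code point as "U+XXXX NAME" using the JDK's Unicode name table. */
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        int[] codePoints = {0xFFE2, 0xFF80, 0xFF93, 0x2013};
        for (int cp : codePoints) {
            System.out.println(describe(cp));
        }
    }
}
```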

What you are actually looking for is the UTF-8 sequence:

0xE2 0x80 0x93

which is the UTF-8 equivalent of U+2013 EN DASH. (and of
<!ENTITY ndash "&#8211;">)

It appears that something in the way you (or the code you
are using) is getting Unicode characters from the
resource bundle is incorrectly converting 0xE2 --> 0xFFE2,
and so on.

--Ken

 Hi
 My problem is that I am getting the Unicode character (\uFFE2\uFF80\uFF93)
 from a resource bundle property file, which is equivalent to ndash (-). It
 works fine in HTML and XML, but during transformation through XSLT it is
 not interpreted, and hence I am getting ??? instead of ndash.
 But if I pass ndash from the resource bundle and declare
 <!DOCTYPE xsl:stylesheet [<!ENTITY ndash "&#8211;">]> in the XML, then I am
 able to see proper output.
 In XML I am using UTF-8 encoding.
 So please let me know how I can use the Unicode character
 (\uFFE2\uFF80\uFF93) in XSL to resolve my issue, because I will get only
 the Unicode character from the property file.