RE: UTF-8 problem with Xerces-J2

Voytenko, Dimitry 4 Sep 2003 23:54:45 -0000

Hi,

The fact that you created the String with UTF-8 doesn't mean much. A String is
always Unicode internally (the encoding passed to the constructor only specifies
how to convert bytes to Unicode characters). So this leaves the question of how
each Unicode character in the String is converted to UTF-8 inside JDBC. I don't
know what JDBC does in this particular case, but I have never heard of problems
here. All I can suggest is to create a String (just as a constant value)
containing characters that take 2 and 3 bytes in UTF-8. Write it to the DB using
PreparedStatement.setString() and then read it back using ResultSet.getString().
If the result differs from the source value, the problem is in Oracle and you
should go to Oracle's mailing list.
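
A minimal sketch of the round-trip test suggested above. The table name
utf8_test and column txt are hypothetical, and survivesRoundTrip() needs a live
JDBC connection, so only the byte-length check runs standalone. The sample
string is an assumption: Zhe (U+0416) takes 2 bytes in UTF-8 and the euro sign
(U+20AC) takes 3.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class RoundTripCheck {
    // One 2-byte and one 3-byte character (as counted in UTF-8).
    static final String SAMPLE = "\u0416\u20AC";

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Writes SAMPLE with setString() and reads it back with getString().
    // Table and column names are hypothetical placeholders.
    static boolean survivesRoundTrip(Connection con) throws SQLException {
        PreparedStatement ins =
                con.prepareStatement("INSERT INTO utf8_test (txt) VALUES (?)");
        ins.setString(1, SAMPLE);
        ins.executeUpdate();
        ins.close();

        PreparedStatement sel = con.prepareStatement("SELECT txt FROM utf8_test");
        ResultSet rs = sel.executeQuery();
        boolean same = rs.next() && SAMPLE.equals(rs.getString(1));
        rs.close();
        sel.close();
        return same;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(SAMPLE)); // prints 5
    }
}
```

If survivesRoundTrip() returns false against the real database, the loss happens
somewhere between the driver and the column rather than in the parser.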


Thanks,
Dimitry

-----Original Message-----
From: Ravi Varanasi [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 04, 2003 16:45
To: [EMAIL PROTECTED]
Cc: Voytenko, Dimitry
Subject: RE: UTF-8 problem with Xerces-J2







Hi Dimitry,
      Thanks for the reply. Our Oracle 9i database is set up with UTF-8
encoding. Also, in the code snippet below, I have created the string with UTF-8
encoding (in statement 1). Any workaround?

Thanks,

Ravi Varanasi
408 517 7675


"Voytenko, Dimitry" <[EMAIL PROTECTED]>
09/04/2003 04:36 PM
Please respond to xerces-j-user

To:       <[EMAIL PROTECTED]>
cc:       <[EMAIL PROTECTED]>
Subject:  RE: UTF-8 problem with Xerces-J2




Hi Ravi,

There's a known issue in Oracle JDBC. When the RDBMS encoding is ASCII or
ISO-Latin-1 (and some others), JDBC takes only the lower byte of each
character and passes it to the DB. They call it "Unicode-ASCII conversion
optimization". Thus the two Unicode characters 0x2041 and 0x3041 both mean the
same character 0x41 to Oracle JDBC. This behaviour persists across the 8i and
9i versions. Two workarounds are possible here:
- convert the string to bytes yourself (in your target encoding)
- use binary types: BLOB, etc.
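
To illustrate the collision and the first workaround, here is a small sketch.
The lowByte() helper only imitates the truncation described above; it is not
Oracle's code, and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LowByteDemo {
    // Keeps only the low 8 bits of a char, imitating the described truncation.
    static int lowByte(char c) {
        return c & 0xFF;
    }

    // First workaround: do the conversion to the target encoding yourself.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Both characters collapse to 0x41 once the high byte is dropped.
        System.out.println(lowByte('\u2041') == lowByte('\u3041')); // prints true

        // Encoded explicitly, they stay distinct: E2 81 81 versus E3 81 81.
        byte[] a = toUtf8("\u2041");
        byte[] b = toUtf8("\u3041");
        System.out.println(Arrays.equals(a, b)); // prints false
    }
}
```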

Thanks,
Dimitry

-----Original Message-----
From: Ravi Varanasi [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 04, 2003 16:26
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: UTF-8 problem with Xerces-J2







Thanks for all the inputs. I could isolate the issue to the way data is
written to the database. Following is the code snippet for updating the
database.

1) Database column type is Oracle LONG (LONGVARCHAR).
2) I am using OraclePreparedStatement with the Oracle OCI driver.

longDesc is the String I created out of the XML data in the characters
call-back method.
I have created a temp String with UTF-8 encoding and am using that stream to
insert into the db in lines 1 & 2. Still, the data is inserted as some junk in
the database.
To verify whether any data is lost during the conversion (while creating
myStr), I wrote the contents to a file, and the data looks good in the output
file (out1.xml).

=========================================================================

        1 String myStr = new String(longDesc.getBytes(), "UTF-8");
        2 ostmt.setAsciiStream(7, new StringBufferInputStream(myStr),
                  myStr.length() );
        3
        4 OutputStreamWriter ostream = new OutputStreamWriter( new
                  FileOutputStream("/export/home/atgifce/DR/out1.xml"), "UTF-8" );
        5 ostream.write(myStr, 0, myStr.length() );
        6 ostream.close();

=========================================================================
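
One thing worth checking in the snippet above: java.io.StringBufferInputStream
is documented to use only the low eight bits of each character, so anything
outside Latin-1 is mangled before the driver even sees it. The sketch below
compares it with a ByteArrayInputStream over explicitly encoded UTF-8 bytes.
Class and method names are made up for the example. Whether the Oracle driver
then stores those bytes correctly in a LONG column is not something this sketch
can verify.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringBufferInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamBytesDemo {
    // Reads every byte from the stream and renders it as two hex digits.
    static String hex(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                sb.append(String.format("%02X", b));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // What the snippet above hands to the driver: low bytes of each char.
    @SuppressWarnings("deprecation")
    static String viaStringBufferInputStream(String s) {
        return hex(new StringBufferInputStream(s));
    }

    // Alternative: encode to UTF-8 first, then stream the resulting bytes.
    static String viaUtf8Bytes(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return hex(new ByteArrayInputStream(utf8));
    }

    public static void main(String[] args) {
        String zhe = "\u0416";
        System.out.println(viaStringBufferInputStream(zhe)); // prints 16
        System.out.println(viaUtf8Bytes(zhe));               // prints D096
    }
}
```

Note also that when streaming encoded bytes, the length argument would be the
byte count (utf8.length), not the character count from String.length().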


Thanks for all the help !

PS: I have tried to write as BinaryStream as well. It did not work either.


Ravi Varanasi
408 517 7675




mohammedi padaria <[EMAIL PROTECTED]>
09/04/2003 01:30 PM
Please respond to xerces-j-user

To:       [EMAIL PROTECTED]
cc:
Subject:  Re: UTF-8 problem with Xerces-J2





Also make sure that you are not using FileReader to get your input stream.
It reads the data as characters rather than bytes, so the decoding happens
before the parser sees it. If you are using FileReader, change it to
FileInputStream.
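
As a rough sketch of the byte-stream approach: the parser is handed a
FileInputStream and works out the encoding from the XML declaration itself. The
sample document and the class name are made up for the example, and the JDK's
built-in SAX parser stands in for Xerces here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamParseDemo {
    // Parses the file from a raw byte stream and returns all character data.
    static String parseText(File xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate.
                text.append(ch, start, length);
            }
        };
        try (InputStream in = new FileInputStream(xml)) {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(in), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    // Writes a tiny UTF-8 document (made-up sample data) to a temp file.
    static File writeSample() {
        try {
            File f = File.createTempFile("sample", ".xml");
            f.deleteOnExit();
            String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><d>\u0416</d>";
            try (OutputStream out = new FileOutputStream(f)) {
                out.write(doc.getBytes(StandardCharsets.UTF_8));
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = parseText(writeSample());
        System.out.println(text.length());                  // prints 1
        System.out.println((int) text.charAt(0) == 0x0416); // prints true
    }
}
```

The two bytes on disk come back as one Java char, which is the point made
elsewhere in this thread: what the handler receives is Unicode, not UTF-8.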


Jeffrey Rodriguez <[EMAIL PROTECTED]> wrote:
 Hi Ravi,
 Your InputSource is OK; you should not have to set the encoding, since the
 parser would autodetect that your data is UTF-8.

 >4) Coming to the important question, do I convert the data to UTF-8 ? Answer
 >is NO. The apache documentation says that the encoding is "retained" when a
 >ByteStream is passed in to the parse method (as InputSource). So, the char

 Where did you read this?

 >array I get in characters call-back method must have char data encoded in
 >UTF-8. Is it not correct ? Since the parser does not guarantee that entire

 What the handler gets back from a characters event is an array of
 char (Java char).


 >char data is sent in a single call back method call, I am constructing a
 >String using the char array. And the String constructor does not take the
 >encoding parameter. Is there any other way I can get String with UTF-8

 Of course it doesn't, since Strings in Java are Unicode.

 >encoding ? I can not use byte array because if I convert char[] to byte[],
 >there is a good possibility of data loss. Following is the code in my
 >characters call back method. So, in essence, I am assuming that the char[]
 >I get has UTF-8 data. Please suggest if it is not correct ! !
 >

 I don't think this is correct, but I will let others correct me. I think
 that your array of char is just that, an array of Unicode characters
 representing your UTF-8 data byte stream.

 What do you call to save your data to Oracle? JDBC and an array of bytes?
 Does this call require a stream?

 You will then have to use an OutputStreamWriter with the encoding name set
 to UTF-8, and a ByteArrayOutputStream. Or, just as easily, take the char[]
 which you collect in your stack, create one instance of String and use the
 String.getBytes("UTF8") method to get a byte[] that you can then store in
 Oracle.
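
 Both routes mentioned above can be sketched as follows. The class and method
 names are made up for the example, and StandardCharsets.UTF_8 is used in place
 of the "UTF8" name string, which assumes a newer JDK than the 1.3.1 in this
 thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodeDemo {
    // Route 1: OutputStreamWriter over a ByteArrayOutputStream.
    static byte[] viaWriter(char[] ch, int start, int length) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(ch, start, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) before the bytes are read.
        return buf.toByteArray();
    }

    // Route 2: build one String and ask it for its UTF-8 bytes.
    static byte[] viaString(char[] ch, int start, int length) {
        return new String(ch, start, length).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        char[] data = "x\u0416y".toCharArray();
        byte[] a = viaWriter(data, 0, data.length);
        byte[] b = viaString(data, 0, data.length);
        System.out.println(Arrays.equals(a, b)); // prints true
        System.out.println(a.length);            // prints 4
    }
}
```

 The String route is shorter; the writer route avoids building one large String
 when the data arrives in many characters() calls.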


 >Please note that elementStack is a data structure I am using to store some
 >data. Please ignore it.

 Yes...

 Hope this helps,


 >From: Ravi Varanasi
 >Reply-To: [EMAIL PROTECTED]
 >To: [EMAIL PROTECTED]
 >CC: [EMAIL PROTECTED]
 >Subject: Re: UTF-8 problem with Xerces-J2
 >Date: Thu, 4 Sep 2003 11:30:36 -0700
 >
 >
 >
 >
 >
 >
 >Hi Jeffrey,
 > Thanks for the reply. Following are the answers to your questions :-
 >
 >1) XML doc has encoding defined as UTF-8. This is the first statement in the
 >XML file :
 ><?xml version="1.0" encoding="UTF-8"?>
 >
 >2) I am using InputSource with UTF-8 encoding set. Code snippet:
 >
 > InputSource ipSource = new InputSource();
 > ipSource.setEncoding("UTF-8");
 > ipSource.setByteStream( new FileInputStream( new File(inputFile) ) );
 > parser.parse(ipSource);
 >
 >3) Oracle totally supports UTF-8. I stored some UTF-8 data before ( using
 >SQL scripts) and it worked fine.
 >
 >4) Coming to the important question, do I convert the data to UTF-8 ? Answer
 >is NO. The apache documentation says that the encoding is "retained" when a
 >ByteStream is passed in to the parse method (as InputSource). So, the char
 >array I get in characters call-back method must have char data encoded in
 >UTF-8. Is it not correct ? Since the parser does not guarantee that entire
 >char data is sent in a single call back method call, I am constructing a
 >String using the char array. And the String constructor does not take the
 >encoding parameter. Is there any other way I can get String with UTF-8
 >encoding ? I can not use byte array because if I convert char[] to byte[],
 >there is a good possibility of data loss. Following is the code in my
 >characters call back method. So, in essence, I am assuming that the char[]
 >I get has UTF-8 data. Please suggest if it is not correct ! !
 >
 >Please note that elementStack is a data structure I am using to store some
 >data. Please ignore it.
 >

>---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


 >
 > public void characters(char ch[], int start, int length)
 >         throws SAXException {
 >     String currData = new String(ch, start, length);
 >
 >     if (elementStack != null) {
 >         XMLElement currElement = (XMLElement) elementStack.peek();
 >         currElement.appendData(currData.trim());
 >     }
 > }
 >

>---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


 >
 >
 >Thanks for the help,
 >
 >Ravi Varanasi
 >408 517 7675
 >
 >
 >
 >"Jeffrey Rodriguez" <[EMAIL PROTECTED]>
 >09/04/2003 11:02 AM
 >Please respond to xerces-j-user
 >
 >To: [EMAIL PROTECTED]
 >cc:
 >Subject: Re: UTF-8 problem with Xerces-J2


 >
 >
 >
 >
 > >
 > >Hi,
 > > I am trying to parse a UTF-8 encoded document (which has lots of
 > >UTF-8 characters) using Xerces SAX parser. I am running this program on Sun
 >
 >So what does your encodingDecl look like if any in your document?
 >
 >There is no problem with Xerces J parsing UTF8 data.
 >
 > >Solaris box with JDK 1.3.1_05. I save the data in XML (after parsing) to
 >
 >What do you mean by "save the data", how? Remember that the parser will get
 >you back the data as "Java" char (aka Unicode, UTF16). Do you transcode the
 >data back into UTF-8 or does Oracle do that?
 >
 > >Oracle Database (which has UTF-8 encoding ). When I try to display the
 > >content in HTML after retrieving it from the database, I see some weird
 > >characters. Can anyone suggest the reason ?
 >
 >If you got UTF16 data back from the parser and stored that into a UTF8
 >database, I think that would be problematic if you don't convert to UTF8.
 >UTF8 is a multibyte encoding, and UTF16 uses two-byte units. UTF8 and UTF16
 >agree from U+0000 to U+007F (more correctly said, Unicode code points within
 >that range map to the same values in UTF-8), therefore some values map
 >correctly if they are stored directly into a UTF8 data repository, but data
 >outside this range will not.
 >
 > >
 > >
 > > I am assuming that the UTF-8 format is supported by Xerces. I have
 >
 >Good assumption, since an XML parser must be able to read both UTF-8 and
 >UTF-16 documents.
 >
 >
 > > created an InputSource for the XML file and am using it as the parameter
 > > for the parse method.
 >
 >Did you use the InputSource and provide an encoding?
 >
 > > I am using OraclePreparedStatement because the column in which data is
 > > stored is LONG. Do I need to do anything specific to let Oracle know
 > > that it is UTF-8 data ?
 >
 >You said that Oracle stores data as UTF-8, right? You should ask this
 >question in an Oracle discussion group just to be sure.
 >
 > > The encoding is specified properly in Jsp using both JSP Page param &
 > > HTML meta directive.
 > >
 >
 >Yes, try to see first that your data is correctly stored as UTF-8 in Oracle.
 >
 >To test this, pick a character that takes more than one byte in UTF-8, like
 >the Russian letter Zhe (U+0416). In UTF-8 it is the byte pair D0 96; that
 >should be the value stored in the UTF-8 database.
 >The value that the parser will give you back is the single char U+0416 (why?
 >Because Java chars are Unicode...).
 >
 >Hope this helps,
 >
 > Jeffrey Rodriguez
 > Silicon Valley
 >
 > >
 > >Any help is appreciated.
 > >
 > >Thanks in advance.
 > >
 > >Ravi Varanasi
 > >408 517 7675
 > >
 > >
 > >---------------------------------------------------------------------
 > >To unsubscribe, e-mail: [EMAIL PROTECTED]
 > >For additional commands, e-mail: [EMAIL PROTECTED]
 > >
