RE: utf-8 characters problem

Kahovec, Jakub 28 Feb 2005 23:21:10 -0000

It is produced by Xerces 2.6.2 (LSParser, LSSerializer and XMLSerializer).
I've been using the Xerces parser and serializer in my Java authoring
tool to load and save documents. I noticed the encoding problem when I
loaded an XML document containing characters in character reference form
and displayed it in the jeditpanel component. Instead of &#x002b; and
&#x2062; I saw '+' and a square-like character.

I tried serializing the XML document to the console as well as to a file,
and loading the document through LSInput from both an InputStream and a
Reader, but I never got a result in which the character sequences kept
their original form.

Only when I explicitly set the encoding in LSInput to ISO-8859-1 and
loaded the document via an InputStream did the sequence &#x2062; stay in
the same form, but the sequence &#x002b; was changed to the '+' character
anyway.

Then I tried to inspect the structure of the DOM document in the debugger
(Eclipse 3.1) but saw the same results ('+' and a square character;
probably that is only a problem of displaying UTF-8 characters in Eclipse).

So to be honest I don't know how to find out where the problem is:
whether it occurs during parsing, serializing or displaying the data.
I'm not very experienced with encodings and charsets, but as far as I
know Java handles characters internally as UTF-16. Could that be part of
the problem? I don't really know.
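
For what it's worth, here is a minimal sketch of one way to get non-ASCII
characters back as character references on output: ask the serializer for
an ASCII output encoding. It assumes a current JDK with its bundled DOM
Level 3 Load and Save implementation rather than Xerces 2.6.2 itself, and
the class name AsciiRoundTrip and the method toAscii are made up for
illustration. The Load and Save specification says that characters which
cannot be represented in the output encoding are written as character
references, so U+2062 should come out as a numeric reference (hex or
decimal, depending on the implementation), while '+' is plain ASCII and
stays a literal '+'.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

public class AsciiRoundTrip {
    // Hypothetical helper: parse the XML text, then serialize it as US-ASCII.
    public static String toAscii(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Obtain the Load and Save factory from the DOM implementation.
        DOMImplementationLS ls = (DOMImplementationLS)
                doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();

        // Characters outside US-ASCII cannot be written directly, so the
        // serializer has to fall back to character references for them.
        LSOutput out = ls.createLSOutput();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        out.setByteStream(bytes);
        out.setEncoding("US-ASCII");
        serializer.write(doc, out);

        return bytes.toString("US-ASCII");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mrow><mo>&#x002b;</mo><mo>&#x2062;</mo></mrow>";
        System.out.println(toAscii(xml));
    }
}
```

Note that a reference such as &#x002b; is not kept by the DOM at all: after
parsing it is indistinguishable from a typed '+', so no encoding setting
will bring the reference form back.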


Thanks for any ideas.

Jakub


-----Original Message-----
From: Bob Foster [mailto:[EMAIL PROTECTED]
Sent: Mon 2/28/2005 10:36 PM
To: [EMAIL PROTECTED]
Subject: Re: utf-8 characters problem
 
Exactly what Xerces or standard API is producing this result? Are you 
sure you're not looking at the result in some editor (that is using the 
wrong code page to represent your characters)?

XML parsers deliver characters in Unicode. You are apparently trying to 
use the characters as though each character had eight bits.
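
A small sketch of what that means in practice, assuming a recent JDK and
the standard JAXP API (the class name CodePointDump and the helper
dumpOperators are invented for the example): instead of trusting how a
console, editor or debugger draws the characters, print their code points.
If the parser did its job, the two references from the sample document
show up as U+002B and U+2062, whatever glyph the display chooses for them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CodePointDump {
    // Hypothetical helper: returns the text of every <mo> element
    // as a space-separated list of U+XXXX code points.
    public static String dumpOperators(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList ops = doc.getElementsByTagName("mo");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ops.getLength(); i++) {
            String text = ops.item(i).getTextContent();
            // Iterate over code points, not bytes or UTF-16 units.
            text.codePoints()
                .forEach(cp -> out.append(String.format("U+%04X ", cp)));
        }
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mrow><mo>&#x002b;</mo><mo>&#x2062;</mo></mrow>";
        System.out.println(dumpOperators(xml)); // prints "U+002B U+2062"
    }
}
```

If the dump shows the expected code points, parsing is fine and the odd
square is a font or display issue rather than a Xerces one.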

Tell us a little more about what steps you took to see what you describe 
and maybe someone will be able to help.

Bob Foster

Jakub Kahovec wrote:
> Hi,
> when I parse an XML document (with Xerces 2.6.2) whose XML 
> declaration specifies UTF-8 encoding and which contains characters 
> in character reference form (&#xxxx;), 
> the parser replaces these references with ASCII characters. For some 
> characters that is OK, but InvisibleTimes, for instance, is changed into 
> some incorrect, strange character sequence.
> I'd like to know whether it is possible to prevent characters from being 
> changed out of character reference form, or whether there is some 
> recommendation on how to handle these characters.
> 
> Here is a piece of my 'problematic' xml document
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <mathDoc>
> 
> <p>Factorise the following quadratic expression:
>        <math>
>          <mrow>
>            <msup>
>              <mrow>
>            <mi>x</mi>
>              </mrow>
>              <mrow>
>            <mn>2</mn>
>              </mrow>
>            </msup>
>            <mo>&#x002b;</mo> <!-- replaces with character + -->
>            <mi>p</mi>
>            <mo>&#x2062;</mo>   <!-- here is InvisibleTimes -->
>                    <mi>x</mi>
>            <mo>&#x002b;</mo>  <!-- replaces with character + -->
>            <mi>q</mi>
>          </mrow>
>        </math>
> </p>
> 
> </mathDoc>
> 
> Thanks so much
> 
> Jakub



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
