RE: utf-8 characters problem

Michael Glavassevich 1 Mar 2005 19:40:16 -0000

The parser replaced the character reference by including [1][2] the 
character in its place when the document was read.  The serializer has no 
way of knowing what syntax was originally used.
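
To illustrate, here is a minimal sketch (not from the original thread) that
uses the JAXP DOM parser bundled with the JDK rather than a standalone
Xerces 2.6.2; the class and method names are hypothetical. It parses the
same element written with a hexadecimal reference, a decimal reference and
the literal character, and shows that the DOM holds the same one-character
string each time, so a serializer has nothing left to tell them apart.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are not from Xerces or the thread.
class CharRefExpansionDemo {

    // Parses the XML string and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String viaHexRef = rootText("<mo>&#x2062;</mo>");
        String viaDecRef = rootText("<mo>&#8290;</mo>");
        String viaLiteral = rootText("<mo>\u2062</mo>");

        System.out.println(viaHexRef.length());                             // 1
        System.out.println(Integer.toHexString(viaHexRef.codePointAt(0)));  // 2062
        System.out.println(viaHexRef.equals(viaDecRef)
                && viaHexRef.equals(viaLiteral));                           // true
    }
}
```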


[1] http://www.w3.org/TR/2004/REC-xml-20040204/#entproc
[2] http://www.w3.org/TR/2004/REC-xml-20040204/#included

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

"Kahovec, Jakub" <[EMAIL PROTECTED]> wrote on 03/01/2005 02:02:32 PM:

> 'The serializer will try to write any characters it can in the encoding 
> given to the output document.........'
> 
> So is this supposed to mean that even if I specify the symbol in character
> reference form (i.e. &#x2062; for InvisibleTimes) and then set the output
> encoding to UTF-8, the serializer will replace it?
> 
> 
> 
> -----Original Message-----
> From: Michael Glavassevich [mailto:[EMAIL PROTECTED]
> Sent: Tue 3/1/2005 6:42 PM
> To: [EMAIL PROTECTED]
> Subject: RE: utf-8 characters problem
> 
> The serializer will try to write any characters it can in the encoding 
> given to the output document. If a character has to be escaped either to
> make the document well-formed or because the character cannot be expressed
> in the output encoding, then the serializer will write it using the 
> predefined entities (such as 'amp' and 'lt') or character references.
> 
> You cannot control which characters are serialized as character 
> references.
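
A rough sketch of that behavior with the DOM Level 3 LSSerializer, assuming
the implementation bundled with the JDK (the class name and the sample
document are made up for illustration). The only knob used is the output
encoding: with UTF-8 every character can be written directly, while with
US-ASCII every non-ASCII character, not just the one you care about, should
come out as a numeric character reference. Whether the reference is written
in hexadecimal or decimal depends on the serializer implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are not from Xerces or the thread.
class EncodingDrivenEscapingDemo {

    // e-caron followed by INVISIBLE TIMES, both given as character references.
    static final String SAMPLE = "<mo>&#x11B;&#x2062;</mo>";

    // Parses the XML, serializes it in the given encoding, returns the result as text.
    static String roundTrip(String xml, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumption: the parser's DOM implementation also implements
            // DOMImplementationLS, as the JDK's built-in one does.
            DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
            LSSerializer serializer = ls.createLSSerializer();
            LSOutput output = ls.createLSOutput();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            output.setByteStream(bytes);
            output.setEncoding(encoding);
            serializer.write(doc, output);
            return bytes.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed for " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // Both characters are expected to be written directly.
        System.out.println(roundTrip(SAMPLE, "UTF-8"));
        // Both characters are expected to be written as character references.
        System.out.println(roundTrip(SAMPLE, "US-ASCII"));
    }
}
```

The choice is all-or-nothing per encoding, which is consistent with the
statement above: you pick the encoding, not the individual characters.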
> 
> There are many ways to express the same information in XML. Consider these
> five document fragments (assume that entities 'seven' and 'elemref' are 
> defined somewhere and have replacement text '7' and '<elem>7</elem>' 
> respectively):
> 
> 1) <elem>7</elem>
> 2) <elem><![CDATA[7]]></elem>
> 3) <elem>&#x37;</elem>
> 4) <elem>&seven;</elem>
> 5) &elemref;
> 
> Regardless of what syntax is used, we have one element named 'elem' whose
> content is '7'. They all convey the same information.
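
The five fragments can be checked with a short sketch, assuming the JAXP DOM
parser bundled with the JDK and its default settings (entity references
expanded). The wrapper element 'doc', the internal DTD subset and the class
name are additions for illustration: an entity reference cannot stand alone
at the top level of a document, so fragment 5 needs a parent element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are not from Xerces or the thread.
class EquivalentSyntaxDemo {

    // The five fragments, wrapped in one document that declares both entities.
    static final String XML =
            "<!DOCTYPE doc [\n"
          + "  <!ENTITY seven \"7\">\n"
          + "  <!ENTITY elemref \"<elem>7</elem>\">\n"
          + "]>\n"
          + "<doc>"
          + "<elem>7</elem>"               // 1) literal text
          + "<elem><![CDATA[7]]></elem>"   // 2) CDATA section
          + "<elem>&#x37;</elem>"          // 3) character reference
          + "<elem>&seven;</elem>"         // 4) entity with replacement text '7'
          + "&elemref;"                    // 5) entity containing the element
          + "</doc>";

    // Returns the text content of every 'elem' element in document order.
    static List<String> elemContents() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList elems = doc.getElementsByTagName("elem");
            List<String> contents = new ArrayList<>();
            for (int i = 0; i < elems.getLength(); i++) {
                contents.add(elems.item(i).getTextContent());
            }
            return contents;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elemContents()); // [7, 7, 7, 7, 7]
    }
}
```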
> 
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
> 
> "Kahovec, Jakub" <[EMAIL PROTECTED]> wrote on 03/01/2005 01:15:00 PM:
> 
> > '...the only difference between the two documents (in your example) 
> > will be that character references are expanded'
> > 
> > That's just the problem: I don't want the character references to be
> > expanded. I only want to get exactly the same XML output as the XML
> > input. Nothing more.
> > Is it possible to do that somehow?
> > 
> > Jakub
> > 
> > 
> > -----Original Message-----
> > From: Bob Foster [mailto:[EMAIL PROTECTED]
> > Sent: Tue 3/1/2005 3:16 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: utf-8 characters problem
> > 
> > If you read the file in UTF-8, parse it, serialize it without adding any
> > whitespace and write the result back out in UTF-8, the only difference
> > between the two documents (in your example) will be that character 
> > references are expanded.
> > 
> > The trouble arises when you don't specify the encoding on the way out.
> > Then Java will use whatever is set as the platform encoding, e.g.,
> > win1250.
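
A small sketch of the difference an explicit encoding makes, using only
java.io and java.nio (the class and method names are hypothetical, and
ISO-8859-1 stands in for a limited platform encoding such as win1250).
Characters the target charset cannot represent are replaced with '?' by the
Writer, which would explain question marks like the ones reported further
down in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from Xerces or the thread.
class ExplicitEncodingDemo {

    // Writes the text through a Writer bound to the given charset and
    // returns the bytes that would end up in the file.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // e-caron, s-caron and INVISIBLE TIMES
        String text = "\u011B\u0161\u2062";

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);

        // UTF-8 round-trips the text unchanged.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text)); // true
        // ISO-8859-1 has none of the three characters, so each becomes '?'.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));       // ???
    }
}
```

Passing the charset explicitly, instead of relying on a constructor that
picks up the platform default, keeps the output the same on every machine.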
> > 
> > What normal text editors do with a UTF-8 file is really outside the 
> > scope here. You have to use a competent editor.
> > 
> > Bob Foster
> > 
> > Jakub Kahovec wrote:
> > > I've been experimenting a bit with serializing and parsing (Java 1.4,
> > > Xerces 2.6.2, Windows XP) and here are the results I got.
> > > This is the input XML file:
> > > 
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>Д>ДTLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>e</ecaron>
> > > <scaron>s</scaron>
> > > <invisibleTimesHex>&#x2062;</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>&#x002a;</visibleTimes>
> > > <plus>&#x002b;</plus>
> > > </testEncoding>
> > > 
> > > After parsing and serializing from/to a file via a byte stream, I got
> > > this output:
> > > 
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>Д>ДTLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>Д></ecaron>
> > > <scaron>L?</scaron>
> > > <invisibleTimesHex>вЃ?</invisibleTimesHex>
> > > <invisibleTimeDec>вЃ?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > > 
> > > It seems to be pretty good; all characters are in UTF-8. The problem is
> > > with InvisibleTimes again. If one wants to edit it, it's just impossible
> > > because normal text editors show the sequence вЃ?, which nobody can
> > > understand.
> > > 
> > > 
> > > After parsing and serializing from/to a file via a char stream, I got
> > > this output:
> > > 
> > > <?xml version="1.0" encoding="UTF-16"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>&#xc4;>ДTLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>e</ecaron>
> > > <scaron>s</scaron>
> > > <invisibleTimesHex>?</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > > 
> > > It's completely useless: some of the chars are in the win1250 charset
> > > (ecaron and scaron), some of them are in UTF-8 (part of the
> > > <czechCharsInUTF8> tag), and some of them are just question marks
> > > (the invisibleTimes tags).
> > > 
> > > 
> > > These results leave me a bit confused about which method I should use
> > > to get the following result:
> > > 
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>Д>ДTLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>Д></ecaron>
> > > <scaron>L?</scaron>
> > > <invisibleTimesHex>&#x2062;</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > > 
> > > 
> > > 
> > > Bob Foster wrote:
> > > 
> > >> As others have suggested, the problem is in JEditorPane. You need to 
> > >> tell it to use a font that can display all of your characters. 
> > >> Unfortunately, that's platform-specific and I'm not much of a 
> > >> JEditorPane user (Eclipse/SWT for me), but somebody can probably help
> > >> you if you say what platform you're running on.
> > >>
> > >> Bob Foster
> > >>
> > >> Kahovec, Jakub wrote:
> > >>
> > >>> Xerces 2.6.2 produces it (LSParser, LSSerializer and XMLSerializer).
> > >>> I've been using the Xerces parser and serializer in my Java authoring
> > >>> tool to load and save documents. I found the encoding problem when I
> > >>> loaded and displayed the XML document (with chars in char. ref. form)
> > >>> in the JEditorPane component. Instead of &#x002b; and &#x2062; I saw
> > >>> '+' and a square-like character. I tried serializing the XML document
> > >>> to the console as well as to a file, and loading the document via
> > >>> InputStream or Reader input with LSInput, but I never got results
> > >>> where the char sequences were in their original form. Only when I
> > >>> explicitly set the encoding in LSInput to ISO-8859-1 and loaded it via
> > >>> InputStream did the char sequence &#x2062; stay in the same form, but
> > >>> the sequence &#x002b; was changed to the '+' character anyway.
> > >>> Then I tried to inspect the structure of the DOM document in the
> > >>> debugger (in Eclipse 3.1) but saw the same results (+ char and square
> > >>> char; probably it's only a problem of showing UTF-8 chars in Eclipse).
> > >>> So to be honest I don't know now how to find out where the problem is,
> > >>> whether it is during parsing, serializing or displaying the data. I'm
> > >>> not very experienced with encodings or charsets, but as far as I know
> > >>> Java handles chars internally in UTF-16; could that be part of the
> > >>> problem? I don't really know.
> > >>>
> > >>> Thanks for any ideas.
> > >>>
> > >>> Jakub
> > >>>
> > >>>
> > >>> -----Original Message-----
> > >>> From: Bob Foster [mailto:[EMAIL PROTECTED]
> > >>> Sent: Mon 2/28/2005 10:36 PM
> > >>> To: [EMAIL PROTECTED]
> > >>> Subject: Re: utf-8 characters problem
> > >>>
> > >>> Exactly what Xerces or standard API is producing this result? Are you
> > >>> sure you're not looking at the result in some editor (that is using
> > >>> the wrong code page to represent your characters)?
> > >>>
> > >>> XML parsers deliver characters in Unicode. You are apparently trying
> > >>> to use the characters as though each character had eight bits.
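
A brief sketch of why the eight-bit assumption breaks, using only the Java
standard library (the class and method names are hypothetical, and
ISO-8859-1 is just one example of a single-byte code page). InvisibleTimes
is one character in a Java String but three bytes in UTF-8; a tool that
reads those bytes as one character each shows three unrelated characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from Xerces or the thread.
class BytesVersusCharsDemo {

    // Returns the UTF-8 bytes of the text as space-separated hex pairs.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X ", b));
        }
        return hex.toString().trim();
    }

    // Simulates a viewer that wrongly treats each UTF-8 byte as one character.
    static String misreadAsSingleByte(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String invisibleTimes = "\u2062";

        System.out.println(invisibleTimes.length());                      // 1
        System.out.println(utf8Hex(invisibleTimes));                      // E2 81 A2
        System.out.println(misreadAsSingleByte(invisibleTimes).length()); // 3
    }
}
```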
> > >>>
> > >>> Tell us a little more about what steps you took to see what you 
> > >>> describe and maybe someone will be able to help.
> > >>>
> > >>> Bob Foster
> > >>>
> > >>> Jakub Kahovec wrote:
> > >>>
> > >>>> Hi,
> > >>>> when I parse an XML document (with Xerces 2.6.2) whose XML
> > >>>> declaration specifies UTF-8 encoding and which contains
> > >>>> characters in character reference form (&#xxxx;),
> > >>>> the parser replaces these references with ASCII characters. For
> > >>>> some characters that is OK, but InvisibleTimes, for instance, is
> > >>>> changed into some incorrect, strange character sequence.
> > >>>> I'd like to know whether it is possible to prevent characters from
> > >>>> being changed from char. ref. form. Or is there some recommendation
> > >>>> on how to handle these characters?
> > >>>>
> > >>>> Here is a piece of my 'problematic' xml document
> > >>>>
> > >>>> <?xml version="1.0" encoding="UTF-8"?>
> > >>>> <mathDoc>
> > >>>>
> > >>>> <p>Factorise the following quadratic expression:
> > >>>> <math>
> > >>>> <mrow>
> > >>>> <msup>
> > >>>> <mrow>
> > >>>> <mi>x</mi>
> > >>>> </mrow>
> > >>>> <mrow>
> > >>>> <mn>2</mn>
> > >>>> </mrow>
> > >>>> </msup>
> > >>>> <mo>&#x002b;</mo> <!-- replaces with character + -->
> > >>>> <mi>p</mi>
> > >>>> <mo>&#x2062;</mo> <!-- here is InvisibleTimes -->
> > >>>> <mi>x</mi>
> > >>>> <mo>&#x002b;</mo> <!-- replaces with character + -->
> > >>>> <mi>q</mi>
> > >>>> </mrow>
> > >>>> </math>
> > >>>>
> > >>>> </mathDoc>
> > >>>>
> > >>>> Thanks so much
> > >>>>
> > >>>> Jakub
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
