The parser replaced the character reference by including [1][2] the character in its place when the document was read. The serializer has no way of knowing what syntax was originally used.
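A minimal sketch of that behaviour, assuming the JAXP parser and the DOM Level 3 LSSerializer bundled with a current JDK rather than the Xerces 2.6.2 build discussed in this thread. The class and method names (CharRefRoundTrip, parse, serialize) are made up for illustration. It shows that three spellings of InvisibleTimes yield the same DOM text, and that the serializer chooses the output form from the target encoding alone.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Xerces or the JDK.
final class CharRefRoundTrip {

    // Parse an XML string into a DOM tree. Character references are
    // replaced by the characters they denote at this point.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize with an explicit output encoding instead of relying on
    // the platform default.
    static String serialize(Document doc, String encoding) {
        try {
            DOMImplementationLS ls = (DOMImplementationLS)
                    doc.getImplementation().getFeature("LS", "3.0");
            LSSerializer serializer = ls.createLSSerializer();
            LSOutput out = ls.createLSOutput();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            out.setByteStream(bytes);
            out.setEncoding(encoding);
            serializer.write(doc, out);
            return bytes.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String hex = "<mo>&#x2062;</mo>";     // hexadecimal character reference
        String dec = "<mo>&#8290;</mo>";      // decimal character reference
        String literal = "<mo>\u2062</mo>";   // the character itself
        for (String xml : new String[] {hex, dec, literal}) {
            String text = parse(xml).getDocumentElement().getTextContent();
            // The DOM holds the same single character in all three cases.
            System.out.println(Integer.toHexString(text.charAt(0)));
        }
        // Same DOM, two output encodings.
        System.out.println(serialize(parse(hex), "UTF-8"));
        System.out.println(serialize(parse(hex), "US-ASCII"));
    }
}
```

The loop prints 2062 for each of the three inputs, because the DOM no longer records how the character was written. With UTF-8 the serializer should emit the character itself; with US-ASCII it has to fall back to a numeric character reference, whose exact spelling (hex or decimal) depends on the serializer implementation.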
[1] http://www.w3.org/TR/2004/REC-xml-20040204/#entproc
[2] http://www.w3.org/TR/2004/REC-xml-20040204/#included

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

"Kahovec, Jakub" <[EMAIL PROTECTED]> wrote on 03/01/2005 02:02:32 PM:

> 'The serializer will try to write any characters it can in the encoding
> given to the output document.........'
>
> So is this supposed to mean that even if I specify the symbol in character
> reference form (i.e. &#x2062; for Invisible times) and then set the output
> encoding to UTF-8 the serializer will replace it?
>
> -----Original Message-----
> From: Michael Glavassevich [mailto:[EMAIL PROTECTED]
> Sent: Tue 3/1/2005 6:42 PM
> To: [EMAIL PROTECTED]
> Subject: RE: utf-8 characters problem
>
> The serializer will try to write any characters it can in the encoding
> given to the output document. If a character has to be escaped either to
> make the document well-formed or because the character cannot be expressed
> in the output encoding, then the serializer will write it using the
> predefined entities (such as 'amp' and 'lt') or character references.
>
> You cannot control which characters are serialized as character
> references.
>
> There are many ways to express the same information in XML. Consider these
> five document fragments (assume that entity 'seven' and 'elemref' are
> defined somewhere and have replacement text '7' and '<elem>7</elem>'
> respectively):
>
> 1) <elem>7</elem>
> 2) <elem><![CDATA[7]]></elem>
> 3) <elem>&#55;</elem>
> 4) <elem>&seven;</elem>
> 5) &elemref;
>
> Regardless of what syntax is used, we have one element named 'elem' whose
> content is '7'. They all convey the same information.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
>
> "Kahovec, Jakub" <[EMAIL PROTECTED]> wrote on 03/01/2005 01:15:00 PM:
>
> > '...the only difference between the two documents (in your example)
> > will be that character references are expanded'
> >
> > That's just the problem, I don't want the character references to be
> > expanded. I only want to get the same XML output as the XML input.
> > Nothing more.
> > Is it possible to do that somehow?
> >
> > Jakub
> >
> > -----Original Message-----
> > From: Bob Foster [mailto:[EMAIL PROTECTED]
> > Sent: Tue 3/1/2005 3:16 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: utf-8 characters problem
> >
> > If you read the file in UTF-8, parse it, serialize it without adding any
> > whitespace and write the result back out in UTF-8, the only difference
> > between the two documents (in your example) will be that character
> > references are expanded.
> >
> > The trouble arises when you don't specify the encoding on the way out.
> > Then Java will use whatever is set as the platform encoding, e.g.,
> > win1250.
> >
> > What normal text editors do with a UTF-8 file is really outside the
> > scope here. You have to use a competent editor.
> >
> > Bob Foster
> >
> > Jakub Kahovec wrote:
> > > I've been experimenting a bit with serializing and parsing (Java 1.4,
> > > Xerces 2.6.2, Windows XP) and here are the results I got.
> > > This is the input XML file:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>>TLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>e</ecaron>
> > > <scaron>s</scaron>
> > > <invisibleTimesHex>&#x2062;</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > >
> > > After parsing and serializing from/to a file via a byte stream I got
> > > this output:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>>TLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>></ecaron>
> > > <scaron>L?</scaron>
> > > <invisibleTimesHex>?</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > >
> > > It seems pretty good; all characters are in UTF-8. The problem is
> > > with InvisibleTimes again. If one wants to edit it, it's just
> > > impossible, because normal text editors show the sequence: ? which
> > > nobody can understand.
> > >
> > > After parsing and serializing from/to a file via a char stream I got
> > > this output:
> > >
> > > <?xml version="1.0" encoding="UTF-16"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>Ä>TLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>e</ecaron>
> > > <scaron>s</scaron>
> > > <invisibleTimesHex>?</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > >
> > > It's completely useless: some of the chars are in the win1250 charset
> > > (ecaron and scaron), some are in UTF-8 (part of the
> > > <czechCharsInUTF8> tag), and some are just question marks (the
> > > invisibleTimes tags).
> > >
> > > These results leave me a bit confused about which method I should use
> > > to get the following result:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <testEncoding>
> > > <czechCharsInUTF8>>TLTLlA?A?</czechCharsInUTF8>
> > > <ecaron>></ecaron>
> > > <scaron>L?</scaron>
> > > <invisibleTimesHex>&#x2062;</invisibleTimesHex>
> > > <invisibleTimeDec>?</invisibleTimeDec>
> > > <visibleTimes>*</visibleTimes>
> > > <plus>+</plus>
> > > </testEncoding>
> > >
> > > Bob Foster wrote:
> > >
> > >> As others have suggested, the problem is in JEditorPane. You need to
> > >> tell it to use a font that can display all of your characters.
> > >> Unfortunately, that's platform-specific and I'm not much of a
> > >> JEditorPane user (Eclipse/SWT for me), but somebody can probably help
> > >> you if you say what platform you're running on.
> > >>
> > >> Bob Foster
> > >>
> > >> Kahovec, Jakub wrote:
> > >>
> > >>> Xerces 2.6.2 produces it (LSParser, LSSerializer and XMLSerializer).
> > >>> I've been using the Xerces parser and serializer in my Java authoring
> > >>> tool to load and save documents. I found the encoding problem when I
> > >>> loaded and displayed the XML document (with characters in character
> > >>> reference form) in the JEditorPane component. Instead of + and
> > >>> &#x2062; I saw '+' and a square-like character. I tried to serialize
> > >>> the XML document to the console as well as to a file, and to load the
> > >>> document via InputStream or Reader input with LSInput, but I never got
> > >>> results where the character sequences were in their original form.
> > >>> Only when I explicitly set the encoding in LSInput to ISO-8859-1 and
> > >>> loaded it via InputStream did the sequence &#x2062; stay in the same
> > >>> form, but the sequence + was changed to the '+' character anyway.
> > >>> Then I tried to inspect the structure of the DOM document in the
> > >>> debugger (Eclipse 3.1) but saw the same results (+ char and square
> > >>> char; probably it's only a problem of displaying UTF-8 chars in
> > >>> Eclipse).
> > >>> So to be honest I don't know how to find out where the problem is:
> > >>> during parsing, serializing or displaying the data. I'm not very
> > >>> experienced with encodings or charsets, but as far as I know Java
> > >>> handles chars internally as UTF-16; could that be part of the
> > >>> problem? I don't really know.
> > >>>
> > >>> Thanks for any ideas.
> > >>>
> > >>> Jakub
> > >>>
> > >>> -----Original Message-----
> > >>> From: Bob Foster [mailto:[EMAIL PROTECTED]
> > >>> Sent: Mon 2/28/2005 10:36 PM
> > >>> To: [EMAIL PROTECTED]
> > >>> Subject: Re: utf-8 characters problem
> > >>>
> > >>> Exactly what Xerces or standard API is producing this result? Are you
> > >>> sure you're not looking at the result in some editor (that is using
> > >>> the wrong code page to represent your characters)?
> > >>>
> > >>> XML parsers deliver characters in Unicode. You are apparently trying
> > >>> to use the characters as though each character had eight bits.
> > >>>
> > >>> Tell us a little more about what steps you took to see what you
> > >>> describe and maybe someone will be able to help.
> > >>>
> > >>> Bob Foster
> > >>>
> > >>> Jakub Kahovec wrote:
> > >>>
> > >>>> Hi,
> > >>>> when I parse an XML document (with Xerces 2.6.2) which specifies
> > >>>> UTF-8 encoding in its XML declaration and which contains
> > >>>> characters in character reference form &#xxxx;
> > >>>> the parser replaces these references with ASCII characters. For some
> > >>>> characters that is OK, but InvisibleTimes, for instance, changes into
> > >>>> some incorrect, strange character sequence.
> > >>>> I'd like to know whether it is possible to prevent characters from
> > >>>> being converted out of character reference form?
> > >>>> Or is there some recommendation on how to handle these
> > >>>> characters?
> > >>>>
> > >>>> Here is a piece of my 'problematic' XML document:
> > >>>>
> > >>>> <?xml version="1.0" encoding="UTF-8"?>
> > >>>> <mathDoc>
> > >>>>
> > >>>> <p>Factorise the following quadratic expression:
> > >>>> <math>
> > >>>> <mrow>
> > >>>> <msup>
> > >>>> <mrow>
> > >>>> <mi>x</mi>
> > >>>> </mrow>
> > >>>> <mrow>
> > >>>> <mn>2</mn>
> > >>>> </mrow>
> > >>>> </msup>
> > >>>> <mo>+</mo> <!-- replaces with character + -->
> > >>>> <mi>p</mi>
> > >>>> <mo>&#x2062;</mo> <!-- here is InvisibleTimes -->
> > >>>> <mi>x</mi>
> > >>>> <mo>+</mo> <!-- replaces with character + -->
> > >>>> <mi>q</mi>
> > >>>> </mrow>
> > >>>> </math>
> > >>>>
> > >>>> </mathDoc>
> > >>>>
> > >>>> Thanks so much
> > >>>>
> > >>>> Jakub

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]