RE: utf-8 characters problem

F. Andy Seidl 2 Mar 2005 14:38:31 -0000

Jakub,
Your explanation makes sense; I can see that it is an issue.  But,
technically anyway, it is a client issue rather than a Xerces issue.  Short
of finding (or creating) an editing client that understands the entire UTF-8
character set, here are two ideas on how you might get around the problem:


1) Consider creating a custom serializer (or specializing an existing one)
that renders your DOMs to your specifications.  You could add code that
renders selected characters (like InvisibleTimes) as numeric entities.

2) Look into the possibility of replacing the writer used by the current
serializer with one that writes selected characters as numeric entities.
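
As a rough illustration of idea 2, here is a minimal sketch of such a writer.
It is not part of Xerces: the class name CharRefWriter and the choice of
characters to escape (the invisible operators U+2061 through U+2064) are
assumptions made for this example. It escapes every selected character it
sees, so it assumes those characters occur only in text content or attribute
values, never in CDATA sections, comments, or element names.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical helper, not a Xerces class: wraps another Writer. */
public class CharRefWriter extends FilterWriter {

    public CharRefWriter(Writer out) {
        super(out);
    }

    /** Assumption: only U+2061..U+2064 (InvisibleTimes and friends) are escaped. */
    static boolean mustEscape(char c) {
        return c >= '\u2061' && c <= '\u2064';
    }

    @Override
    public void write(int c) throws IOException {
        char ch = (char) c;
        if (mustEscape(ch)) {
            // Emit a hexadecimal character reference instead of the raw character.
            out.write("&#x" + Integer.toHexString(ch).toUpperCase() + ";");
        } else {
            out.write(ch);
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(buf[i]);
        }
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(s.charAt(i));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        try (Writer w = new CharRefWriter(sink)) {
            w.write("<mn>4</mn><mo>\u2062</mo><mi>x</mi>");
        }
        System.out.println(sink); // InvisibleTimes comes out as &#x2062;
    }
}
```

One could presumably hand a writer like this to the serializer through
LSOutput.setCharacterStream(), wrapping an OutputStreamWriter that was opened
with an explicit "UTF-8" encoding. I have not tested that against Xerces
2.6.2, so treat it as a starting point rather than a recipe.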

  -- fas
 F. Andy Seidl, Co-founder
MyST Technology Partners
 
 

-----Original Message-----
From: Jakub Kahovec [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 02, 2005 5:58 AM
To: [EMAIL PROTECTED]
Subject: Re: utf-8 characters problem

Let me explain why the syntactic form is relevant in my application.
My application is an authoring tool for creating multilingual XML-based
mathematics questions, which will contain presentation and content MathML
tags (in addition to other XML tags).
For instance, the expression x^2 + 4x + 5
has the corresponding MathML construct:

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msup>
      <mrow>
        <mi>x</mi>
      </mrow>
      <mrow>
        <mn>2</mn>
      </mrow>
    </msup>    
    <mo>+</mo>    
    <mn>4</mn>
    <mo>&#8290;</mo>
    <mi>x</mi>
    <mo>+</mo>
    <mn>5</mn>
  </mrow>
</math> 

Between 4 and x is the InvisibleTimes symbol (&#8290;).
However, when I parse it (in UTF-8 encoding, because of the multilingual
support), the parser replaces the character reference &#8290; with UTF-8
characters (something like вЃ?).
When I later serialize the document, I get these characters (вЃ?) as well.
But neither a browser (e.g. Mozilla with MathML support) nor any XML editor
recognizes these characters as InvisibleTimes, so one can neither view nor
edit it.
That is why I need to preserve the original character reference form of
InvisibleTimes (and other similar symbols).

Jakub


F. Andy Seidl wrote:

>I think the real issue is that when the parser transforms an XML document
>into a DOM, it actually *loses* some information--but not information that
>is "significant" according to XML semantics.  As Michael described earlier,
>several different syntactic forms express the identical meaning in terms of
>XML semantics.  Since each of the syntactic forms has the same (XML)
>meaning, and since the task of the parser is to create a DOM that accurately
>reflects that meaning, many different XML documents can express the exact
>same DOM.  Thus, XML -> DOM is a many-to-one relationship.  Or, put the other
>way, DOM -> XML presents a one-to-many choice, all of which are equally valid
>(in terms of XML semantics).
>
>The task of a serializer is to produce, from a DOM, an XML document that
>accurately reflects the content of the DOM.  Whether a specific character is
>expressed as UTF-8 encoding, Windows-1251, etc., is irrelevant (in terms of
>XML semantics).  It sounds like it is *not* irrelevant for *your*
>application.  If that is the case, then the choice of syntactic form is itself
>a type of metadata for your application that would need to be tracked by
>your application (but before going to the trouble of doing that, I would
>ask, why is the syntactic form relevant to your app?)
>
>  -- fas
>
>F. Andy Seidl, Co-founder
>MyST Technology Partners
>http://myst-technology.com | http://blogsite.com
> 
>
>-----Original Message-----
>From: Kahovec, Jakub [mailto:[EMAIL PROTECTED] 
>Sent: Tuesday, March 01, 2005 4:07 PM
>To: [EMAIL PROTECTED]
>Subject: RE: utf-8 characters problem
>
>And is it possible to tell the parser not to replace these character
>references?
>Maybe it's not according to the standard, but how else can I preserve the
>original form of a character reference in XML during parsing?
>
>
>-----Original Message-----
>From: Michael Glavassevich [mailto:[EMAIL PROTECTED]
>Sent: Tue 3/1/2005 7:40 PM
>To: [EMAIL PROTECTED]
>Subject: RE: utf-8 characters problem
> 
>The parser replaced the character reference by including [1][2] the 
>character in its place when the document was read.  The serializer has no 
>way of knowing what syntax was originally used.
>
>[1] http://www.w3.org/TR/2004/REC-xml-20040204/#entproc
>[2] http://www.w3.org/TR/2004/REC-xml-20040204/#included
>
>Michael Glavassevich
>XML Parser Development
>IBM Toronto Lab
>E-mail: [EMAIL PROTECTED]
>E-mail: [EMAIL PROTECTED]
>
>"Kahovec, Jakub" <[EMAIL PROTECTED]> wrote on 03/01/2005 02:02:32 PM:
>
>>'The serializer will try to write any characters it can in the encoding 
>>given to the output document.........'
>>
>>So is this supposed to mean that even if I specify the symbol in character 
>>reference form (i.e. &#x2062; for InvisibleTimes) and then set the output 
>>encoding to UTF-8, the serializer will replace it?
>>
>>
>>
>>-----Original Message-----
>>From: Michael Glavassevich [mailto:[EMAIL PROTECTED]
>>Sent: Tue 3/1/2005 6:42 PM
>>To: [EMAIL PROTECTED]
>>Subject: RE: utf-8 characters problem
>>
>>The serializer will try to write any characters it can in the encoding 
>>given to the output document. If a character has to be escaped either to 
>>make the document well-formed or because the character cannot be expressed 
>>in the output encoding, then the serializer will write it using the 
>>predefined entities (such as 'amp' and 'lt') or character references.
>>
>>You cannot control which characters are serialized as character 
>>references.
>>
>>There are many ways to express the same information in XML. Consider these 
>>five document fragments (assume that entities 'seven' and 'elemref' are 
>>defined somewhere and have replacement text '7' and '<elem>7</elem>' 
>>respectively):
>>
>>1) <elem>7</elem>
>>2) <elem><![CDATA[7]]></elem>
>>3) <elem>&#x37;</elem>
>>4) <elem>&seven;</elem>
>>5) &elemref;
>>
>>Regardless of what syntax is used, we have one element named 'elem' whose 
>>content is '7'. They all convey the same information.
>>
>>Michael Glavassevich
>>XML Parser Development
>>IBM Toronto Lab
>>E-mail: [EMAIL PROTECTED]
>>E-mail: [EMAIL PROTECTED]
>>
>>"Kahovec, Jakub" <[EMAIL PROTECTED]> wrote on 03/01/2005 01:15:00 PM:
>>
>>>'...the only difference between the two documents (in your example) 
>>>will be that character references are expanded'
>>>
>>>That's just the problem: I don't want the character references to be 
>>>expanded.
>>>I only want to get the same XML output as the XML input. 
>>>Nothing more.
>>>Is it possible to do that somehow?
>>>
>>>Jakub
>>>
>>>
>>>-----Original Message-----
>>>From: Bob Foster [mailto:[EMAIL PROTECTED]
>>>Sent: Tue 3/1/2005 3:16 PM
>>>To: [EMAIL PROTECTED]
>>>Subject: Re: utf-8 characters problem
>>>
>>>If you read the file in UTF-8, parse it, serialize it without adding any 
>>>whitespace and write the result back out in UTF-8, the only difference 
>>>between the two documents (in your example) will be that character 
>>>references are expanded.
>>>
>>>The trouble arises when you don't specify the encoding on the way out. 
>>>Then Java will use whatever is set as the platform encoding, e.g., 
>>>win1250.
>>>
>>>What normal text editors do with a UTF-8 file is really outside the 
>>>scope here. You have to use a competent editor.
>>>
>>>Bob Foster
>>>
>>>Jakub Kahovec wrote:
>>>      
>>>
>>>>I've been experimenting a bit with serializing and parsing (Java 1.4, 
>>>>Xerces 2.6.2, Windows XP) and here are the results I got.
>>>>This is the input XML file:
>>>>
>>>><?xml version="1.0" encoding="UTF-8"?>
>>>><testEncoding>
>>>><czechCharsInUTF8>Д>ДTLTLlA?A?</czechCharsInUTF8>
>>>><ecaron>e</ecaron>
>>>><scaron>s</scaron>
>>>><invisibleTimesHex>&#x2062;</invisibleTimesHex>
>>>><invisibleTimeDec>?</invisibleTimeDec>
>>>><visibleTimes>&#x002a;</visibleTimes>
>>>><plus>&#x002b;</plus>
>>>></testEncoding>
>>>>
>>>>After parsing and serializing from/to a file via a byte stream, I got 
>>>>this output:
>>>>
>>>><?xml version="1.0" encoding="UTF-8"?>
>>>><testEncoding>
>>>><czechCharsInUTF8>Д>ДTLTLlA?A?</czechCharsInUTF8>
>>>><ecaron>Д></ecaron>
>>>><scaron>L?</scaron>
>>>><invisibleTimesHex>вЃ?</invisibleTimesHex>
>>>><invisibleTimeDec>вЃ?</invisibleTimeDec>
>>>><visibleTimes>*</visibleTimes>
>>>><plus>+</plus>
>>>></testEncoding>
>>>>
>>>>This seems pretty good; all characters are in UTF-8. The problem is 
>>>>with InvisibleTimes again. If one wants to edit it, that is just 
>>>>impossible, because normal text editors show the sequence вЃ?, 
>>>>which nobody can understand.
>>>>
>>>>
>>>>After parsing and serializing from/to a file via a char stream, I got 
>>>>this output:
>>>>
>>>><?xml version="1.0" encoding="UTF-16"?>
>>>><testEncoding>
>>>><czechCharsInUTF8>&#xc4;>ДTLTLlA?A?</czechCharsInUTF8>
>>>><ecaron>e</ecaron>
>>>><scaron>s</scaron>
>>>><invisibleTimesHex>?</invisibleTimesHex>
>>>><invisibleTimeDec>?</invisibleTimeDec>
>>>><visibleTimes>*</visibleTimes>
>>>><plus>+</plus>
>>>></testEncoding>
>>>>
>>>>This is completely useless: some of the characters are in the win1250 
>>>>charset (ecaron and scaron), some of them are in UTF-8 (part of the 
>>>><czechCharsInUTF8> tag), and some of them are just question marks 
>>>>(the invisibleTimes tags).
>>>>
>>>>
>>>>These results leave me a bit confused about which method I should use 
>>>>to get the following result:
>>>>
>>>><?xml version="1.0" encoding="UTF-8"?>
>>>><testEncoding>
>>>><czechCharsInUTF8>Д>ДTLTLlA?A?</czechCharsInUTF8>
>>>><ecaron>Д></ecaron>
>>>><scaron>L?</scaron>
>>>><invisibleTimesHex>&#x2062;</invisibleTimesHex>
>>>><invisibleTimeDec>?</invisibleTimeDec>
>>>><visibleTimes>*</visibleTimes>
>>>><plus>+</plus>
>>>></testEncoding>
>>>>
>>>>
>>>>
>>>>Bob Foster wrote:
>>>>
>>>>        
>>>>
>>>>>As others have suggested, the problem is in JEditPane. You need to 
>>>>>tell it to use a font that can display all of your characters. 
>>>>>Unfortunately, that's platform-specific and I'm not much of a 
>>>>>JEditPane user (Eclipse/SWT for me), but somebody can probably help 
>>>>>you if you say what platform you're running on.
>>>>>
>>>>>Bob Foster
>>>>>
>>>>>Kahovec, Jakub wrote:
>>>>>
>>>>>          
>>>>>
>>>>>>It is produced by Xerces 2.6.2 (LSParser, LSSerializer and 
>>>>>>XMLSerializer). 
>>>>>>I've been using the Xerces parser and serializer in my Java authoring 
>>>>>>tool to load and save documents. I found the encoding problem when I 
>>>>>>loaded and displayed the XML document (with characters in character 
>>>>>>reference form) in the jeditpanel component. Instead of &#x002b; and 
>>>>>>&#x2062; I saw '+' and a square-like character. I tried serializing 
>>>>>>the XML document to the console as well as to a file, and loading the 
>>>>>>document via InputStream or Reader input with LSInput, but I never got 
>>>>>>results in which the character sequences kept their original form. 
>>>>>>Only when I explicitly set the encoding in LSInput to ISO-8859-1 and 
>>>>>>loaded it via InputStream did the sequence &#x2062; keep the same 
>>>>>>form, but the sequence &#x002b; was changed to the '+' character anyway.
>>>>>>Then I tried to inspect the structure of the DOM document in the 
>>>>>>debugger (in Eclipse 3.1) but saw the same results ('+' and a square 
>>>>>>character; probably it's only a problem of showing UTF-8 characters 
>>>>>>in Eclipse).
>>>>>>So, to be honest, I don't know how to find out where the problem is: 
>>>>>>whether it is during parsing, serializing or displaying the data. I'm 
>>>>>>not very experienced with encodings and charsets, but as far as I know 
>>>>>>Java handles characters internally as UTF-16. Could that be a part of 
>>>>>>the problem? I don't really know.
>>>>>>
>>>>>>Thanks for any ideas.
>>>>>>
>>>>>>Jakub
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Bob Foster [mailto:[EMAIL PROTECTED]
>>>>>>Sent: Mon 2/28/2005 10:36 PM
>>>>>>To: [EMAIL PROTECTED]
>>>>>>Subject: Re: utf-8 characters problem
>>>>>>
>>>>>>Exactly what Xerces or standard API is producing this result? Are you 
>>>>>>sure you're not looking at the result in some editor (that is using 
>>>>>>the wrong code page to represent your characters)?
>>>>>>
>>>>>>XML parsers deliver characters in Unicode. You are apparently trying 
>>>>>>to use the characters as though each character had eight bits.
>>>>>>
>>>>>>Tell us a little more about what steps you took to see what you 
>>>>>>describe and maybe someone will be able to help.
>>>>>>
>>>>>>Bob Foster
>>>>>>
>>>>>>Jakub Kahovec wrote:
>>>>>>
>>>>>>            
>>>>>>
>>>>>>>Hi,
>>>>>>>when I parse an XML document (with Xerces 2.6.2) whose XML 
>>>>>>>declaration specifies UTF-8 encoding and which contains UTF-8 
>>>>>>>characters in character reference form (&#xxxx;), 
>>>>>>>the parser replaces these references with ASCII characters. For some 
>>>>>>>characters that is OK, but InvisibleTimes, for instance, is changed 
>>>>>>>into an incorrect, strange character sequence.
>>>>>>>I'd like to know whether it is possible to prevent characters from 
>>>>>>>being changed out of character reference form. Or is there some 
>>>>>>>recommendation for how to handle these characters?
>>>>>>>
>>>>>>>Here is a piece of my 'problematic' xml document
>>>>>>>
>>>>>>><?xml version="1.0" encoding="UTF-8"?>
>>>>>>><mathDoc>
>>>>>>>
>>>>>>><p>Factorise the following quadratic expression:
>>>>>>><math>
>>>>>>><mrow>
>>>>>>><msup>
>>>>>>><mrow>
>>>>>>><mi>x</mi>
>>>>>>></mrow>
>>>>>>><mrow>
>>>>>>><mn>2</mn>
>>>>>>></mrow>
>>>>>>></msup>
>>>>>>><mo>&#x002b;</mo> <!-- replaces with character + -->
>>>>>>><mi>p</mi>
>>>>>>><mo>&#x2062;</mo> <!-- here is InvisibleTimes -->
>>>>>>><mi>x</mi>
>>>>>>><mo>&#x002b;</mo> <!-- replaces with character + -->
>>>>>>><mi>q</mi>
>>>>>>></mrow>
>>>>>>></math>
>>>>>>>
>>>>>>></mathDoc>
>>>>>>>
>>>>>>>Thanks so much
>>>>>>>
>>>>>>>Jakub
>>>>>>>              
>>>>>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>For additional commands, e-mail: [EMAIL PROTECTED]