RE: Unicode entity resolved on reading document

Jan Tošovský Tue, 31 Mar 2009 12:43:10 -0700

Hi Paul, 
 
Your problem can be solved by cloaking the document before parsing and
uncloaking it afterwards. Cloaking hides the DTD and modifies all entities
so they are left untouched during processing. Uncloaking is the reverse
process. I had the same problem with one special operation on a DocBook
file. A Perl script can be found here:
http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/contrib/tools/cloak/
I also have my own VBScript variant for Windows, so there is no need to
install Perl there.
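
To illustrate the idea only (this is not the Perl script linked above, and it
does not hide the DTD), here is a minimal Python sketch. The placeholder
string and the function names cloak/uncloak are my own assumptions; the
placeholder must be something that never occurs in the real document.

```python
import re

# Hypothetical placeholder; it must not occur anywhere in the real document.
MARK = "__CLOAKED_AMP__"

# Matches decimal, hexadecimal and named references: &#233; &#xE9; &eacute;
ENTITY_RE = re.compile(r"&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][\w.-]*);")

def cloak(text):
    """Replace the '&' of every reference so a parser sees plain text."""
    return ENTITY_RE.sub(lambda m: MARK + m.group(1) + ";", text)

def uncloak(text):
    """Reverse of cloak(): restore the original '&' characters."""
    return text.replace(MARK, "&")

original = '<text>caf&#xE9; &amp; &nbsp;</text>'
hidden = cloak(original)
# ... parse, edit and serialize `hidden` with the XML tool of your choice ...
restored = uncloak(hidden)
```

Because the cloaked text contains no references at all, the parser has
nothing to resolve, and uncloaking the serialized result brings the original
references back character for character.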
 
Jan
 
  _____


From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, March 31, 2009 11:47 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Unicode entity resolved on reading document




Hi Paul,

Paul Wellner Bou <[email protected]> wrote on 03/31/2009 02:57:17 AM:

> [email protected] wrote:
> >    I think it's better to explain why this is a problem for you.
> > As long as the text encoding is correct there shouldn't be any
> > problem with replacing the character... So why is there a problem?
> 
> The problem is not technical in this case. It is a question of slightly
> correcting some data in the SVG and writing it to a new file which
> should be as similar as possible to the original file. This is
> required because the people who check the file will compare it
> with the original, don't have much knowledge of XML/SVG, and will
> reject it if there are modified lines that have nothing to do
> with the correction.

   Then you will either need to educate them or write a tool that
operates on the raw text stream.  You could potentially write a
post-processing step that entifies any characters outside the
7-bit ASCII range.  It might give almost the same output as the input...
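
   A rough sketch of such a post-processing step in Python, assuming the
serialized file is available as a string. The function name entify and its
use_hex switch are hypothetical; whether the original file used decimal or
hexadecimal references is something you would have to check, and characters
that were literal non-ASCII in the original would be entified as well.

```python
def entify(text, use_hex=True):
    """Replace every character outside 7-bit ASCII with a numeric reference."""
    if use_hex:
        return "".join(
            ch if ord(ch) < 128 else "&#x%X;" % ord(ch) for ch in text
        )
    # The standard "xmlcharrefreplace" error handler emits decimal references.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

hex_form = entify("caf\u00e9")
dec_form = entify("caf\u00e9", use_hex=False)
```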


> So it is not possible to use an XML parser without replacing entities?

   No, and even if it were, Batik would fail on valid input:
        <rect fill="&#x23;&#x46;&#x46;&#x30;&#x30;&#x30;&#x30;" 
              x="0" y="0" width="200" height="200"/> 

   So it's likely not useful...
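
   The character references in that fill attribute spell out an ordinary
color value, which a consumer only sees if the parser resolves them. A small
check, using Python's xml.etree here purely for illustration in place of the
Java parser that Batik uses:

```python
import xml.etree.ElementTree as ET

svg = ('<rect fill="&#x23;&#x46;&#x46;&#x30;&#x30;&#x30;&#x30;" '
       'x="0" y="0" width="200" height="200"/>')

# The parser resolves the references before handing the value over.
fill = ET.fromstring(svg).get("fill")
print(fill)  # prints #FF0000
```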
