Message:

  A new issue has been created in JIRA.

---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESJ-1019

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESJ-1019
    Summary: produces invalid character reference
       Type: Bug

     Status: Unassigned
   Priority: Minor

    Project: Xerces2-J
 Components: 
             Serialization
   Versions:
             2.0.2
             2.6.2

   Assignee: 
   Reporter: Thomas Bensler

    Created: Fri, 8 Oct 2004 10:33 AM
    Updated: Fri, 8 Oct 2004 10:33 AM
Environment: W2K, Sun JDK 1.4.2

Description:
When an org.w3c.dom.Document contains a text node with control characters below 0x20, e.g.
0x0b, these characters end up in the XML file, encoded as character references.

The code snippet below demonstrates it:
----------------------- 8< ----------------------- 
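import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final File          file        = new File("E:\\temp\\illegalCharacter.xml");
final FileOutputStream  fout    = new FileOutputStream(file);
final XMLSerializer serializer  = new XMLSerializer();
final OutputFormat  outFormat   = new OutputFormat();
final DocumentImpl  doc         = new DocumentImpl();
final Element       rootElement = doc.createElement("rootelement");
final DOMParser     parser      = new DOMParser();

doc.appendChild(rootElement);

// text node holding a single control character (0x0b, vertical tab)
rootElement.appendChild(doc.createTextNode(new String(new char[] {11})));
outFormat.setEncoding("UTF-8");
outFormat.setIndenting(false);
serializer.setOutputFormat(outFormat);
serializer.setOutputByteStream(fout);
serializer.serialize(doc);
fout.close();
// reparsing the serialization result
parser.parse(new InputSource(new FileInputStream(file)));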
----------------------- 8< ----------------------- 

The produced XML file looks like this:
----------------------- 8< ----------------------- 
<?xml version="1.0" encoding="UTF-8"?>
<rootelement>&#xb;</rootelement>
----------------------- 8< ----------------------- 

Reparsing the file fails:
[Fatal Error] :2:19: Character reference "&#b" is an invalid XML character.

As I understand the XML spec, the parser is right to reject the file. So I think the 
serializer should replace illegal characters with some legal placeholder character 
(space or '?').
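
A minimal sketch of that replacement idea, assuming the XML 1.0 Char production 
(#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) as the 
definition of a legal character. The class and method names (XmlCharSanitizer, 
isXml10Char, sanitize) are hypothetical and not part of Xerces, and the code uses 
code point methods that need Java 5 or later. It only illustrates the filtering 
step, not where it would be hooked into the serializer.

```java
// Hypothetical helper, not part of Xerces: replaces characters that are
// not allowed by the XML 1.0 Char production with a placeholder.
final class XmlCharSanitizer {
    private XmlCharSanitizer() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies text, substituting placeholder for every illegal code point. */
    static String sanitize(String text, char placeholder) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);       // unpaired surrogates come back as-is
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(placeholder);        // e.g. ' ' or '?'
            }
        }
        return out.toString();
    }
}
```

With such a helper, the text from the snippet above could be passed through 
XmlCharSanitizer.sanitize(text, '?') before calling doc.createTextNode(...), as a 
workaround on the application side.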

The whole case came up in a content management system using Xerces2-J for parsing and 
serializing. The content typed into JTextFields by users is put into text nodes of a 
DOM tree and serialized. One user picked up the 0x0b character by copying and pasting 
from a PowerPoint presentation. Even if it is not very common to have this kind of 
character in a Java String, I think the serializer should handle it without producing 
invalid XML.

If you define the right behaviour for handling the control characters (which chars 
should be replaced by which placeholder), I would like to provide a patch (a hint about 
the classes involved would be appreciated).

Thanks for listening!

Ciao, Thomas.



---------------------------------------------------------------------
JIRA INFORMATION:
This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa

If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


