Message: A new issue has been created in JIRA.
---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESJ-1019

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESJ-1019
    Summary: produces invalid character reference
       Type: Bug

     Status: Unassigned
   Priority: Minor

    Project: Xerces2-J
 Components: Serialization
   Versions: 2.0.2
             2.6.2

   Assignee:
   Reporter: Thomas Bensler

    Created: Fri, 8 Oct 2004 10:33 AM
    Updated: Fri, 8 Oct 2004 10:33 AM
Environment: W2K, Sun JDK 1.4.2

Description:
When an org.w3c.dom.Document contains a text node with control characters below 0x20 (e.g. 0x0b), these characters end up in the XML file, written as character references. The code snippet below demonstrates it:

----------------------- 8< -----------------------
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final File file = new File("E:\\temp\\illegalCharacter.xml");
final FileOutputStream fout = new FileOutputStream(file);
final XMLSerializer serializer = new XMLSerializer();
final OutputFormat outFormat = new OutputFormat();
final DocumentImpl doc = new DocumentImpl();
final Element rootElement = doc.createElement("rootelement");
final DOMParser parser = new DOMParser();

// build a document whose only text node holds the control character 0x0b
doc.appendChild(rootElement);
rootElement.appendChild(doc.createTextNode(new String(new char[] {11})));

// serialize the document to the file
outFormat.setEncoding("UTF-8");
outFormat.setIndenting(false);
serializer.setOutputFormat(outFormat);
serializer.setOutputByteStream(fout);
serializer.serialize(doc);
fout.close();

// reparsing the serialization result
parser.parse(new InputSource(new FileInputStream(file)));
----------------------- 8< -----------------------

The produced XML file looks like this:

----------------------- 8< -----------------------
<?xml version="1.0" encoding="UTF-8"?>
<rootelement>&#xb;</rootelement>
----------------------- 8< -----------------------

Reparsing the file fails:

[Fatal Error] :2:19: Character reference "&#b" is an invalid XML character.

As I understand the XML spec, the parser is right to reject the file.
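For reference, the XML 1.0 Char production allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, which is why a reference to 0x0b is rejected. Below is a minimal sketch of that check, together with one way a caller might replace offending characters before building a text node. It is not Xerces code: the class and method names (XmlChars, isXmlChar, sanitize) are hypothetical, and it assumes Java 5 or later for the code point methods.

```java
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point outside Char (including lone surrogates) with the placeholder. */
    static String sanitize(String text, char placeholder) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(placeholder);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x0b (vertical tab) is not a legal XML 1.0 character
        System.out.println(isXmlChar(0x0B));
        System.out.println(sanitize("a\u000Bb", '?'));
    }
}
```

Applied to the snippet above, the text node would be created from XmlChars.sanitize(userInput, '?') rather than from the raw string; whether to substitute, drop, or reject such characters is a policy choice left open here.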
So I think the serializer should replace illegal characters with some legal placeholder character (a space or '?').

The whole case came up in a content management system that uses Xerces2-J for parsing and serializing. Content typed into JTextFields by users is put into text nodes of a DOM tree and serialized. One user picked up the 0x0b character while copying and pasting from a PowerPoint presentation. Even if such characters are not very common in a Java String, I think the serializer should handle them without producing invalid XML.

If you define the right behaviour for handling control characters (which characters should be replaced by which placeholder), I would like to provide a patch (a hint about the classes involved would be appreciated).

Thanks for listening!

Ciao,
Thomas.
