DO NOT REPLY [Bug 7065] New: - Xerces encodes strange characters but can't parse them

bugzilla Tue, 12 Mar 2002 14:39:16 -0800

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=7065>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=7065

Xerces encodes strange characters but can't parse them

           Summary: Xerces encodes strange characters but can't parse them
           Product: Xerces-J
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Core
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


This may be a failing of my understanding of XML, but I've always been a strong 
believer that if a framework can generate a document, it should be able to 
parse it as well.  The following code generates an XML document that cannot be 
parsed by Xerces.  The code and output follow:

Code:
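    package testclassloader;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.apache.xerces.dom.DocumentImpl;
    import org.apache.xerces.parsers.DOMParser;
    import org.apache.xml.serialize.OutputFormat;
    import org.apache.xml.serialize.XMLSerializer;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.xml.sax.InputSource;

    public class TestXerces {
        public static void main(String[] args) throws Exception {
            // Byte 28 is the ASCII file separator (U+001C), a control character
            byte[] bytes = { 28 };

            // Create the document
            Document document = new DocumentImpl();
            Element root = document.createElement("TEST");
            Node child = document.createTextNode(new String(bytes));
            root.appendChild(child);
            document.appendChild(root);

            // Serialize the document to a String
            ByteArrayOutputStream outStream = new ByteArrayOutputStream();
            OutputFormat format = new OutputFormat(document);
            XMLSerializer serial = new XMLSerializer(outStream, format);
            serial.asDOMSerializer();
            serial.serialize(document.getDocumentElement());
            outStream.flush();
            String xml = outStream.toString();

            // Print out the text interpretation of the XML document
            System.out.println(xml);

            // Reparse the text into XML; this is the step that fails
            ByteArrayInputStream inputStream =
                new ByteArrayInputStream(xml.getBytes());
            DOMParser parser = new DOMParser();
            InputSource inputSource = new InputSource(inputStream);
            parser.parse(inputSource);
            document = parser.getDocument();
        }
    }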

Output:
<?xml version="1.0" encoding="UTF-8"?>
<TEST>&#x1c;</TEST>

[Fatal Error] :2:13: Character reference "&#1c" is an invalid XML character.

org.xml.sax.SAXParseException: Character reference "&#1c" is an invalid XML 
character.

        at org.apache.xerces.parsers.DOMParser.parse(DOMParser.java:235)

        at testclassloader.TestXerces.main(TestXerces.java:53)

Exception in thread "main" 


This particular test was run with Xerces 2.0.1, but I've had similar results 
with 1.4.4, though the escaped character in its output is different.

While I realize that character 28 is not a valid character under the XML spec, 
I am curious why Xerces will create a text node from, or serialize a document 
containing, an invalid character.
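
For reference, the check the parser is applying can be sketched in a few
lines. This is only an illustration of the Char production from the XML 1.0
spec, not code taken from Xerces; the class and method names (Xml10Chars,
isValid, firstInvalidIndex) are made up for this example.

```java
final class Xml10Chars {
    private Xml10Chars() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char XML 1.0 cannot represent, or -1 if none. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValid(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Character 28 (0x1C) falls below 0x20 and is not tab, LF, or CR
        System.out.println(isValid(28));
        System.out.println(firstInvalidIndex("ab\u001cd"));
    }
}
```

Because XML 1.0 applies the same restriction to character references, writing
the character as &#x1c; does not make it legal, which matches the parser
error shown above.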

Also, is there any way to properly encode this document or do I need to 
manually escape my node text before encoding?
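
One possible way to do the manual step is to filter the text before it is
handed to createTextNode. The sketch below is an assumption about how that
could look, not a Xerces feature: XmlTextSanitizer and sanitize are
hypothetical names, and whether to drop an invalid character or substitute
something for it is a choice the application has to make.

```java
import java.nio.charset.StandardCharsets;

final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** XML 1.0 Char production; repeated here so the class stands alone. */
    private static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns a copy of text in which every character that XML 1.0 cannot
     * represent is replaced by the given string (use "" to drop it).
     */
    static String sanitize(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same byte as in the report, surrounded by ordinary letters
        byte[] bytes = { 'a', 28, 'b' };
        String raw = new String(bytes, StandardCharsets.US_ASCII);
        System.out.println(sanitize(raw, ""));   // prints "ab"
        System.out.println(sanitize(raw, "?"));  // prints "a?b"
    }
}
```

In the test program above this would be used as
document.createTextNode(XmlTextSanitizer.sanitize(new String(bytes), "")).
Note that this loses the original character; if the bytes must survive a
round trip, they would need some other representation agreed on by both the
writer and the reader.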

Thanks for your time and for working on a fantastic open-source project.

