Hi!

I have an issue parsing XML containing Unicode strings with surrogate 
characters (Xerces 2.11.0). The following exception is thrown:


        org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 18; Character reference "&#55360" is an invalid XML character.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)

Simple code to reproduce the issue:

         byte[] enc1 = new byte[] {(byte)0xd8, 0x40, (byte)0xdc, 0x2a};
         String result = new String(enc1, "UTF-16");
         System.out.println(result); // Outputs 𠀪 correctly

         String saml = "<name>lz1&#55360;&#56362;.cct.cm</name>";
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         DocumentBuilder builder = factory.newDocumentBuilder();
         Document document = builder.parse(new InputSource(new StringReader(saml))); // Throws exception


Am I parsing the XML correctly?

The XML I parse contains the following string:

lz1𠀪.cct.cm
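
One thing that may be relevant here: 55360 and 56362 are 0xD840 and 0xDC2A, the two UTF-16 surrogate halves of a single character. XML character references name code points rather than UTF-16 code units, so a surrogate value on its own is not a legal XML character. The sketch below is only an illustration under that assumption. The class name SurrogateRefDemo and its helper methods are made up for this example, and it uses whatever JAXP parser the JDK provides rather than Xerces 2.11.0 specifically. It combines the pair into one code point and parses the same element with a single reference.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original report.
class SurrogateRefDemo {

    // Combine a UTF-16 surrogate pair into the single code point it encodes.
    public static int toCodePoint(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    // Parse a small XML string and return the text content of its root element.
    // Checked exceptions are wrapped so callers do not have to declare them.
    public static String parseText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // 55360 = 0xD840 (high surrogate), 56362 = 0xDC2A (low surrogate)
        int codePoint = toCodePoint(55360, 56362);
        System.out.println(codePoint); // prints 131114, i.e. U+2002A

        // One reference for the whole code point instead of one per surrogate.
        String xml = "<name>lz1&#" + codePoint + ";.cct.cm</name>";
        String text = parseText(xml);

        // The parsed Java String holds the character as a surrogate pair again.
        System.out.println(text.codePointAt(3) == codePoint);
    }
}
```

If this reading is right, the producer of the XML would need to emit &#131114; (or &#x2002A;) rather than &#55360;&#56362;, or write the character literally in a Unicode encoding such as UTF-8.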

