Hi! I have an issue parsing XML containing Unicode strings with surrogate characters (Xerces 2.11.0). The following exception is thrown:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 18; Character reference "�" is an invalid XML character.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)

Simple code to reproduce the issue:

    byte[] enc1 = new byte[] {(byte)0xd8, 0x40, (byte)0xdc, 0x2a};
    String result = new String(enc1, "UTF-16");
    System.out.println(result); // Outputs 𠀪 correctly

    String saml = "<name>lz1��.cct.cm</name>";
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(new InputSource(new StringReader(saml))); // Throws exception

Do I parse the XML correctly? The XML I parse contains the following string:

    lz1𠀪.cct.cm
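
For comparison, here is a minimal sketch under one assumption: that the two characters after "lz1" in the XML are separate numeric character references to the surrogate halves 0xD840 and 0xDC2A (the same values as in the byte array above). XML 1.0 excludes the range 0xD800 to 0xDFFF from its legal characters, so a reference to a lone surrogate is rejected, whereas a single reference to the combined code point U+2002A is legal. The class name SurrogateRefDemo and the helpers toCharRef and parseName are made up for illustration, and the sketch uses whatever JAXP parser is on the classpath rather than Xerces 2.11.0 specifically.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original reproduction code.
class SurrogateRefDemo {

    // Combine a UTF-16 surrogate pair into ONE numeric character reference
    // for the supplementary code point, e.g. 0xD840 + 0xDC2A -> &#x2002A;
    static String toCharRef(char high, char low) {
        int codePoint = Character.toCodePoint(high, low);
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    // Parse the XML string and return the text content of the root element.
    static String parseName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String ref = toCharRef((char) 0xD840, (char) 0xDC2A);
        String xml = "<name>lz1" + ref + ".cct.cm</name>";
        String text = parseName(xml);
        // The character after "lz1" should come back as one code point.
        System.out.println(Integer.toHexString(text.codePointAt(3)));
    }
}
```

If the XML is produced by code I control, the same idea would apply on the writing side: emit either the literal character in a UTF-8 or UTF-16 encoded document, or one reference per code point, never one reference per UTF-16 code unit.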