Hello OpenJDK team,
I would like to seek clarification on a behavior observed when performing an
XSL transformation followed by XML parsing.
Problem Description:
A SAXParseException is thrown when parsing the result of a Java XSL
transformation that uses HTML output and contains accented characters
represented as named character entities.
Scenario:
We perform an XSL transformation using `Transformer`, and then attempt to parse
the resulting output using `DocumentBuilder`.
When the XSLT uses:
<xsl:output method="html" encoding="UTF-8" indent="yes"/>
the transformation succeeds, but parsing the result fails with the following
error:
[Fatal Error] :4:98: The entity "eacute" was referenced, but not declared.
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 98; The entity "eacute" was referenced, but not declared.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:338)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at HTMLEntityParsingTest.main(HTMLEntityParsingTest.java:40)
However, when we change the XSLT output method to `xml`, as shown below, the
issue does not occur:
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
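A minimal sketch of the scenario follows. It uses a hypothetical inline stylesheet and input document in place of the real files, and the class and method names (`HtmlOutputRepro`, `transform`, `parsesAsXml`) are illustrative only. It runs the same transformation with each output method and reports whether `DocumentBuilder` accepts the result:

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HtmlOutputRepro {

    // Hypothetical stylesheet: copies the text of <doc> into a <p> element,
    // using the given output method ("html" or "xml").
    static String stylesheet(String method) {
        return "<xsl:stylesheet version=\"1.0\""
             + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
             + "<xsl:output method=\"" + method + "\" encoding=\"UTF-8\" indent=\"yes\"/>"
             + "<xsl:template match=\"/doc\">"
             + "<p><xsl:value-of select=\".\"/></p>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Transforms a small input containing an accented character.
    static String transform(String method) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet(method))));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader("<doc>caf\u00e9</doc>")),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the text is accepted by the default XML DocumentBuilder.
    static boolean parsesAsXml(String text) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String method : new String[] {"html", "xml"}) {
            String result = transform(method);
            System.out.println(method + " output parses as XML: " + parsesAsXml(result));
            System.out.println(result);
        }
    }
}
```

Printing the transformed text next to the parse result makes it easy to see which characters the serializer rewrote for each output method.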
Observation:
It appears that the HTML output contains named character entities such as
`&eacute;`, which the XML parser does not recognize.
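To illustrate the observation independently of XSLT, here is a small sketch that feeds three hand-written inputs to `DocumentBuilder`: one with an undeclared `&eacute;`, one with the numeric reference `&#233;`, and one that declares `eacute` in an internal DTD subset. XML itself predefines only `&lt;`, `&gt;`, `&amp;`, `&apos;` and `&quot;`. The class and helper names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EntityCheck {

    // Returns the root element's text content, or null if the parser rejects the input.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Named HTML entity with no declaration in scope.
        System.out.println(rootText("<p>caf&eacute;</p>"));
        // Numeric character reference: needs no declaration.
        System.out.println(rootText("<p>caf&#233;</p>"));
        // Same named entity, declared in an internal DTD subset.
        System.out.println(rootText(
                "<!DOCTYPE p [<!ENTITY eacute \"&#233;\">]><p>caf&eacute;</p>"));
    }
}
```

This suggests the parse failure comes from feeding HTML-serialized text to an XML parser, not from the accented character itself.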
Could you please confirm whether this behavior is expected, or if this could be
considered a bug or limitation in the current implementation?
Releases:
The issue occurs consistently in all OpenJDK versions (JDK 8 and above).
Thanks and regards,
Shruthi