Astral characters written as a pair of NCRs with the surrogate scalar values
when using UTF-8
---------------------------------------------------------------------------------------------
Key: XALANJ-2419
URL: https://issues.apache.org/jira/browse/XALANJ-2419
Project: XalanJ2
Issue Type: Bug
Components: Serialization
Affects Versions: 2.7.1
Reporter: Henri Sivonen
org.apache.xml.serializer.ToStream contains the following code:
    else if (m_encodingInfo.isInEncoding(ch)) {
        // If the character is in the encoding, and
        // not in the normal ASCII range, we also
        // just leave it get added on to the clean characters
    }
    else {
        // This is a fallback plan, we should never get here
        // but if the character wasn't previously handled
        // (i.e. isn't in the encoding, etc.) then what
        // should we do? We choose to write out an entity
        writeOutCleanChars(chars, i, lastDirtyCharProcessed);
        writer.write("&#");
        writer.write(Integer.toString(ch));
        writer.write(';');
        lastDirtyCharProcessed = i;
    }
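As a result, surrogates take the wrong (latter) branch: for UTF-8,
isInEncoding() returns false for each surrogate code unit, so each half of a
pair is written as its own NCR. Escaping a surrogate as an NCR is always wrong,
regardless of encoding, because a character reference must name a legal XML
character and surrogate code points are not legal XML characters. If an astral
character has to be escaped at all, the pair must first be combined into a
single scalar value.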
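For illustration only, here is a minimal sketch of how a surrogate pair could be combined into one code point before an NCR is written. The class SurrogateNcrSketch and the methods writeNcr and escapeAll are hypothetical names invented for this example; they are not part of ToStream and this is not a proposed patch. For UTF-8 output the character would normally be written directly rather than escaped, so the sketch only covers the case where escaping is actually needed.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper class for illustration; not Xalan code.
final class SurrogateNcrSketch {

    // Writes the character starting at chars[i] as a single NCR and
    // returns how many UTF-16 code units were consumed (1 or 2).
    static int writeNcr(Writer writer, char[] chars, int i, int end) throws IOException {
        char ch = chars[i];
        if (Character.isHighSurrogate(ch)
                && i + 1 < end
                && Character.isLowSurrogate(chars[i + 1])) {
            // Combine the pair into one scalar value before escaping.
            int codePoint = Character.toCodePoint(ch, chars[i + 1]);
            writer.write("&#");
            writer.write(Integer.toString(codePoint));
            writer.write(';');
            return 2;
        }
        if (Character.isSurrogate(ch)) {
            // An unpaired surrogate cannot be represented in XML at all.
            // Throwing is an assumption of this sketch, not Xalan's behavior.
            throw new IOException("Unpaired surrogate: 0x" + Integer.toHexString(ch));
        }
        writer.write("&#");
        writer.write(Integer.toString(ch));
        writer.write(';');
        return 1;
    }

    // Convenience wrapper for the example: escapes every character of s.
    static String escapeAll(String s) {
        StringWriter out = new StringWriter();
        char[] chars = s.toCharArray();
        try {
            for (int i = 0; i < chars.length; ) {
                i += writeNcr(out, chars, i, chars.length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With this sketch, the pair D83D DE00 (U+1F600) comes out as one reference, &#128512;, instead of the two surrogate references &#55357;&#56832; that the current fallback branch writes.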
The practical effect of this bug is that any document containing astral
characters is serialized as ill-formed XML, which an XML parser will refuse to
parse back.
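The ill-formedness can be checked with the JAXP parser that ships with the JDK. The following is a small sketch, not a test from the Xalan suite; the class SurrogateNcrParseCheck and the method isWellFormed are hypothetical names, and the two sample documents are made up for this example (U+1F600 written as a pair of surrogate NCRs versus a single NCR).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check for illustration; not part of Xalan.
final class SurrogateNcrParseCheck {

    // Returns true if the parser accepts the document, false if it
    // reports a well-formedness error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, including illegal
            // character references, as SAXParseException.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Surrogate halves escaped separately, as the fallback branch does.
        System.out.println(isWellFormed("<r>&#55357;&#56832;</r>"));
        // The same character escaped as one scalar value.
        System.out.println(isWellFormed("<r>&#128512;</r>"));
    }
}
```

The first document should be rejected and the second accepted, since a character reference has to refer to a legal XML character and surrogate code points are excluded from that set.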