[
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438878#comment-16438878
]
Jesper Steen Møller commented on XALANJ-2419:
---------------------------------------------
[~thetaphi]: Now I see what you mean (perhaps): Yes, there is a very tricky
similar bug in the attribute values of ToHTMLStream, but not in the general
case (I think it's OK due to lines 1440-1447 in writeAttrString, but have *not*
tested this.)
I only see the issue for ToHTMLStream in the case of URL attributes such as
A#HREF, where the output has explicitly been set to *not* be encoded as a URL
(line 1294 in writeAttrURI). The default is to escape HTML attributes
containing URLs using URL-encoding, unless overridden with
xalan:use-url-escaping=no in the XSLT output options.
(As an aside: I'm no expert, but the UTF-8 encoder inside the URL-encoding
(lines 1208-1285 in writeAttrURI) seems legit, if a little verbose, instead of
just doing String.getBytes(UTF_8) and hexing that)
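
For illustration, the shorter approach mentioned in the aside might look
roughly like the sketch below. This is not the Xalan code: the class and
method names are hypothetical, and it only percent-encodes non-ASCII
characters, passing ASCII through untouched (the real writeAttrURI handles
more cases). Walking the string by code point keeps a surrogate pair together
so it is encoded as one four-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the Xalan implementation.
final class PercentEncodeSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Percent-encodes each non-ASCII code point via its UTF-8 bytes. */
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);           // joins a valid surrogate pair
            int len = Character.charCount(cp);   // 1 or 2 UTF-16 units
            if (cp < 0x80) {
                out.append((char) cp);           // ASCII passes through as-is
            } else {
                byte[] bytes = s.substring(i, i + len)
                                .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%')
                       .append(HEX[(b >> 4) & 0xF])
                       .append(HEX[b & 0xF]);
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("caf\u00E9"));
        System.out.println(escapeNonAscii("a\uD83D\uDE00"));
    }
}
```

Note that String.getBytes replaces an unpaired surrogate with '?', so a
production version would probably want to detect and reject that case first.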
My v2 fix above does *not* address the corner case in line 1294.
> Astral characters written as a pair of NCRs with the surrogate scalar values
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
> Issue Type: Bug
> Components: Serialization
> Affects Versions: 2.7.1
> Reporter: Henri Sivonen
> Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-tests-v2.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
> This leads to the wrong (latter) if branch running for surrogates, because
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters
> in it ends up in an ill-formed serialization and does not parse back using an
> XML parser.
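
To make the failure mode in the description concrete, here is a minimal
sketch of how a surrogate pair could be combined into one code point before
an NCR is written. It is not the attached v2 fix and not the ToStream code:
the class and method names are hypothetical, and it leaves markup characters
such as '&' and '<' alone, since it only illustrates the surrogate handling.

```java
// Hypothetical sketch, not the ToStream implementation or the v2 patch.
final class SurrogateNcrSketch {

    /** Writes every non-ASCII character as an NCR, one NCR per code point. */
    static String toNcr(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Combine the pair into the astral code point first.
                int cp = Character.toCodePoint(ch, s.charAt(i + 1));
                out.append("&#").append(cp).append(';');
                i += 2;
            } else if (Character.isSurrogate(ch)) {
                // A lone surrogate has no valid NCR; refuse to emit one.
                throw new IllegalArgumentException(
                        "Unpaired surrogate at index " + i);
            } else if (ch < 0x80) {
                out.append(ch);                  // ASCII passes through
                i++;
            } else {
                out.append("&#").append((int) ch).append(';');
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the surrogate pair D83D DE00 in a Java String.
        System.out.println(toNcr("\uD83D\uDE00"));
    }
}
```

For U+1F600 this yields the single reference &#128512;, whereas escaping each
UTF-16 unit separately, as the quoted fallback branch does, would give
&#55357;&#56832;, which an XML parser rejects.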