[
https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811716#comment-17811716
]
Joseph Kesselman commented on XALANJ-2725:
------------------------------------------
Looking back to the XML Recommendation:
{quote}It is a [fatal error|https://www.w3.org/TR/xml/#dt-fatal] if an XML
entity is determined (via default, encoding declaration, or higher-level
protocol) to be in a certain encoding but contains byte sequences that are not
legal in that encoding. Specifically, it is a fatal error if an entity encoded
in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9
of Unicode [[Unicode]|https://www.w3.org/TR/xml/#Unicode]. Unless an encoding
is determined by a higher-level protocol, it is also a [fatal
error|https://www.w3.org/TR/xml/#dt-fatal] if an XML entity contains no
encoding declaration and its content is not legal UTF-8 or UTF-16.
{quote}
Admittedly that refers to parsing rather than generating. I'd still argue that
if the user asked for XML or HTML output, generating ill-formed output is
probably not a good idea, and numeric character references (in XML, at least)
are explicitly forbidden from representing isolated surrogate units. So I am
leaning toward throwing that exception rather than continuing to cheat, even if
that would be a behavior change.
But again, this is threatening to become a redesign rathole, which really ought
to be its own work item.
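To make the "throw rather than cheat" option concrete, here is a minimal sketch of what rejecting isolated surrogates could look like. It is not the Xalan serializer code: the class and method names are hypothetical, and it uses an unchecked exception purely to keep the example short.

```java
// Hypothetical illustration only; not part of org.apache.xml.serializer.
class SurrogateCheck {

    /**
     * Replaces every non-ASCII code point with a hexadecimal numeric
     * character reference. A well-formed surrogate pair becomes ONE
     * reference to the combined code point; an isolated surrogate is
     * rejected instead of being written out. Markup characters such as
     * '<' and '&' are deliberately left alone to keep the sketch small.
     */
    static String escapeNonAscii(CharSequence text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1))) {
                    int cp = Character.toCodePoint(c, text.charAt(i + 1));
                    appendReference(out, cp);
                    i++; // the low surrogate has been consumed as well
                } else {
                    throw new IllegalArgumentException(
                        "Isolated high surrogate at index " + i);
                }
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException(
                    "Isolated low surrogate at index " + i);
            } else if (c > 0x7F) {
                appendReference(out, c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    private static void appendReference(StringBuilder out, int codePoint) {
        out.append("&#x")
           .append(Integer.toHexString(codePoint).toUpperCase())
           .append(';');
    }
}
```

With this sketch, U+1F600 supplied as the pair D83D DE00 is written as a single reference, while a lone D83D raises the exception rather than producing a reference that XML does not allow.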
> Possible buffer-boundary issue when serializing surrogate pairs
> ---------------------------------------------------------------
>
> Key: XALANJ-2725
> URL: https://issues.apache.org/jira/browse/XALANJ-2725
> Project: XalanJ2
> Issue Type: Improvement
> Security Level: No security risk; visible to anyone(Ordinary problems in
> Xalan projects. Anybody can view the issue.)
> Components: Serialization
> Reporter: Joe Kesselman
> Assignee: Joe Kesselman
> Priority: Major
> Labels: Surrogates, escaping, unicode, utf
> Attachments: astral-chars-split-buffer.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a
> surrogate pair (two UTF-16 units), were not being serialized correctly. We
> have a proposed fix for that.
> There is reported to still be an edge case when a surrogate pair which
> crosses buffer boundaries might not be handled correctly. [~maxfortun]
> offered what looks like a reasonable proposed fix
> (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607),
> but in my testing this was not serializing the surrogate pairs correctly,
> causing regression on the tests XALANJ-2419 introduced. I don't know whether
> that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such
> solution.
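The buffer-boundary edge case in the issue description can be illustrated with a small stand-alone sketch. It assumes text arrives in successive character chunks and that a high surrogate may be the last unit of one chunk while its low surrogate opens the next. The class below is hypothetical and is neither the ToStream code nor the proposed patch; it only shows the idea of carrying a pending high surrogate across calls.

```java
// Hypothetical illustration only; not the Xalan ToStream implementation.
class ChunkedEscaper {
    private final StringBuilder out = new StringBuilder();
    // High surrogate seen but not yet paired; 0 means nothing is pending.
    private char pendingHigh = 0;

    /** Accepts one chunk; a surrogate pair may straddle two calls. */
    void characters(char[] buf, int start, int length) {
        int end = start + length;
        for (int i = start; i < end; i++) {
            char c = buf[i];
            if (pendingHigh != 0) {
                if (!Character.isLowSurrogate(c)) {
                    throw new IllegalArgumentException(
                        "High surrogate not followed by low surrogate");
                }
                emit(Character.toCodePoint(pendingHigh, c));
                pendingHigh = 0;
            } else if (Character.isHighSurrogate(c)) {
                // Defer output: the partner may be in this or the next chunk.
                pendingHigh = c;
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException("Isolated low surrogate");
            } else {
                emit(c);
            }
        }
    }

    /** Must be called once after the last chunk. */
    void endDocument() {
        if (pendingHigh != 0) {
            throw new IllegalArgumentException(
                "Input ended with an isolated high surrogate");
        }
    }

    String result() {
        return out.toString();
    }

    // Writes ASCII as-is and everything else as a numeric character reference.
    private void emit(int codePoint) {
        if (codePoint > 0x7F) {
            out.append("&#x")
               .append(Integer.toHexString(codePoint).toUpperCase())
               .append(';');
        } else {
            out.append((char) codePoint);
        }
    }
}
```

Feeding the chunks "a" + D83D and then DE00 + "b" yields the same output as feeding the whole string at once, which is the property a buffer-boundary regression test would want to check.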