[ 
https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818226#comment-17818226
 ] 

Joseph Kesselman commented on XALANJ-2725:
------------------------------------------

Resuming looking at this.

The main issue I see with using a state variable is that, done right, it seems 
to force every pass through the rendering loop to check for "high surrogate was 
encountered but not followed by a low surrogate." I can move the surrogate 
handling to the top of the loop, but I'm slightly concerned about performance, 
even though "Is this a high/low surrogate" should be a fast and simple mask 
test that could be inlined during JIT compilation. And we really should 
consider what happens if the high surrogate is at the end of the buffer but 
the next call is an event other than characters(); that may mean elements and 
such also need to check for a leftover high surrogate and handle it. Ditto for 
switching into or out of CDATA section rendering.
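
To make the state-variable idea concrete, here is a minimal sketch, not the actual ToStream code. The class name, the method names other than the SAX-style characters(), and the "[isolated U+XXXX]" placeholder are all hypothetical, chosen only for illustration. It assumes an output encoding that cannot represent astral characters directly, so a completed pair is written as a numeric character reference. The point is that the pending-high check sits at the top of the loop and that non-character events also have to flush a leftover high surrogate.

```java
public class PendingSurrogateSketch {
    private final StringBuilder out = new StringBuilder();
    // State variable: 0 means no high surrogate is carried over from earlier input.
    private char pendingHigh = 0;

    /** Stand-in for a SAX characters() call; a pair may be split across two calls. */
    public void characters(char[] buf, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = buf[i];
            if (pendingHigh != 0) {
                if (Character.isLowSurrogate(c)) {
                    // Pair completed, possibly across a buffer boundary.
                    writeCodePoint(Character.toCodePoint(pendingHigh, c));
                    pendingHigh = 0;
                    continue;
                }
                // High surrogate was not followed by a low surrogate.
                flushPending();
            }
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c;        // wait for the next char, maybe in the next call
            } else if (Character.isLowSurrogate(c)) {
                writeIsolated(c);       // low surrogate with no preceding high
            } else {
                out.append(c);
            }
        }
    }

    /** Any event other than characters() must also deal with a leftover high. */
    public void startElement(String name) {
        flushPending();
        out.append('<').append(name).append('>');
    }

    public void endDocument() {
        flushPending();
    }

    private void flushPending() {
        if (pendingHigh != 0) {
            writeIsolated(pendingHigh);
            pendingHigh = 0;
        }
    }

    // Assumption: astral characters are emitted as hex character references.
    private void writeCodePoint(int cp) {
        out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
    }

    // Hypothetical policy: make isolated surrogates visible instead of failing silently.
    private void writeIsolated(char c) {
        out.append("[isolated U+").append(Integer.toHexString(c).toUpperCase()).append(']');
    }

    public String result() {
        return out.toString();
    }
}
```

Feeding U+1F600 as a high surrogate at the end of one characters() call and the low surrogate at the start of the next yields a single reference to the full code point, while a high surrogate followed by startElement() is flushed as isolated. CDATA transitions would need the same flushPending() call, which is the part that makes this approach feel invasive.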

Pondering whether there is a good way to simplify/clarify this.

And I'm really not delighted with the "silent failure" of outputting isolated 
surrogates as inappropriate numeric character references, especially since 
we're inconsistent about that... see past discussion about whether that should 
be normalized and configurable.
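
For context on why those references are inappropriate: the XML 1.0 Char production excludes the surrogate range, so a reference such as &#xD83D; does not name a legal XML character. A small sketch of that check follows; the class and method names are hypothetical and this is not code from the serializer.

```java
public class XmlCharCheck {
    /**
     * True if cp is allowed by the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)       // excludes surrogates D800-DFFF
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF); // astral characters are fine
    }
}
```

A combined code point such as U+1F600 passes this test, but either half of its surrogate pair on its own does not, so a consumer that validates character references would be expected to reject the output.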

> Possible buffer-boundary issue when serializing surrogate pairs
> ---------------------------------------------------------------
>
>                 Key: XALANJ-2725
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2725
>             Project: XalanJ2
>          Issue Type: Improvement
>      Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization
>            Reporter: Joe Kesselman
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: Surrogates, escaping, unicode, utf
>         Attachments: astral-chars-split-buffer.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a 
> surrogate pair (two UTF-16 units), were not being serialized correctly. We 
> have a proposed fix for that.
> There is reported to still be an edge case when a surrogate pair which 
> crosses buffer boundaries might not be handled correctly. [~maxfortun] 
> offered what looks like a reasonable proposed fix 
> (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607),
>  but in my testing this was not serializing the surrogate pairs correctly, 
> causing regression on the tests XALANJ-2419 introduced. I don't know whether 
> that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such 
> solution.



