[ 
https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819439#comment-17819439
 ] 

Joseph Kessselman commented on XALANJ-2725:
-------------------------------------------

Added code for isolated High Surrogate. People do mishandle UTF-16 strings, 
alas, so safety nets are needed.

BUT, ISSUE: If a characters() call ending with a High Surrogate is issued 
unintentionally, we don't currently detect the isolated High Surrogate until 
the next characters() call. At that point the error indication (an illegal 
Numeric Character Reference for the surrogate) gets dumped in as the first 
non-whitespace content, rather than appearing at the end of the characters() 
block it belonged to. To do otherwise seems to require that every other event 
check `m_pendingUTF16HighSurrogate` and force it out before running its own 
logic. We could do that, but it adds overhead to every event for a rare 
error, even if the test itself is a relatively cheap one.

Feels like there ought to be a cleaner answer than that, but I'm not convinced 
there is one – other than just saying "Nope, bad data, throw exception", which 
we seem to have moved away from in past decisions....
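
For concreteness, here is a minimal sketch of the "every event checks and 
flushes" approach described above. It is not the actual ToStream code: the 
class name, the simplified event methods, and the choice to write the error 
indication as a decimal character reference are all assumptions made for 
illustration. Only the field name `m_pendingUTF16HighSurrogate` comes from the 
real code.

```java
// Hypothetical, simplified serializer; not the real org.apache.xml.serializer.ToStream.
final class PendingSurrogateSketch {
    private final StringBuilder out = new StringBuilder();

    // High surrogate held over from the end of the previous characters() call;
    // 0 means nothing is pending.
    private char m_pendingUTF16HighSurrogate = 0;

    /** Text event. A chunk may end in the middle of a surrogate pair. */
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (m_pendingUTF16HighSurrogate != 0) {
                char high = m_pendingUTF16HighSurrogate;
                m_pendingUTF16HighSurrogate = 0;
                if (Character.isLowSurrogate(c)) {
                    // Pair completed, possibly across two calls.
                    writeCharRef(Character.toCodePoint(high, c));
                    continue;
                }
                // Isolated high surrogate: emit the error indication,
                // then handle c normally below.
                writeCharRef(high);
            }
            if (Character.isHighSurrogate(c)) {
                m_pendingUTF16HighSurrogate = c; // wait for the low half
            } else if (Character.isLowSurrogate(c)) {
                writeCharRef(c); // isolated low surrogate
            } else {
                out.append(c);
            }
        }
    }

    /** Non-text events force out any pending surrogate before their own logic. */
    public void startElement(String name) {
        flushPendingHighSurrogate();
        out.append('<').append(name).append('>');
    }

    public void endElement(String name) {
        flushPendingHighSurrogate();
        out.append("</").append(name).append('>');
    }

    // The cheap test that every other event would have to repeat.
    private void flushPendingHighSurrogate() {
        if (m_pendingUTF16HighSurrogate != 0) {
            writeCharRef(m_pendingUTF16HighSurrogate);
            m_pendingUTF16HighSurrogate = 0;
        }
    }

    private void writeCharRef(int codePoint) {
        out.append("&#").append(codePoint).append(';');
    }

    public String result() {
        return out.toString();
    }
}
```

With the flush in startElement(), a high surrogate left dangling at the end of 
a characters() call is reported before the element's markup instead of inside 
the next text block. The cost is the extra check in every event method, which 
is the overhead being weighed here.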

> Possible buffer-boundary issue when serializing surrogate pairs
> ---------------------------------------------------------------
>
>                 Key: XALANJ-2725
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2725
>             Project: XalanJ2
>          Issue Type: Improvement
>      Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization
>            Reporter: Joe Kesselman
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: Surrogates, escaping, unicode, utf
>         Attachments: astral-chars-split-buffer.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a 
> surrogate pair (two UTF-16 units), were not being serialized correctly. We 
> have a proposed fix for that.
> There is reported to still be an edge case in which a surrogate pair that 
> crosses buffer boundaries might not be handled correctly. [~maxfortun] 
> offered what looks like a reasonable proposed fix 
> (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607),
>  but in my testing this was not serializing the surrogate pairs correctly, 
> causing regressions in the tests XALANJ-2419 introduced. I don't know whether 
> that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such 
> solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
