[jira] [Comment Edited] (XALANC-743) XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till out of memory

Daehyeon Kim (Jira) Mon, 01 Apr 2024 22:52:09 -0700


    [ 
https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829379#comment-17829379
 ]


Daehyeon Kim edited comment on XALANC-743 at 4/2/24 5:51 AM:
-------------------------------------------------------------

I also ran into this problem. It appears to be very similar to the issue 
reported in XALANJ-2419. When the input contains Unicode supplementary 
characters, the transformed output is corrupted. I have confirmed that 
Xalan-C++ versions 1.10 through 1.12 all have the same problem.
 
 
The problem appears with both UTF-16 and UTF-8 output encoding. With 
UTF16Writer, supplementary characters are written as broken characters such 
as "??". With UTF8Writer the result is worse: no output is produced at all. 
This points to a problem at a higher level than serialization, and the XPath 
FunctionSubstring implementation looks suspicious. UTF16Writer and UTF8Writer 
themselves are not at fault.
 
 
XPath FunctionSubstring takes a character index position as an argument. A 
surrogate pair needs to be counted as one character (as per the XPath 
Recommendation), so the character index has to be translated into a string 
buffer position that accounts for any surrogate pairs before it. Currently, 
Xalan truncates the string buffer using the character index arguments as 
they are. This can corrupt the truncated string when the string buffer 
contains a surrogate pair.
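
As a rough sketch of the idea (this is not the code from the pull request, 
and the helper names below are hypothetical, not Xalan-C++ APIs), a character 
index can be mapped to a UTF-16 buffer offset by stepping over each 
well-formed surrogate pair as a single character. The sketch assumes 
std::u16string instead of XalanDOMString and zero-based positions instead of 
XPath's one-based positions:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not part of Xalan-C++.
inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Starting at code-unit offset `from`, step forward over `nChars` characters
// and return the resulting code-unit offset. A high surrogate followed by a
// low surrogate counts as one character. Stops at the end of the string.
std::size_t advanceChars(const std::u16string& s, std::size_t from, std::size_t nChars)
{
    std::size_t offset = from;
    while (offset < s.size() && nChars > 0)
    {
        const bool pair = isHighSurrogate(s[offset]) &&
                          offset + 1 < s.size() &&
                          isLowSurrogate(s[offset + 1]);
        offset += pair ? 2 : 1;
        --nChars;
    }
    return offset;
}

// Substring by character position (zero-based `start`, `count` characters),
// never cutting a surrogate pair in half.
std::u16string substringByChars(const std::u16string& s, std::size_t start, std::size_t count)
{
    const std::size_t begin = advanceChars(s, 0, start);
    const std::size_t end   = advanceChars(s, begin, count);
    return s.substr(begin, end - begin);
}
```

With the input u"a\U0001F600b", asking for one character at position 1 
yields the full two-unit surrogate pair, while a plain code-unit substr(1, 1) 
would yield only the high surrogate.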
 
 
The same applies to this issue. If a string containing supplementary characters 
is incorrectly truncated in FunctionSubstring, a lone surrogate (surrogates 
cannot appear in UTF-8) may unexpectedly reach UTF8Writer, as the assertion 
in UTF8Writer indicates ("// We should never get a high or low surrogate 
here..."). Therefore, fixing XPath FunctionSubstring to take surrogate pairs 
into account when computing string data lengths also resolves this issue.
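
To illustrate how a lone surrogate can end up in front of the serializer, 
the following minimal sketch checks a UTF-16 string for unpaired surrogates. 
The function name is hypothetical and std::u16string stands in for 
XalanDOMString; it is only meant to show the condition, not how Xalan detects 
it:

```cpp
#include <cstddef>
#include <string>

// Hypothetical check for illustration only; not part of Xalan-C++.
// Returns true if the string contains a high surrogate that is not followed
// by a low surrogate, or a low surrogate that is not preceded by a high one.
bool hasUnpairedSurrogate(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF)
        {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            {
                return true;
            }
            ++i; // skip the low half of a well-formed pair
        }
        else if (c >= 0xDC00 && c <= 0xDFFF)
        {
            // A low surrogate without a preceding high surrogate.
            return true;
        }
    }
    return false;
}
```

For u"a\U0001F600b", the whole string passes the check, but the code-unit 
truncation substr(0, 2) ends in a high surrogate and is reported as 
malformed.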
 
 
Please refer to the pull request for the changed code.



> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till 
> out of memory
> -------------------------------------------------------------------------------------------
>
>                 Key: XALANC-743
>                 URL: https://issues.apache.org/jira/browse/XALANC-743
>             Project: XalanC
>          Issue Type: Bug
>          Components: XalanC
>    Affects Versions: 1.10
>         Environment: Linux
>            Reporter: Jiangbei Fan
>            Assignee: Steven J. Hathaway
>            Priority: Major
>
> In some rare cases, XalanTransformer::transform hangs or crashes when the 
> input/stylesheet contains 4-byte Unicode characters. I traced the root cause 
> to XalanOutputStream::transcode.
> When the transcode buffer contains a 4-byte Unicode character and the last 
> XalanDOMChar in the buffer is the first 2 bytes of that character, 
> XalanOutputStream::transcode falls into an infinite loop until it runs out 
> of memory. This is because XMLUTF8Transcoder.cpp in Xerces does not consume 
> the last 2 bytes if they are part of a 4-byte Unicode character, and 
> transcode always loops until all chars in the buffer are consumed. 
> Specifically, this happens when the last XalanDOMChar in the input buffer is 
> between 0xD800 and 0xDBFF.
> I could not find whether this issue has been reported before. This is version 
> 1.10. I do have a fix that adds a bool reference to the function, so that the 
> caller can push the last 2 bytes back to the buffer if they are not consumed, 
> but I want to check before submitting any fixes.
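
The condition described in the report can be sketched as follows. This is 
not the reporter's fix (which adds a bool reference to the function) and the 
helper name is hypothetical; it only shows one way a caller could hold back a 
trailing high surrogate so that a transcoding loop is never asked to consume 
half of a pair:

```cpp
#include <cstddef>

// Hypothetical helper for illustration only; not part of Xalan-C++ or Xerces.
// Returns how many leading UTF-16 code units of `buf` can be transcoded now.
// If the buffer ends with a high surrogate (0xD800..0xDBFF), that last unit
// is held back so it can be combined with the low surrogate that should
// arrive at the start of the next buffer.
std::size_t completeUnits(const char16_t* buf, std::size_t len)
{
    if (len > 0 && buf[len - 1] >= 0xD800 && buf[len - 1] <= 0xDBFF)
    {
        return len - 1;
    }
    return len;
}
```

A caller would transcode only the first completeUnits(buf, len) units and 
carry any remaining unit over to the next call, rather than looping until 
the whole buffer is consumed.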



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
