[
https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829379#comment-17829379
]
Daehyeon Kim edited comment on XALANC-743 at 4/2/24 5:51 AM:
-------------------------------------------------------------
I also ran into this problem. It appears to be very similar to the issue
reported in XALANJ-2419: if the input contains Unicode supplementary
characters, the transformed output is corrupted. I confirmed that Xalan-C++
versions 1.10 through 1.12 all have the same problem.
The problem occurs with both UTF-16 and UTF-8 output encoding. With
UTF16Writer, supplementary characters are converted to broken characters such
as "??". With UTF8Writer the problem is more serious: no output is produced at
all. This suggests the cause lies above the serialization layer, and the XPath
FunctionSubstring implementation looks suspicious. UTF16Writer and UTF8Writer
themselves are not at fault.
XPath FunctionSubstring takes character index positions as arguments. A
surrogate pair must be counted as one character (as per the XPath
Recommendation), so each character index has to be mapped to a string buffer
position that accounts for surrogate pairs. Currently, Xalan truncates the
string buffer using the character index arguments directly as buffer
positions. This corrupts the truncated string when the string buffer contains
a surrogate pair.
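To illustrate the index mapping described above, here is a minimal sketch in
standard C++. It is not the Xalan code and not the code from the pull request:
std::u16string stands in for the Xalan string buffer, the helper names
(advanceChars, substringByChars) are hypothetical, and the start index is
zero-based rather than following the XPath substring() argument rules.

```cpp
#include <cstddef>
#include <string>

// True if c is a high (leading) surrogate, U+D800..U+DBFF.
inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }

// True if c is a low (trailing) surrogate, U+DC00..U+DFFF.
inline bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Starting at code-unit offset `pos`, step forward over up to `count`
// characters and return the resulting code-unit offset. A well-formed
// surrogate pair counts as one character; the result never exceeds s.size().
std::size_t advanceChars(const std::u16string& s, std::size_t pos, std::size_t count)
{
    while (count > 0 && pos < s.size()) {
        const bool pair = isHighSurrogate(s[pos]) &&
                          pos + 1 < s.size() &&
                          isLowSurrogate(s[pos + 1]);
        pos += pair ? 2 : 1;
        --count;
    }
    return pos;
}

// Hypothetical helper: extract `charCount` characters starting at the
// zero-based character index `startChar`, never splitting a surrogate pair.
std::u16string substringByChars(const std::u16string& s,
                                std::size_t startChar,
                                std::size_t charCount)
{
    const std::size_t begin = advanceChars(s, 0, startChar);
    const std::size_t end = advanceChars(s, begin, charCount);
    return s.substr(begin, end - begin);
}
```

For a buffer holding "a", U+1F600, "b" (four UTF-16 code units),
substringByChars(s, 1, 1) returns both code units of U+1F600, whereas a plain
s.substr(1, 1) returns only the high surrogate.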
The same applies to this issue. If a string containing supplementary characters
is incorrectly truncated in FunctionSubstring, surrogate code units (which
cannot appear in UTF-8) may unexpectedly reach UTF8Writer (as evident in the
assertion in UTF8Writer "// We should never get a high or low surrogate
here..."). Therefore, fixing XPath FunctionSubstring to take surrogate pairs
into account when computing string positions and lengths resolves this issue
automatically.
Please refer to the pull request for the changed code.
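As an illustration only (this is not the code from the pull request), the
following sketch shows how a truncation by code units leaves a lone surrogate
in an otherwise well-formed string. It uses std::u16string in place of the
Xalan string type, and findUnpairedSurrogate is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Return the code-unit offset of the first unpaired surrogate in s, or
// std::u16string::npos if s is well-formed UTF-16.
std::size_t findUnpairedSurrogate(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                ++i; // skip the low half of a valid pair
                continue;
            }
            return i;
        }
        if (c >= 0xDC00 && c <= 0xDFFF) {
            return i; // low surrogate with no preceding high surrogate
        }
    }
    return std::u16string::npos;
}
```

With s holding "a", U+1F600, "b", findUnpairedSurrogate(s) finds nothing, but
s.substr(0, 2) ends in a lone high surrogate and s.substr(2) starts with a
lone low surrogate. Either fragment is the kind of input a UTF-8 writer does
not expect.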
> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till
> out of memory
> -------------------------------------------------------------------------------------------
>
> Key: XALANC-743
> URL: https://issues.apache.org/jira/browse/XALANC-743
> Project: XalanC
> Issue Type: Bug
> Components: XalanC
> Affects Versions: 1.10
> Environment: Linux
> Reporter: Jiangbei Fan
> Assignee: Steven J. Hathaway
> Priority: Major
>
> In some rare cases, XalanTransformer::transform gets stuck or crashes when the
> input/stylesheet contains 4-byte Unicode characters. I traced the root cause
> to XalanOutputStream::transcode.
> When the transcode buffer contains a 4-byte Unicode character and the last
> XalanDOMChar in the buffer is the first 2 bytes of that character,
> XalanOutputStream::transcode falls into an infinite loop until it runs out
> of memory. This is because XMLUTF8Transcoder.cpp in Xerces does not consume
> the last 2 bytes if they are part of a 4-byte Unicode character, while
> transcode keeps looping until all chars in the buffer are consumed.
> Specifically, this happens when the last XalanDOMChar in the input buffer is
> between 0xD800 and 0xDBFF.
> I could not find whether this issue has been reported before. This is version
> 1.10. I have a fix that adds a bool reference parameter to the function, so
> that the caller can push the last 2 bytes back to the buffer if they were not
> consumed, but I want to check before submitting any fixes.
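The carry-over idea in the quoted description can be sketched as follows. This
is a self-contained illustration in standard C++, not the Xalan or Xerces
code: encodeUtf8Chunk and transcodeChunks are hypothetical names, and
replacing an unpaired surrogate with U+FFFD is an assumption made here rather
than documented behavior of either library. The point is that the encoder
reports how many code units it consumed, and the caller keeps an unconsumed
trailing high surrogate for the next chunk instead of retrying it forever.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Append the UTF-8 encoding of code point cp to out.
static void appendUtf8(char32_t cp, std::string& out)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical stand-in for a transcoder call: encode as many complete
// characters of `in` as possible and return the number of UTF-16 code units
// consumed. A high surrogate at the very end of `in` is left unconsumed.
std::size_t encodeUtf8Chunk(const std::u16string& in, std::string& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 == in.size()) {
                break; // incomplete pair: wait for the next chunk
            }
            const char16_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD; // assumption: replace an unpaired high surrogate
                i += 1;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // assumption: replace an unpaired low surrogate
            i += 1;
        } else {
            i += 1;
        }
        appendUtf8(cp, out);
    }
    return i;
}

// Transcode chunks in order, carrying any unconsumed tail into the next chunk
// so the loop always makes progress.
std::string transcodeChunks(const std::vector<std::u16string>& chunks)
{
    std::string out;
    std::u16string pending;
    for (const std::u16string& chunk : chunks) {
        pending += chunk;
        const std::size_t used = encodeUtf8Chunk(pending, out);
        pending.erase(0, used);
    }
    if (!pending.empty()) {
        out += "\xEF\xBF\xBD"; // leftover high surrogate at end of input
    }
    return out;
}
```

Splitting U+1F600 across two chunks, with the high surrogate 0xD83D ending the
first chunk and the low surrogate 0xDE00 starting the second, still yields the
single 4-byte UTF-8 sequence F0 9F 98 80.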