Re: RFR(JDK12/JAXP/java.xml) 8207760: SAXException: Invalid UTF-16 surrogate detected: d83c ?

Joe Wang Mon, 17 Sep 2018 11:46:12 -0700

Yikes, right, the process must avoid advancing the index if the pair is written when the current char is a low surrogate.


webrev02:
http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev02/


Thanks,
Joe

On 9/17/18, 3:10 AM, Daniel Fuchs wrote:

Hi Joe,

hmmm... I'm not sure I'm completely convinced.

Let's take an example, and assume your input
has been split in two strings of 4 characters each
(the code I'm seeing implies that this is possible,
right?) In the sketch below the first line represents
the loop iterations, the second line represents
the two strings as char buffers (hs = high surrogate,
ls = low surrogate, x,y = just regular chars), the third
line represents index increments, and the fourth line
represents the result of the writeUTF16Surrogate method:

L1:   1   -  2    3        1      2    3   -
L2: [ hs  ls x    hs ]   [ ls     y    hs  ls ]
L3:   \i+2/  i++  i++      i++    i++  \i+2/
L4:   cp||0       -1       cp||-1      cp||0

We have two calls to writeAttrString: the first
call writes the first 4-char string, the second
writes the next 4 chars.

In the sketch above, i++ indicates a regular
increment from the loop, and i+2 indicates
where i++ must be called a second time within
the loop because two characters were consumed
in a single iteration.

So it seems to me that the only time you want to
increment the index a second time is when you read
(and wrote) both surrogates in one shot.

In the pathological case - where the surrogate pair
was split at the string limit, there will be two
calls to writeUTF16Surrogate(...). One with the
high surrogate, that will return -1, and one with
the low surrogate that will either return -1 or a
code point. And in that case you don't want to
increment i a second time - or the character
'y' will be skipped.
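
To make the scenario in the sketch above concrete, here is a minimal Java model of it. This is not the serializer code from the webrev: the class SplitPairDemo, the method writeSurrogate, and the guardOnHigh flag are hypothetical names introduced only for illustration, and the model assumes the method returns the code point (never 0) when a pair is written.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPairDemo {
    private char cachedHigh = 0;                 // high surrogate left over from the previous buffer
    final List<Integer> out = new ArrayList<>(); // code points "written" so far

    // Hypothetical model of writeUTF16Surrogate: returns -1 when nothing
    // is written, otherwise the code point of the pair that was written.
    int writeSurrogate(char ch, char[] chars, int i, int end) {
        if (Character.isHighSurrogate(ch)) {
            if (i + 1 < end && Character.isLowSurrogate(chars[i + 1])) {
                int cp = Character.toCodePoint(ch, chars[i + 1]);
                out.add(cp);
                return cp;       // both surrogates were read from this buffer
            }
            cachedHigh = ch;     // simplification: treat as a pair split at the buffer end
            return -1;
        }
        if (cachedHigh != 0) {   // low surrogate completing a cached high surrogate
            int cp = Character.toCodePoint(cachedHigh, ch);
            cachedHigh = 0;
            out.add(cp);
            return cp;           // only ONE char of this buffer was consumed
        }
        return -1;
    }

    void write(char[] chars, boolean guardOnHigh) {
        int end = chars.length;
        for (int i = 0; i < end; i++) {
            char ch = chars[i];
            if (Character.isSurrogate(ch)) {
                int r = writeSurrogate(ch, chars, i, end);
                // Extra increment: always when r >= 0, or only when ch was the high surrogate.
                boolean extra = guardOnHigh ? (r >= 0 && Character.isHighSurrogate(ch)) : (r >= 0);
                if (extra) {
                    i++;
                }
            } else {
                out.add((int) ch);
            }
        }
    }

    // Feeds [hs ls x hs] and then [ls y hs ls], as in the sketch.
    static List<Integer> run(boolean guardOnHigh) {
        SplitPairDemo d = new SplitPairDemo();
        char[] p = Character.toChars(0x1F300);   // p[0] = d83c (hs), p[1] = df00 (ls)
        d.write(new char[] {p[0], p[1], 'x', p[0]}, guardOnHigh);
        d.write(new char[] {p[1], 'y', p[0], p[1]}, guardOnHigh);
        return d.out;
    }

    public static void main(String[] args) {
        System.out.println("guarded:   " + run(true));
        System.out.println("unguarded: " + run(false));
    }
}
```

In this model the guarded run emits five code points, including 'y', while the unguarded run emits only four because 'y' is stepped over after the split pair is completed.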

Am I still missing something?

best regards,

-- daniel


On 14/09/2018 18:08, Joe Wang wrote:
Hi Daniel,
The additional advance is made when the pair has already been written, which is indicated by the return value of writeUTF16Surrogate being >= 0*. Encodings.isHighUTF16Surrogate(ch) therefore would be redundant.
* Note that the return value is -1 when nothing is written and 0 when the pair is written, so the condition >= 0 means that, whether or not there is a codepoint value, the index is incremented as long as the pair is written (the low surrogate is consumed).
Best,
Joe

On 9/14/18, 3:00 AM, Daniel Fuchs wrote:
Hi Joe,

Thanks for doing that - it's a far better solution.

ToHTMLStream.java:

1432    // move the index if the surrogate pair has been written.
1433    if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
1434        i++;
1435    }

shouldn't this be:

1433    if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
            if (Encodings.isHighUTF16Surrogate(ch)) {
                // two input characters processed, increase
                // the index again.
                i++;
            }
        }

IIUC you only want to increase the index if the ch was the
high surrogate and the function has advanced to the low
surrogate?

I mean - a codepoint could have been returned if ch was the
low surrogate, and in that case you don't want to increase
the index twice as only one character has been consumed.

I guess there's the same issue in ToStream.java at lines
1154-1156 and in ToTextStream.java at line 303...

Or am I missing something?

best regards,

-- daniel


On 14/09/2018 04:13, Joe Wang wrote:
Thanks Daniel.
I changed the return of writeUTF16Surrogate so that it is clearer within writeUTF16Surrogate when an additional index increment is needed. Other corresponding changes are in ToHTMLStream and ToTextStream where similar calls to the method are made. It's been an issue in some part of JAXP impl where error or warning messages are printed out to the console (e.g. JDK-8000621). But I kept it as is in ToTextStream for this patch.
webrev01:
http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev01/

Best,
Joe

On 9/13/18, 2:23 AM, Daniel Fuchs wrote:
Hi Joe,

On 13/09/2018 00:25, Lance Andersen wrote:
Hi Joe,

The change seems reasonable

Agreed. However, the following condition in ToStream::handleEscaping
is a bit cryptic:

1155             if ((ihs && (i + 1 < end)) || (ils && i != 0)) {
1156                 i++ ; // process two input characters
1157             }

could the comment be fleshed out to explain it?

I suspect that: `(ihs && (i + 1 < end))` means that
`writeUTF16Surrogate(c, ch, i, end);` has written the two surrogates, in
which case i should be incremented in order to skip the low surrogate
which has just been written.

I am not sure what `(ils && i != 0)` means, though...

best regards

-- daniel
On Sep 12, 2018, at 2:11 PM, Joe Wang <[email protected]> wrote:
Hi,
Please review a patch for a situation where a surrogate pair is at the edge of a buffer. What the existing impl did was to report it as an error. This patch fixes it by caching the high surrogate and printing it out along with the low surrogate. A similar issue also exists in the CDATA section and is fixed in this patch. The CDATA impl had a couple of bugs where an indent could be written inside the CDATA and a Unicode character written in between two CDATA sections. Both are fixed in this patch.
JBS: https://bugs.openjdk.java.net/browse/JDK-8207760
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev/
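
The caching idea described in this request can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch itself: the class SurrogateCache, its pendingHigh field, and the codePoints list are hypothetical stand-ins for the serializer's state and output, and malformed input is simply rejected with an exception.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateCache {
    // Hypothetical state: a high surrogate seen as the last char of the previous buffer.
    private char pendingHigh = 0;
    // Hypothetical output: the code points that would be written.
    final List<Integer> codePoints = new ArrayList<>();

    void write(char[] chars, int start, int length) {
        int end = start + length;
        for (int i = start; i < end; i++) {
            char ch = chars[i];
            if (pendingHigh != 0) {
                // A cached high surrogate must be completed by the first char of this buffer.
                if (!Character.isLowSurrogate(ch)) {
                    throw new IllegalArgumentException("unpaired high surrogate");
                }
                codePoints.add(Character.toCodePoint(pendingHigh, ch));
                pendingHigh = 0;
                continue; // only one char of this buffer was consumed
            }
            if (Character.isHighSurrogate(ch)) {
                if (i + 1 == end) {
                    pendingHigh = ch; // pair is split at the buffer edge: cache, do not report an error
                } else if (Character.isLowSurrogate(chars[i + 1])) {
                    codePoints.add(Character.toCodePoint(ch, chars[i + 1]));
                    i++; // two chars of this buffer were consumed
                } else {
                    throw new IllegalArgumentException("unpaired high surrogate");
                }
            } else if (Character.isLowSurrogate(ch)) {
                throw new IllegalArgumentException("unpaired low surrogate");
            } else {
                codePoints.add((int) ch);
            }
        }
    }

    public static void main(String[] args) {
        SurrogateCache w = new SurrogateCache();
        char[] p = Character.toChars(0x1F300); // d83c df00
        w.write(new char[] {'a', p[0]}, 0, 2); // buffer ends with the high surrogate
        w.write(new char[] {p[1], 'b'}, 0, 2); // next buffer starts with the low surrogate
        System.out.println(w.codePoints);
    }
}
```

The extra index increment happens only in the branch where both surrogates come from the same buffer; completing a cached high surrogate consumes a single char.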

Thanks,
Joe
Lance Andersen | Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
[email protected] <mailto:[email protected]>

Re: RFR(JDK12/JAXP/java.xml) 8207760: SAXException: Invalid UTF-16 surrogate detected: d83c ?
