Re: RFR: 8311906: Improve robustness of String constructors with mutable array inputs

Roger Riggs Wed, 08 Nov 2023 12:33:40 -0800

On Wed, 8 Nov 2023 16:47:17 GMT, Raffaello Giulietti <[email protected]> 
wrote:


>> Strings, after construction, are immutable but may be constructed from 
>> mutable arrays of bytes, characters, or integers.
>> The string constructors should guard against the effects of mutating the 
>> arrays during construction that might invalidate internal invariants for the 
>> correct behavior of operations on the resulting strings. In particular, a 
>> number of operations have optimizations for operations on pairs of latin1 
>> strings and pairs of non-latin1 strings, while operations between latin1 and 
>> non-latin1 strings use a more general implementation. 
>> 
>> The changes include:
>> 
>> - Adding a warning to each constructor with an array as an argument to 
>> indicate that the results are indeterminate 
>>   if the input array is modified before the constructor returns. 
>>   The resulting string may contain any combination of characters sampled 
>> from the input array.
>> 
>> - Ensure that strings that are represented as non-latin1 contain at least 
>> one non-latin1 character.
>>   For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or 
>> another encoding decoded to latin1 the scanning and compression is unchanged.
>>   If a non-latin1 character is found, the string is represented as 
>> non-latin1 with the added verification that a non-latin1 character is 
>> present at the same index.
>>   If that character is found to be latin1, then the input array has been 
>> modified and the result of the scan may be incorrect.
>>   Though a ConcurrentModificationException could be thrown, the risk to an 
>> existing application of an unexpected exception should be avoided.
>>   Instead, the non-latin1 copy of the input is re-scanned and compressed; 
>> that scan determines whether the latin1 or the non-latin1 representation is 
>> returned.
>> 
>> - The methods that scan for non-latin1 characters and their intrinsic 
>> implementations are updated to return the index of the non-latin1 character.
>> 
>> - String construction from StringBuilder and CharSequence must also be 
>> guarded as their contents may be modified during construction.
>
> src/java.base/share/classes/java/lang/StringUTF16.java line 214:
> 
>> 212:             }
>> 213:         }
>> 214:         return latin1;     // latin1 success
> 
> The original version of this `public` method can return `null` to signal 
> failure to compress. Does this change impact callers that might expect `null`?

It is public but in a package private class; all of the uses have been updated.

> src/java.base/share/classes/java/lang/StringUTF16.java line 232:
> 
>> 230:         int ndx = compress(val, off, latin1, 0, count);
>> 231:         if (ndx != count) {// Switch to UTF16
>> 232:             byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off + 
>> count) << 1);
> 
> Not sure if the left shifts do not overflow on this `public` method. If that 
> happens, the outcomes could be non-negative, so the copy would succeed but be 
> kind of corrupted.

These deserve the same kind of check as used in StringUTF16.newBytesFor(len).

> src/java.base/share/classes/java/lang/StringUTF16.java line 411:
> 
>> 409:             return 2;
>> 410:         } else
>> 411:             throw new 
>> IllegalArgumentException(Integer.toString(codePoint));
> 
> Maybe `Character.charCount()` can be used here, although it returns 2 even 
> for invalid codepoints.

The check and exception is specified in the constructor `public String(int[] 
codePoints, int offset, int count)`
so its needed in at least one pass over the input.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387164695
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387163869
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387167119

Re: RFR: 8311906: Improve robustness of String constructors with mutable array inputs

Reply via email to