On Mon, 30 Oct 2023 18:34:44 GMT, Roger Riggs <[email protected]> wrote:
> Strings, after construction, are immutable but may be constructed from
> mutable arrays of bytes, characters, or integers.
> The string constructors should guard against the effects of mutating the
> arrays during construction that might invalidate internal invariants for the
> correct behavior of operations on the resulting strings. In particular, a
> number of operations have optimizations for operations on pairs of latin1
> strings and pairs of non-latin1 strings, while operations between latin1 and
> non-latin1 strings use a more general implementation.
>
> The changes include:
>
> - Adding a warning to each constructor with an array as an argument to
> indicate that the results are indeterminate
> if the input array is modified before the constructor returns.
> The resulting string may contain any combination of characters sampled from
> the input array.
>
> - Ensure that strings that are represented as non-latin1 contain at least one
> non-latin1 character.
> For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or
> another encoding decoded to latin1 the scanning and compression is unchanged.
> If a non-latin1 character is found, the string is represented as non-latin1
> with the added verification that a non-latin1 character is present at the
> same index.
> If that character is found to be latin1, then the input array has been
> modified and the result of the scan may be incorrect.
> Though a ConcurrentModificationException could be thrown, the risk to an
> existing application of an unexpected exception should be avoided.
> Instead, the non-latin1 copy of the input is re-scanned and compressed;
> that scan determines whether the latin1 or the non-latin1 representation is
> returned.
>
> - The methods that scan for non-latin1 characters and their intrinsic
> implementations are updated to return the index of the non-latin1 character.
>
> - String construction from StringBuilder and CharSequence must also be
> guarded as their contents may be modified during construction.
First take ;-)
More will follow.
src/java.base/share/classes/java/lang/String.java line 602:
> 600: }
> 601: this.value = utf16;
> 602: this.coder = (utf16.length == dp) ? LATIN1 : UTF16;
Is it possible to have `utf16.length == dp` here? I think the `coder` can only
be `UTF16`.
src/java.base/share/classes/java/lang/StringLatin1.java line 37:
> 35: import jdk.internal.util.ArraysSupport;
> 36: import jdk.internal.util.DecimalDigits;
> 37: import jdk.internal.vm.annotation.ForceInline;
This annotation does not seem to be used.
src/java.base/share/classes/java/lang/StringUTF16.java line 158:
> 156: * {@return an encoded byte[] for the UTF16 characters in char[]}
> 157: * **Only** use this if it is known that at least one character is
> UTF16.
> 158: * Otherwise, an untrusted char array may have racy contents and
> really be latin1.
While this is a good advice, it turns out that the `compress()` method below
invokes this method _without_ knowing for sure if the provided `value` contains
at least one non-latin1 `char` when this method is invoked or while it runs: it
only _assumes_ so, and indeed prudentially checks afterwards.
It should be suggested to check the result after invoking this method if
`value` is untrusted.
src/java.base/share/classes/java/lang/StringUTF16.java line 198:
> 196: * @param val a char array
> 197: * @param off starting offset
> 198: * @param count length of chars to be compressed, length > 0
Suggestion:
* @param count count of chars to be compressed, {@code count} > 0
src/java.base/share/classes/java/lang/StringUTF16.java line 214:
> 212: }
> 213: }
> 214: return latin1; // latin1 success
The original version of this `public` method can return `null` to signal
failure to compress. Does this change impact callers that might expect `null`?
src/java.base/share/classes/java/lang/StringUTF16.java line 226:
> 224: * @param val a byte array with UTF16 coding
> 225: * @param off starting offset
> 226: * @param count length of chars to be compressed, length > 0
Suggestion:
* @param count count of chars to be compressed, {@code count} > 0
src/java.base/share/classes/java/lang/StringUTF16.java line 232:
> 230: int ndx = compress(val, off, latin1, 0, count);
> 231: if (ndx != count) {// Switch to UTF16
> 232: byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off +
> count) << 1);
Not sure if the left shifts do not overflow on this `public` method. If that
happens, the outcomes could be non-negative, so the copy would succeed but be
kind of corrupted.
src/java.base/share/classes/java/lang/StringUTF16.java line 240:
> 238: }
> 239: }
> 240: return latin1; // latin1 success
See the `compress()` above for a remark on `null` as a return value.
src/java.base/share/classes/java/lang/StringUTF16.java line 319:
> 317: int codePoint = val[off]; // read each codepoint
> from val only once
> 318: int dstLimit = dstOff
> 319: + (Character.isBmpCodePoint(codePoint) ? 1 : 2)
Suggestion:
+ Character.charCount(codePoint)
This method was introduced in 1.5, so should be safe to use even for backports.
src/java.base/share/classes/java/lang/StringUTF16.java line 411:
> 409: return 2;
> 410: } else
> 411: throw new
> IllegalArgumentException(Integer.toString(codePoint));
Maybe `Character.charCount()` can be used here, although it returns 2 even for
invalid codepoints.
-------------
PR Review: https://git.openjdk.org/jdk/pull/16425#pullrequestreview-1720680695
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386992886
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386829351
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386885841
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905655
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386923763
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905964
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386915873
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386927514
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386951908
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386997067