On Mon, 30 Oct 2023 18:34:44 GMT, Roger Riggs <rri...@openjdk.org> wrote:
> Strings, after construction, are immutable but may be constructed from > mutable arrays of bytes, characters, or integers. > The string constructors should guard against the effects of mutating the > arrays during construction that might invalidate internal invariants for the > correct behavior of operations on the resulting strings. In particular, a > number of operations have optimizations for operations on pairs of latin1 > strings and pairs of non-latin1 strings, while operations between latin1 and > non-latin1 strings use a more general implementation. > > The changes include: > > - Adding a warning to each constructor with an array as an argument to > indicate that the results are indeterminate > if the input array is modified before the constructor returns. > The resulting string may contain any combination of characters sampled from > the input array. > > - Ensure that strings that are represented as non-latin1 contain at least one > non-latin1 character. > For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or > another encoding decoded to latin1 the scanning and compression is unchanged. > If a non-latin1 character is found, the string is represented as non-latin1 > with the added verification that a non-latin1 character is present at the > same index. > If that character is found to be latin1, then the input array has been > modified and the result of the scan may be incorrect. > Though a ConcurrentModificationException could be thrown, the risk to an > existing application of an unexpected exception should be avoided. > Instead, the non-latin1 copy of the input is re-scanned and compressed; > that scan determines whether the latin1 or the non-latin1 representation is > returned. > > - The methods that scan for non-latin1 characters and their intrinsic > implementations are updated to return the index of the non-latin1 character. > > - String construction from StringBuilder and CharSequence must also be > guarded as their contents may be modified during construction. First take ;-) More will follow. src/java.base/share/classes/java/lang/String.java line 602: > 600: } > 601: this.value = utf16; > 602: this.coder = (utf16.length == dp) ? LATIN1 : UTF16; Is it possible to have `utf16.length == dp` here? I think the `coder` can only be `UTF16`. src/java.base/share/classes/java/lang/StringLatin1.java line 37: > 35: import jdk.internal.util.ArraysSupport; > 36: import jdk.internal.util.DecimalDigits; > 37: import jdk.internal.vm.annotation.ForceInline; This annotation does not seem to be used. src/java.base/share/classes/java/lang/StringUTF16.java line 158: > 156: * {@return an encoded byte[] for the UTF16 characters in char[]} > 157: * **Only** use this if it is known that at least one character is > UTF16. > 158: * Otherwise, an untrusted char array may have racy contents and > really be latin1. While this is a good advice, it turns out that the `compress()` method below invokes this method _without_ knowing for sure if the provided `value` contains at least one non-latin1 `char` when this method is invoked or while it runs: it only _assumes_ so, and indeed prudentially checks afterwards. It should be suggested to check the result after invoking this method if `value` is untrusted. src/java.base/share/classes/java/lang/StringUTF16.java line 198: > 196: * @param val a char array > 197: * @param off starting offset > 198: * @param count length of chars to be compressed, length > 0 Suggestion: * @param count count of chars to be compressed, {@code count} > 0 src/java.base/share/classes/java/lang/StringUTF16.java line 214: > 212: } > 213: } > 214: return latin1; // latin1 success The original version of this `public` method can return `null` to signal failure to compress. Does this change impact callers that might expect `null`? src/java.base/share/classes/java/lang/StringUTF16.java line 226: > 224: * @param val a byte array with UTF16 coding > 225: * @param off starting offset > 226: * @param count length of chars to be compressed, length > 0 Suggestion: * @param count count of chars to be compressed, {@code count} > 0 src/java.base/share/classes/java/lang/StringUTF16.java line 232: > 230: int ndx = compress(val, off, latin1, 0, count); > 231: if (ndx != count) {// Switch to UTF16 > 232: byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off + > count) << 1); Not sure if the left shifts do not overflow on this `public` method. If that happens, the outcomes could be non-negative, so the copy would succeed but be kind of corrupted. src/java.base/share/classes/java/lang/StringUTF16.java line 240: > 238: } > 239: } > 240: return latin1; // latin1 success See the `compress()` above for a remark on `null` as a return value. src/java.base/share/classes/java/lang/StringUTF16.java line 319: > 317: int codePoint = val[off]; // read each codepoint > from val only once > 318: int dstLimit = dstOff > 319: + (Character.isBmpCodePoint(codePoint) ? 1 : 2) Suggestion: + Character.charCount(codePoint) This method was introduced in 1.5, so should be safe to use even for backports. src/java.base/share/classes/java/lang/StringUTF16.java line 411: > 409: return 2; > 410: } else > 411: throw new > IllegalArgumentException(Integer.toString(codePoint)); Maybe `Character.charCount()` can be used here, although it returns 2 even for invalid codepoints. ------------- PR Review: https://git.openjdk.org/jdk/pull/16425#pullrequestreview-1720680695 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386992886 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386829351 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386885841 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905655 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386923763 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905964 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386915873 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386927514 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386951908 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386997067