On Wed, 8 Nov 2023 16:47:17 GMT, Raffaello Giulietti <rgiulie...@openjdk.org> wrote:
>> Strings, after construction, are immutable but may be constructed from >> mutable arrays of bytes, characters, or integers. >> The string constructors should guard against the effects of mutating the >> arrays during construction that might invalidate internal invariants for the >> correct behavior of operations on the resulting strings. In particular, a >> number of operations have optimizations for operations on pairs of latin1 >> strings and pairs of non-latin1 strings, while operations between latin1 and >> non-latin1 strings use a more general implementation. >> >> The changes include: >> >> - Adding a warning to each constructor with an array as an argument to >> indicate that the results are indeterminate >> if the input array is modified before the constructor returns. >> The resulting string may contain any combination of characters sampled >> from the input array. >> >> - Ensure that strings that are represented as non-latin1 contain at least >> one non-latin1 character. >> For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or >> another encoding decoded to latin1 the scanning and compression is unchanged. >> If a non-latin1 character is found, the string is represented as >> non-latin1 with the added verification that a non-latin1 character is >> present at the same index. >> If that character is found to be latin1, then the input array has been >> modified and the result of the scan may be incorrect. >> Though a ConcurrentModificationException could be thrown, the risk to an >> existing application of an unexpected exception should be avoided. >> Instead, the non-latin1 copy of the input is re-scanned and compressed; >> that scan determines whether the latin1 or the non-latin1 representation is >> returned. >> >> - The methods that scan for non-latin1 characters and their intrinsic >> implementations are updated to return the index of the non-latin1 character. >> >> - String construction from StringBuilder and CharSequence must also be >> guarded as their contents may be modified during construction. > > src/java.base/share/classes/java/lang/StringUTF16.java line 214: > >> 212: } >> 213: } >> 214: return latin1; // latin1 success > > The original version of this `public` method can return `null` to signal > failure to compress. Does this change impact callers that might expect `null`? It is public but in a package private class; all of the uses have been updated. > src/java.base/share/classes/java/lang/StringUTF16.java line 232: > >> 230: int ndx = compress(val, off, latin1, 0, count); >> 231: if (ndx != count) {// Switch to UTF16 >> 232: byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off + >> count) << 1); > > Not sure if the left shifts do not overflow on this `public` method. If that > happens, the outcomes could be non-negative, so the copy would succeed but be > kind of corrupted. These deserve the same kind of check as used in StringUTF16.newBytesFor(len). > src/java.base/share/classes/java/lang/StringUTF16.java line 411: > >> 409: return 2; >> 410: } else >> 411: throw new >> IllegalArgumentException(Integer.toString(codePoint)); > > Maybe `Character.charCount()` can be used here, although it returns 2 even > for invalid codepoints. The check and exception is specified in the constructor `public String(int[] codePoints, int offset, int count)` so its needed in at least one pass over the input. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387164695 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387163869 PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387167119