On Sat, 4 Nov 2023 00:07:33 GMT, Chen Liang <li...@openjdk.org> wrote:

>> Strings, after construction, are immutable but may be constructed from 
>> mutable arrays of bytes, characters, or integers.
>> The string constructors should guard against the effects of mutating the 
>> arrays during construction that might invalidate internal invariants for the 
>> correct behavior of operations on the resulting strings. In particular, a 
>> number of operations have optimizations for operations on pairs of latin1 
>> strings and pairs of non-latin1 strings, while operations between latin1 and 
>> non-latin1 strings use a more general implementation. 
>> 
>> The changes include:
>> 
>> - Adding a warning to each constructor with an array as an argument to 
>> indicate that the results are indeterminate 
>>   if the input array is modified before the constructor returns. 
>>   The resulting string may contain any combination of characters sampled 
>> from the input array.
>> 
>> - Ensure that strings that are represented as non-latin1 contain at least 
>> one non-latin1 character.
>>   For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or 
>> another encoding decoded to latin1 the scanning and compression is unchanged.
>>   If a non-latin1 character is found, the string is represented as 
>> non-latin1 with the added verification that a non-latin1 character is 
>> present at the same index.
>>   If that character is found to be latin1, then the input array has been 
>> modified and the result of the scan may be incorrect.
>>   Though a ConcurrentModificationException could be thrown, the risk to an 
>> existing application of an unexpected exception should be avoided.
>>   Instead, the non-latin1 copy of the input is re-scanned and compressed; 
>> that scan determines whether the latin1 or the non-latin1 representation is 
>> returned.
>> 
>> - The methods that scan for non-latin1 characters and their intrinsic 
>> implementations are updated to return the index of the non-latin1 character.
>> 
>> - String construction from StringBuilder and CharSequence must also be 
>> guarded as their contents may be modified during construction.
>
> src/java.base/share/classes/java/lang/StringUTF16.java line 202:
> 
>> 200:     @ForceInline
>> 201:     public static byte[] compress(final char[] val, final int off, 
>> final int count) {
>> 202:         byte[] latin1 = new byte[count];
> 
> Will this redundant array allocation be costly if we are working with 
> mostly-utf16 strings, such as CJK strings with no latin characters?
> 
> I suggest we can use a heuristic to read the initial char; if it's utf16 then 
> we skip the latin-1 process altogether (and we can assign the utf16 value to 
> the initial index to ensure it's non-latin-1 compressible.

We can reconsider this design as a separate PR. 
Every additional check has a performance impact and in this bug the goal is to 
avoid any regression.

We'll need to gain some insight into the distribution of strings when used in a 
non-latin1 application.
How many of the strings are latin1 vs non-latin1, what is the distribution of 
string lengths and which APIs are in use in the applications.  The 
implementation is already pretty good about working with strings of different 
coders
but there may be some different choices when converting between char arrays and 
int arrays and strings.

> src/java.base/share/classes/java/lang/StringUTF16.java line 411:
> 
>> 409:             return 2;
>> 410:         } else
>> 411:             throw new 
>> IllegalArgumentException(Integer.toString(codePoint));
> 
> `toHexString` might be more informative.

Perhaps, but changing the exception text is out scope for this PR; it has been 
decimal since JDK 9 (2015).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1383521445
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1383527621

Reply via email to