Re: RFR: 8311906: Improve robustness of String constructors with mutable array inputs [v2]

Amit Kumar Mon, 13 Nov 2023 03:26:15 -0800

On Thu, 9 Nov 2023 04:16:25 GMT, Roger Riggs wrote:


>> Strings, after construction, are immutable but may be constructed from 
>> mutable arrays of bytes, characters, or integers.
>> The string constructors should guard against the effects of mutating the 
>> arrays during construction that might invalidate internal invariants for the 
>> correct behavior of operations on the resulting strings. In particular, a 
>> number of operations have optimizations for operations on pairs of latin1 
>> strings and pairs of non-latin1 strings, while operations between latin1 and 
>> non-latin1 strings use a more general implementation. 
>> 
>> The changes include:
>> 
>> - Adding a warning to each constructor with an array as an argument to 
>> indicate that the results are indeterminate 
>>   if the input array is modified before the constructor returns. 
>>   The resulting string may contain any combination of characters sampled 
>> from the input array.
>> 
>> - Ensure that strings that are represented as non-latin1 contain at least 
>> one non-latin1 character.
>>   For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or 
>> another encoding decoded to latin1 the scanning and compression is unchanged.
>>   If a non-latin1 character is found, the string is represented as 
>> non-latin1 with the added verification that a non-latin1 character is 
>> present at the same index.
>>   If that character is found to be latin1, then the input array has been 
>> modified and the result of the scan may be incorrect.
>>   Though a ConcurrentModificationException could be thrown, the risk to an 
>> existing application of an unexpected exception should be avoided.
>>   Instead, the non-latin1 copy of the input is re-scanned and compressed; 
>> that scan determines whether the latin1 or the non-latin1 representation is 
>> returned.
>> 
>> - The methods that scan for non-latin1 characters and their intrinsic 
>> implementations are updated to return the index of the non-latin1 character.
>> 
>> - String construction from StringBuilder and CharSequence must also be 
>> guarded as their contents may be modified during construction.
>
> Roger Riggs has updated the pull request incrementally with three additional 
> commits since the last revision:
> 
>  - Refactored extractCodePoints to avoid multiple resizes if the array was 
> modified
>  - Replaced isLatin1 implementation with `getChar(buf, ndx) <= 0xff`
>    It performs better than the single byte array access by avoiding the 
> bounds check.
>  - Misc updates for review comments, javadoc cleanup
>    Extra checking on maximum string lengths when calling toBytes().

Please include the s390 port in this change as well:


diff --git a/src/hotspot/cpu/s390/s390.ad b/src/hotspot/cpu/s390/s390.ad
index ffac6b70a58..61b6a6a5906 100644
--- a/src/hotspot/cpu/s390/s390.ad
+++ b/src/hotspot/cpu/s390/s390.ad
@@ -10190,7 +10190,7 @@ instruct string_compress(iRegP src, iRegP dst, iRegI result, iRegI len, iRegI tm
   format %{ "String Compress $src->$dst($len) -> $result" %}
   ins_encode %{
     __ string_compress($result$$Register, $src$$Register, $dst$$Register, $len$$Register,
-                       $tmp$$Register, false, false);
+                       $tmp$$Register, true, false);
   %}
   ins_pipe(pipe_class_dummy);
 %}
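
For readers less familiar with the Java side, the compress, verify, re-scan flow from the quoted description can be sketched roughly as follows. This is illustrative Java, not the code in the PR: `CompressSketch`, `compress`, and `toStorage` are hypothetical names, and the `byte[]` / `char[]` return value only stands in for the latin1 and non-latin1 representations. It also assumes that the argument changed in the diff above makes the s390 intrinsic report the exact index of the first non-latin1 character, which is what the check at `copy[ndx]` depends on.

```java
import java.util.Arrays;

// Hypothetical sketch; names and structure are assumptions, not the JDK implementation.
class CompressSketch {

    // Copies chars into dst as single bytes until a char above 0xFF is seen.
    // Returns the index of that char, or len if every char fit in one byte.
    static int compress(char[] src, byte[] dst, int len) {
        for (int i = 0; i < len; i++) {
            char c = src[i];
            if (c > 0xFF) {
                return i;
            }
            dst[i] = (byte) c;
        }
        return len;
    }

    // Returns a byte[] (latin1 stand-in) or a char[] (non-latin1 stand-in).
    static Object toStorage(char[] input) {
        int len = input.length;
        byte[] latin1 = new byte[len];
        int ndx = compress(input, latin1, len);
        if (ndx == len) {
            return latin1;                    // everything fit in latin1
        }
        // Take a private snapshot; the caller's array may still be changing.
        char[] copy = Arrays.copyOf(input, len);
        if (copy[ndx] > 0xFF) {
            return copy;                      // verified: non-latin1 char at the reported index
        }
        // The input was modified during the scan, so re-scan the private copy
        // and let that scan decide the representation.
        ndx = compress(copy, latin1, len);
        return (ndx == len) ? latin1 : copy;
    }

    public static void main(String[] args) {
        System.out.println(toStorage("abc".toCharArray()) instanceof byte[]);
        System.out.println(toStorage(new char[] {'a', '\u0100'}) instanceof char[]);
    }
}
```

The re-scan runs on the private copy rather than the caller's array, so its result cannot be invalidated by further concurrent modification, and no exception has to be thrown.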

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16425#issuecomment-1807971207
