On Fri, 10 Nov 2023 14:59:57 GMT, Raffaello Giulietti <rgiulie...@openjdk.org> 
wrote:

>> Roger Riggs has updated the pull request incrementally with three additional 
>> commits since the last revision:
>> 
>>  - Refactored extractCodePoints to avoid multiple resizes if the array was 
>> modified
>>  - Replaced isLatin1 implementation with `getChar(buf, ndx) <= 0xff`
>>    It performs better than the single byte array access by avoiding the 
>> bounds check.
>>  - Misc updates for review comments, javadoc cleanup
>>    Extra checking on maximum string lengths when calling toBytes().
>
> src/java.base/share/classes/java/lang/StringUTF16.java line 279:
> 
>> 277:             } else {
>> 278:                 // Pass 1: Compute precise size of char[]; see 
>> extractCodePoints for caveat
>> 279:                 int estSize = ndx + computeCodePointSize(val, off, end);
> 
> To avoid reallocations in `extractCodepoints()`, a way is to profit from the 
> presence of `latin1[]`, putting `latin1[i] = (byte) (cp >>> 16)`, starting 
> from `ndx`, while scanning the `val[]` in `computeCodePointSize()` to 
> preserve information about the upper bits of the codepoint.
> 
> Later, while copying the `val[]` codepoints to `utf16[]`, the info in 
> `latin1[]` is included in the `cp` just read from `val[]`, for example as in 
> `cp = (cp & 0xffff) | ((latin1[i] & 0xff) << 16)`.
> 
> The resulting codepoint would be BMP or supplementary as when it was scanned 
> during `computeCodePointSize()`, even in presence of later modifications, and 
> preserving the original value if it wasn't modified in the meantime. Since 
> the info about a codepoint needing 1 or 2 chars in `utf16[]` is preserved in 
> `latin1[]`, there should be no need for reallocations.

Interesting idea, but it might mean that if the codepoint val[i] was modified 
it could result in a cp that did not (ever) exist in the input; creating a 
value out of thin air.  The high bits would be from the first read of val[i] 
and the low bits from the 2nd read. The code in extractCodepoints could be 
simpler and computeCodePointSize just a little more comples. Creating values 
out of thin air is usually bad and could have (different) unexpected results in 
the app.

The additional writes to the latin1 array would also slow down the normal case 
of computing the size whether or not the input was modified.
On that basis, I'd keep the current approach to resizing only if needed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1389637034

Reply via email to