On Mon, 6 Nov 2023 15:30:46 GMT, Roger Riggs <rri...@openjdk.org> wrote:

>> src/java.base/share/classes/java/lang/StringUTF16.java line 202:
>> 
>>> 200:     @ForceInline
>>> 201:     public static byte[] compress(final char[] val, final int off, final int count) {
>>> 202:         byte[] latin1 = new byte[count];
>> 
>> Will this redundant array allocation be costly if we are working with 
>> mostly-utf16 strings, such as CJK strings with no latin characters?
>> 
>> I suggest we use a heuristic that reads the initial char; if it's utf16 
>> then we skip the latin-1 process altogether (and we can assign the utf16 
>> value to the initial index to ensure it's not latin-1 compressible).
>
> We can reconsider this design as a separate PR. 
> Every additional check has a performance impact and in this bug the goal is 
> to avoid any regression.
> 
> We'll need to gain some insight into the distribution of strings when used in 
> a non-latin1 application.
> How many of the strings are latin1 vs non-latin1, what is the distribution of 
> string lengths, and which APIs are in use in the applications?  The 
> implementation is already pretty good about working with strings of different 
> coders,
> but there may be some different choices when converting between char arrays 
> and int arrays and strings.

Just curious, how does the StringConstructor.newStringFromCharsMixedBegin 
benchmark change before and after this patch? It would be appreciated if we 
could see how much of an impact this has on CJK strings.
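
For reference, the first-char heuristic suggested earlier in the thread could 
look roughly like the sketch below. This is not the code in the PR or in 
StringUTF16; the class CompressSketch and the method compressOrNull are 
hypothetical names, and returning null to mean "not compressible" is an 
assumption made only for illustration.

```java
public final class CompressSketch {
    private CompressSketch() {}

    /**
     * Hypothetical helper (not the JDK's StringUTF16.compress).
     * Returns the Latin-1 bytes for val[off .. off+count), or null if
     * any char in that range does not fit in one byte.
     */
    public static byte[] compressOrNull(char[] val, int off, int count) {
        // Heuristic: peek at the first char before allocating anything.
        // A string starting with a non-Latin-1 char (e.g. CJK) bails out here.
        if (count > 0 && val[off] > 0xFF) {
            return null;
        }
        byte[] latin1 = new byte[count];
        for (int i = 0; i < count; i++) {
            char c = val[off + i];
            if (c > 0xFF) {
                // Mixed input: the allocation above is wasted in this case.
                return null;
            }
            latin1[i] = (byte) c;
        }
        return latin1;
    }

    public static void main(String[] args) {
        char[] ascii = "hello".toCharArray();
        char[] cjk = "\u4e2d\u6587".toCharArray();
        System.out.println(compressOrNull(ascii, 0, ascii.length) != null);
        System.out.println(compressOrNull(cjk, 0, cjk.length) != null);
    }
}
```

The extra branch only helps when the non-latin1 char comes first. Inputs that 
start with latin1 and turn utf16 later still pay for the allocation, which is 
the case the MixedBegin benchmark would show.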

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387693255
