On Sun, 19 Feb 2023 07:24:30 GMT, David Schlosnagle <d...@openjdk.org> wrote:

>> When encoding Strings to US-ASCII we can speed up the happy path 
>> significantly by using `StringCoding.countPositives` as a speculative check 
>> for whether there are any chars that needs to be replaced by `'?'`. Once a 
>> non-ASCII char is encountered we fall back to the slow loop and replace as 
>> needed.
>> 
>> An alternative could be unrolling or using a byte array VarHandle, as 
>> show-cased by Brett Okken here: 
>> https://mail.openjdk.org/pipermail/core-libs-dev/2023-February/100573.html 
>> Having to replace chars with `?` is essentially an encoding error so it 
>> might be safe to assume this case is exceptional in practice.
>
> src/java.base/share/classes/java/lang/String.java line 976:
> 
>> 974:     private static byte[] encodeASCII(byte coder, byte[] val) {
>> 975:         if (coder == LATIN1) {
>> 976:             byte[] dst = Arrays.copyOf(val, val.length);
> 
> Given the tweaks in https://git.openjdk.org/jdk/pull/12613 should this use 
> `val.clone()` (would skip the length check)
> 
> Suggestion:
> 
>             byte[] dst = val.clone();

Yeah, probably makes sense. On that note I found that `val.clone()` 
underperform `Arrays.copyOf(val, val.length)` in C1 compiled code: 
https://bugs.openjdk.org/browse/JDK-8302850 - while this shouldn't affect peak 
performance it might be cause for a warmup regression.

> src/java.base/share/classes/java/lang/String.java line 982:
> 
>> 980:                     if (dst[i] < 0) {
>> 981:                         dst[i] = '?';
>> 982:                     }
> 
> I'm curious if using countPositives (and vectorization) to scan forward would 
> be valuable for long (mostly ASCII) strings or if the method call 
> overhead/non-constant stride is not a win for shorter strings or heavily 
> non-ascii inputs. 
> 
> Suggestion:
> 
>                 for (int i = positives; i < dst.length; i = 
> StringCoding.countPositives(dst, i + 1, dst.length - i);) {
>                     if (dst[i] < 0) {
>                         dst[i] = '?';
>                     }

There's some overhead doing countPositives calls, and doing it in a loop might 
provoke poor worst case performance. Im sure you can find inputs for which this 
is a win, though.

-------------

PR: https://git.openjdk.org/jdk/pull/12640

Reply via email to