On Sat, 18 Feb 2023 23:26:08 GMT, Claes Redestad <redes...@openjdk.org> wrote:

> When encoding Strings to US-ASCII we can speed up the happy path 
> significantly by using `StringCoding.countPositives` as a speculative check 
> for whether there are any chars that needs to be replaced by `'?'`. Once a 
> non-ASCII char is encountered we fall back to the slow loop and replace as 
> needed.
> 
> An alternative could be unrolling or using a byte array VarHandle, as 
> show-cased by Brett Okken here: 
> https://mail.openjdk.org/pipermail/core-libs-dev/2023-February/100573.html 
> Having to replace chars with `?` is essentially an encoding error so it might 
> be safe to assume this case is exceptional in practice.

src/java.base/share/classes/java/lang/String.java line 976:

> 974:     private static byte[] encodeASCII(byte coder, byte[] val) {
> 975:         if (coder == LATIN1) {
> 976:             byte[] dst = Arrays.copyOf(val, val.length);

Given the tweaks in https://git.openjdk.org/jdk/pull/12613 should this use 
`val.clone()` (would skip the length check)

Suggestion:

            byte[] dst = val.clone();

src/java.base/share/classes/java/lang/String.java line 982:

> 980:                     if (dst[i] < 0) {
> 981:                         dst[i] = '?';
> 982:                     }

I'm curious if using countPositives (and vectorization) to scan forward would 
be valuable for long (mostly ASCII) strings or if the method call 
overhead/non-constant stride is not a win for shorter strings or heavily 
non-ascii inputs. 

Suggestion:

                for (int i = positives; i < dst.length; i = 
StringCoding.countPositives(dst, i + 1, dst.length - i);) {
                    if (dst[i] < 0) {
                        dst[i] = '?';
                    }

-------------

PR: https://git.openjdk.org/jdk/pull/12640

Reply via email to