On Sat, 18 Feb 2023 23:26:08 GMT, Claes Redestad <redes...@openjdk.org> wrote:

> When encoding Strings to US-ASCII we can speed up the happy path
> significantly by using `StringCoding.countPositives` as a speculative check
> for whether there are any chars that need to be replaced by `'?'`. Once a
> non-ASCII char is encountered we fall back to the slow loop and replace as
> needed.
>
> An alternative could be unrolling or using a byte array VarHandle, as
> showcased by Brett Okken here:
> https://mail.openjdk.org/pipermail/core-libs-dev/2023-February/100573.html
> Having to replace chars with `?` is essentially an encoding error so it might
> be safe to assume this case is exceptional in practice.

src/java.base/share/classes/java/lang/String.java line 976:

> 974:     private static byte[] encodeASCII(byte coder, byte[] val) {
> 975:         if (coder == LATIN1) {
> 976:             byte[] dst = Arrays.copyOf(val, val.length);

Given the tweaks in https://git.openjdk.org/jdk/pull/12613, should this use `val.clone()`? That would skip the length check.

Suggestion:

    byte[] dst = val.clone();

src/java.base/share/classes/java/lang/String.java line 982:

> 980:                 if (dst[i] < 0) {
> 981:                     dst[i] = '?';
> 982:                 }

I'm curious whether using `countPositives` (and vectorization) to scan forward would be valuable for long, mostly ASCII strings, or whether the method call overhead and non-constant stride would make it a loss for shorter strings or heavily non-ASCII inputs.

Suggestion:

    for (int i = positives; i < dst.length; i += 1 + StringCoding.countPositives(dst, i + 1, dst.length - i - 1)) {
        if (dst[i] < 0) {
            dst[i] = '?';
        }
    }

-------------

PR: https://git.openjdk.org/jdk/pull/12640
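
To make the scan-forward idea from the second suggestion concrete, here is a minimal standalone sketch. It rests on several assumptions: the local `countPositives` is a plain scalar stand-in for the JDK-internal `StringCoding.countPositives` (which user code cannot call and which may be intrinsified), and `AsciiReplaceSketch` and `encodeAscii` are made-up names for illustration. It shows only the loop shape, not the performance trade-off asked about above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of the JDK or of the PR.
final class AsciiReplaceSketch {

    // Scalar stand-in for the internal StringCoding.countPositives(byte[], int, int):
    // counts the non-negative bytes starting at off, stopping at the first negative one.
    static int countPositives(byte[] ba, int off, int len) {
        int limit = off + len;
        for (int i = off; i < limit; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

    // Copies Latin-1 bytes and replaces every non-ASCII (negative) byte with '?'.
    static byte[] encodeAscii(byte[] val) {
        byte[] dst = val.clone();
        int positives = countPositives(dst, 0, dst.length);
        // After handling index i, skip over the run of ASCII bytes that follows it.
        for (int i = positives; i < dst.length;
                i += 1 + countPositives(dst, i + 1, dst.length - i - 1)) {
            // Kept as a guard in case a counting routine reports fewer
            // leading positives than are actually present.
            if (dst[i] < 0) {
                dst[i] = '?';
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] latin1 = "h\u00e9llo w\u00f8rld".getBytes(StandardCharsets.ISO_8859_1);
        // prints "h?llo w?rld"
        System.out.println(new String(encodeAscii(latin1), StandardCharsets.US_ASCII));
    }
}
```

The stride of the loop depends on where the non-ASCII bytes fall, which is the "non-constant stride" concern: for inputs with many replacements, each iteration pays for a call that scans very few bytes.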