Claes, > IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 > int divs
Exactly. I think there is also some benefit from unrolling the 4 int digit pair operations. > which might not be true on all platforms, nor stay true in the future Agreed, but I am not sure how to reason about that. > Long values will in practice likely be biased towards lower values, so it’s > important that any optimization to .. longer values doesn’t regress inputs in > the int range. Since it’s all guarded by a test that is already there there > shouldn’t be much room for a difference, but adding code can cause > interesting issues so it’s always worth measuring to make sure. Have you run > any benchmark for inputs smaller than the threshold? And for a healthy mix of > values? I had been focused on "longer" values, as I have uses which are effectively guaranteed to have some bits from 53-63 set. I added some tests for "int" values as well as a mix with 2 "longs" and 3 "ints". This showed a pretty small regression for the int and mixed case. That regression went away by changing back to a while loop comparing to Integer.MIN_VALUE, and that did not give up much of the gain. private static final long[] ints = new long[] {Integer.MAX_VALUE, 12345, -5432, 654321, 5}; private static final long[] longs = new long[] {235789987235L, 235789987235326L, -98975433216803632L, Long.MAX_VALUE}; private static final long[] mixed = new long[] {5, Long.MIN_VALUE, 654321, 9876543210L, 543}; Benchmark (type) Mode Cnt Score Error Units LongToStringBenchmark.baseline int avgt 3 24.105 ┬▒ 11.751 ns/op LongToStringBenchmark.baseline long avgt 3 51.223 ┬▒ 21.069 ns/op LongToStringBenchmark.baseline mix avgt 3 34.849 ┬▒ 7.270 ns/op LongToStringBenchmark.change int avgt 3 25.846 ┬▒ 2.012 ns/op LongToStringBenchmark.change long avgt 3 43.053 ┬▒ 13.886 ns/op LongToStringBenchmark.change mix avgt 3 35.826 ┬▒ 2.901 ns/op LongToStringBenchmark.changeLoop int avgt 3 24.261 ┬▒ 7.325 ns/op LongToStringBenchmark.changeLoop long avgt 3 44.254 ┬▒ 22.921 ns/op LongToStringBenchmark.changeLoop mix avgt 3 29.501 ┬▒ 8.693 ns/op "change" is what I had proposed originally and "changeLoop" is: while (i <= Integer.MIN_VALUE) { long q = i / 100_000_000; charPos -= 8; write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i)); v = q; } Brett On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad <claes.redes...@oracle.com> wrote: > > Hi, > > IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 > int divs, which might not be true on all platforms, nor stay true in the > future. Long values will in practice likely be biased towards lower values, > so it’s important that any optimization to .. longer values doesn’t regress > inputs in the int range. Since it’s all guarded by a test that is already > there there shouldn’t be much room for a difference, but adding code can > cause interesting issues so it’s always worth measuring to make sure. Have > you run any benchmark for inputs smaller than the threshold? And for a > healthy mix of values? > > Thanks! > /Claes > > 24 apr. 2024 kl. 21:08 skrev Brett Okken <brett.okken...@gmail.com>: > > Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for > large (larger than int) long values[1]? > We can change this to work with 8 digits at a time, which reduces the amount > of 64 bit arithmetic required. > > if (i <= -1_000_000_000) { > long q = i / 100_000_000; > charPos -= 8; > write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i)); > i = q; > if (i <= -1_000_000_000) { > q = i / 100_000_000; > charPos -= 8; > write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i)); > i = q; > } > } > > > A simple implementation of write4DigitPairs would just call the existing > writeDigitPair method 4 times: > > > private static void write4DigitPairs(byte[] buf, int idx, int value) { > int v = value; > int v2 = v / 100; > writeDigitPair(buf, idx + 6, v - (v2 * 100)); > v = v2; > > v2 = v / 100; > writeDigitPair(buf, idx + 4, v - (v2 * 100)); > v = v2; > > v2 = v / 100; > writeDigitPair(buf, idx + 2, v - (v2 * 100)); > v = v2; > > v2 = v / 100; > writeDigitPair(buf, idx, v - (v2 * 100)); > } > > There is the option to OR the 4 short values together into a long and > leverage a ByteArrayLittleEndian.setLong call, but I see that the previous > usage of ByteArrayLittleEndian.setShort was removed[2]. > > A small benchmark of longs which would qualify shows up to 20% improvement. > > Presumably a similar change could make sense for StringUTF16, but I have not > spent any time benchmarking it. > > Brett > > [1] - > https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168 > [2] - > https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9 > >