Claes,
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4
> int divs
Exactly. I think there is also some benefit from unrolling the 4 int
digit pair operations.
> which might not be true on all platforms, nor stay true in the future
Agreed, but I am not sure how to reason about that.
> Long values will in practice likely be biased towards lower values, so it’s
> important that any optimization to .. longer values doesn’t regress inputs in
> the int range. Since it’s all guarded by a test that is already there there
> shouldn’t be much room for a difference, but adding code can cause
> interesting issues so it’s always worth measuring to make sure. Have you run
> any benchmark for inputs smaller than the threshold? And for a healthy mix of
> values?
I had been focused on "longer" values, as I have uses which are
effectively guaranteed to have some bits from 53-63 set.
I added some tests for "int" values as well as a mix with 2 "longs"
and 3 "ints". This showed a pretty small regression for the int and
mixed case. That regression went away by changing back to a while loop
comparing to Integer.MIN_VALUE, and that did not give up much of the
gain.
private static final long[] ints = new long[] {Integer.MAX_VALUE,
12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L,
235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE,
654321, 9876543210L, 543};
Benchmark (type) Mode Cnt ScoreError Units
LongToStringBenchmark.baseline int avgt3 24.105 ┬▒ 11.751 ns/op
LongToStringBenchmark.baseline long avgt3 51.223 ┬▒ 21.069 ns/op
LongToStringBenchmark.baseline mix avgt3 34.849 ┬▒ 7.270 ns/op
LongToStringBenchmark.change int avgt3 25.846 ┬▒ 2.012 ns/op
LongToStringBenchmark.changelong avgt3 43.053 ┬▒ 13.886 ns/op
LongToStringBenchmark.change mix avgt3 35.826 ┬▒ 2.901 ns/op
LongToStringBenchmark.changeLoop int avgt3 24.261 ┬▒ 7.325 ns/op
LongToStringBenchmark.changeLooplong avgt3 44.254 ┬▒ 22.921 ns/op
LongToStringBenchmark.changeLoop mix avgt3 29.501 ┬▒ 8.693 ns/op
"change" is what I had proposed originally and "changeLoop" is:
while (i <= Integer.MIN_VALUE) {
long q = i / 100_000_000;
charPos -= 8;
write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
v = q;
}
Brett
On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad
wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4
> int divs, which might not be true on all platforms, nor stay true in the
> future. Long values will in practice likely be biased towards lower values,
> so it’s important that any optimization to .. longer values doesn’t regress
> inputs in the int range. Since it’s all guarded by a test that is already
> there there shouldn’t be much room for a difference, but adding code can
> cause interesting issues so it’s always worth measuring to make sure. Have
> you run any benchmark for inputs smaller than the threshold? And for a
> healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl. 21:08 skrev Brett Okken :
>
> Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for
> large (larger than int) long values[1]?
> We can change this to work with 8 digits at a time, which reduces the amount
> of 64 bit arithmetic required.
>
> if (i <= -1_000_000_000) {
> long q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> if (i <= -1_000_000_000) {
> q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> }
> }
>
>
> A simple implementation of write4DigitPairs would just call the existing
> writeDigitPair method 4 times:
>
>
> private static void write4DigitPairs(byte[] buf, int idx, int value) {
> int v = value;
> int v2 = v / 100;
> writeDigitPair(buf, idx + 6, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 4, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 2, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx, v - (v2 * 100));
> }
>
> There is the option to OR the 4 short values together into a long and
> leverage a ByteArrayLittleEndian.setLong call, but I see that the previous
> usage of ByteArrayLittleEndian.setShort was removed[2].
>
> A small benchmark of longs which would qualify shows up to 20% improvement.
>
> Presumably a similar change could make sense for StringUTF16, but I have not
> spent any time benchmarking it.
>
> Brett
>
> [1] -
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
> [2] -
> https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
>
>