Claes,

> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
> int divs

Exactly. I think there is also some benefit from unrolling the 4 int
digit pair operations.

> which might not be true on all platforms, nor stay true in the future

Agreed, but I am not sure how to reason about that.

> Long values will in practice likely be biased towards lower values, so it’s 
> important that any optimization to .. longer values doesn’t regress inputs in 
> the int range. Since it’s all guarded by a test that is already there there 
> shouldn’t be much room for a difference, but adding code can cause 
> interesting issues so it’s always worth measuring to make sure. Have you run 
> any benchmark for inputs smaller than the threshold? And for a healthy mix of 
> values?

I had been focused on "longer" values, as I have uses which are
effectively guaranteed to have some bits from 53-63 set.
I added some tests for "int" values as well as a mix with 2 "longs"
and 3 "ints". This showed a pretty small regression for the int and
mixed case. That regression went away by changing back to a while loop
comparing to Integer.MIN_VALUE, and that did not give up much of the
gain.

private static final long[] ints = new long[] {Integer.MAX_VALUE,
12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L,
235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE,
654321, 9876543210L, 543};


Benchmark                         (type)  Mode  Cnt   Score    Error  Units
LongToStringBenchmark.baseline       int  avgt    3  24.105 ┬▒ 11.751  ns/op
LongToStringBenchmark.baseline      long  avgt    3  51.223 ┬▒ 21.069  ns/op
LongToStringBenchmark.baseline       mix  avgt    3  34.849 ┬▒  7.270  ns/op
LongToStringBenchmark.change         int  avgt    3  25.846 ┬▒  2.012  ns/op
LongToStringBenchmark.change        long  avgt    3  43.053 ┬▒ 13.886  ns/op
LongToStringBenchmark.change         mix  avgt    3  35.826 ┬▒  2.901  ns/op
LongToStringBenchmark.changeLoop     int  avgt    3  24.261 ┬▒  7.325  ns/op
LongToStringBenchmark.changeLoop    long  avgt    3  44.254 ┬▒ 22.921  ns/op
LongToStringBenchmark.changeLoop     mix  avgt    3  29.501 ┬▒  8.693  ns/op


"change" is what I had proposed originally and "changeLoop" is:

while (i <= Integer.MIN_VALUE) {
  long q = i / 100_000_000;
  charPos -= 8;
  write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
  v = q;
}

Brett


On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad
<claes.redes...@oracle.com> wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
> int divs, which might not be true on all platforms, nor stay true in the 
> future. Long values will in practice likely be biased towards lower values, 
> so it’s important that any optimization to .. longer values doesn’t regress 
> inputs in the int range. Since it’s all guarded by a test that is already 
> there there shouldn’t be much room for a difference, but adding code can 
> cause interesting issues so it’s always worth measuring to make sure. Have 
> you run any benchmark for inputs smaller than the threshold? And for a 
> healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl. 21:08 skrev Brett Okken <brett.okken...@gmail.com>:
>
> Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for 
> large (larger than int) long values[1]?
> We can change this to work with 8 digits at a time, which reduces the amount 
> of 64 bit arithmetic required.
>
> if (i <= -1_000_000_000) {
> long q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> if (i <= -1_000_000_000) {
> q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> }
> }
>
>
> A simple implementation of write4DigitPairs would just call the existing 
> writeDigitPair method 4 times:
>
>
> private static void write4DigitPairs(byte[] buf, int idx, int value) {
> int v = value;
> int v2 = v / 100;
> writeDigitPair(buf, idx + 6, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 4, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 2, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx, v - (v2 * 100));
> }
>
> There is the option to OR the 4 short values together into a long and 
> leverage a ByteArrayLittleEndian.setLong call, but I see that the previous 
> usage of ByteArrayLittleEndian.setShort was removed[2].
>
> A small benchmark of longs which would qualify shows up to 20% improvement.
>
> Presumably a similar change could make sense for StringUTF16, but I have not 
> spent any time benchmarking it.
>
> Brett
>
> [1] - 
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
> [2] - 
> https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
>
>

Reply via email to