Re: large longs to string

2024-04-24 Thread Brett Okken
Claes,

> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
> int divs

Exactly. I think there is also some benefit from unrolling the 4 int
digit pair operations.

> which might not be true on all platforms, nor stay true in the future

Agreed, but I am not sure how to reason about that.

> Long values will in practice likely be biased towards lower values, so it’s 
> important that any optimization to .. longer values doesn’t regress inputs in 
> the int range. Since it’s all guarded by a test that is already there there 
> shouldn’t be much room for a difference, but adding code can cause 
> interesting issues so it’s always worth measuring to make sure. Have you run 
> any benchmark for inputs smaller than the threshold? And for a healthy mix of 
> values?

I had been focused on "longer" values, as I have uses which are
effectively guaranteed to have some bits from 53-63 set.
I added some tests for "int" values as well as a mix with 2 "longs"
and 3 "ints". This showed a pretty small regression for the int and
mixed case. That regression went away by changing back to a while loop
comparing to Integer.MIN_VALUE, and that did not give up much of the
gain.

private static final long[] ints = new long[] {Integer.MAX_VALUE,
12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L,
235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE,
654321, 9876543210L, 543};


Benchmark (type)  Mode  Cnt   ScoreError  Units
LongToStringBenchmark.baseline   int  avgt3  24.105 ┬▒ 11.751  ns/op
LongToStringBenchmark.baseline  long  avgt3  51.223 ┬▒ 21.069  ns/op
LongToStringBenchmark.baseline   mix  avgt3  34.849 ┬▒  7.270  ns/op
LongToStringBenchmark.change int  avgt3  25.846 ┬▒  2.012  ns/op
LongToStringBenchmark.changelong  avgt3  43.053 ┬▒ 13.886  ns/op
LongToStringBenchmark.change mix  avgt3  35.826 ┬▒  2.901  ns/op
LongToStringBenchmark.changeLoop int  avgt3  24.261 ┬▒  7.325  ns/op
LongToStringBenchmark.changeLooplong  avgt3  44.254 ┬▒ 22.921  ns/op
LongToStringBenchmark.changeLoop mix  avgt3  29.501 ┬▒  8.693  ns/op


"change" is what I had proposed originally and "changeLoop" is:

while (i <= Integer.MIN_VALUE) {
  long q = i / 100_000_000;
  charPos -= 8;
  write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
  v = q;
}

Brett


On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad
 wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
> int divs, which might not be true on all platforms, nor stay true in the 
> future. Long values will in practice likely be biased towards lower values, 
> so it’s important that any optimization to .. longer values doesn’t regress 
> inputs in the int range. Since it’s all guarded by a test that is already 
> there there shouldn’t be much room for a difference, but adding code can 
> cause interesting issues so it’s always worth measuring to make sure. Have 
> you run any benchmark for inputs smaller than the threshold? And for a 
> healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl. 21:08 skrev Brett Okken :
>
> Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for 
> large (larger than int) long values[1]?
> We can change this to work with 8 digits at a time, which reduces the amount 
> of 64 bit arithmetic required.
>
> if (i <= -1_000_000_000) {
> long q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> if (i <= -1_000_000_000) {
> q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> }
> }
>
>
> A simple implementation of write4DigitPairs would just call the existing 
> writeDigitPair method 4 times:
>
>
> private static void write4DigitPairs(byte[] buf, int idx, int value) {
> int v = value;
> int v2 = v / 100;
> writeDigitPair(buf, idx + 6, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 4, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 2, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx, v - (v2 * 100));
> }
>
> There is the option to OR the 4 short values together into a long and 
> leverage a ByteArrayLittleEndian.setLong call, but I see that the previous 
> usage of ByteArrayLittleEndian.setShort was removed[2].
>
> A small benchmark of longs which would qualify shows up to 20% improvement.
>
> Presumably a similar change could make sense for StringUTF16, but I have not 
> spent any time benchmarking it.
>
> Brett
>
> [1] - 
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
> [2] - 
> https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
>
>


Re: large longs to string

2024-04-24 Thread Claes Redestad
Hi,

IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
int divs, which might not be true on all platforms, nor stay true in the 
future. Long values will in practice likely be biased towards lower values, so 
it’s important that any optimization to .. longer values doesn’t regress inputs 
in the int range. Since it’s all guarded by a test that is already there there 
shouldn’t be much room for a difference, but adding code can cause interesting 
issues so it’s always worth measuring to make sure. Have you run any benchmark 
for inputs smaller than the threshold? And for a healthy mix of values?

Thanks!
/Claes

24 apr. 2024 kl. 21:08 skrev Brett Okken :

Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for 
large (larger than int) long values[1]?
We can change this to work with 8 digits at a time, which reduces the amount of 
64 bit arithmetic required.

if (i <= -1_000_000_000) {
long q = i / 100_000_000;
charPos -= 8;
write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
i = q;
if (i <= -1_000_000_000) {
q = i / 100_000_000;
charPos -= 8;
write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
i = q;
}
}


A simple implementation of write4DigitPairs would just call the existing 
writeDigitPair method 4 times:


private static void write4DigitPairs(byte[] buf, int idx, int value) {
int v = value;
int v2 = v / 100;
writeDigitPair(buf, idx + 6, v - (v2 * 100));
v = v2;

v2 = v / 100;
writeDigitPair(buf, idx + 4, v - (v2 * 100));
v = v2;

v2 = v / 100;
writeDigitPair(buf, idx + 2, v - (v2 * 100));
v = v2;

v2 = v / 100;
writeDigitPair(buf, idx, v - (v2 * 100));
}

There is the option to OR the 4 short values together into a long and leverage 
a ByteArrayLittleEndian.setLong call, but I see that the previous usage of 
ByteArrayLittleEndian.setShort was removed[2].

A small benchmark of longs which would qualify shows up to 20% improvement.

Presumably a similar change could make sense for StringUTF16, but I have not 
spent any time benchmarking it.

Brett

[1] - 
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
[2] - 
https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9



large longs to string

2024-04-24 Thread Brett Okken
Is there interest in optimizing StringLatin1.getChars(long, int, byte[])
for large (larger than int) long values[1]?
We can change this to work with 8 digits at a time, which reduces the
amount of 64 bit arithmetic required.

if (i <= -1_000_000_000) {

long q = i / 100_000_000;

charPos -= 8;

write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));

i = q;

if (i <= -1_000_000_000) {

q = i / 100_000_000;

charPos -= 8;

write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));

i = q;

}

}


A simple implementation of write4DigitPairs would just call the existing
writeDigitPair method 4 times:


private static void write4DigitPairs(byte[] buf, int idx, int value) {

int v = value;

int v2 = v / 100;

writeDigitPair(buf, idx + 6, v - (v2 * 100));

v = v2;


v2 = v / 100;

writeDigitPair(buf, idx + 4, v - (v2 * 100));

v = v2;


v2 = v / 100;

writeDigitPair(buf, idx + 2, v - (v2 * 100));

v = v2;


v2 = v / 100;

writeDigitPair(buf, idx, v - (v2 * 100));

}

There is the option to OR the 4 short values together into a long and
leverage a ByteArrayLittleEndian.setLong call, but I see that the previous
usage of ByteArrayLittleEndian.setShort was removed[2].

A small benchmark of longs which would qualify shows up to 20% improvement.

Presumably a similar change could make sense for StringUTF16, but I have
not spent any time benchmarking it.

Brett

[1] -
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
[2] -
https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9