On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <[email protected]> wrote:
>> @rwestrel @galderz
>>
>>> It seems overall, we likely win more than we loose with this intrinsic, so
>>> I would integrate this change as it is and file a bug to keep track of
>>> remaining issues.
>>
>> I'm a little scared to just accept the regressions, especially for this
>> "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a
>> random order. If we take the max, then we expect the first number to be max
>> with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the
>> average branch probability is `n / (sum_i 1/i)`. This goes closer and closer
>> to zero, the larger the array. This means that the "average" case has an
>> extreme probability. And so if we do not vectorize, then this gets us a
>> regression with the current patch. And vectorization is a little fragile, it
>> only takes very little for vectorization not to kick in.
>>
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common
>>> reliably in ways that are not guaranteed otherwise.
>>
>> I suppose we could write an optimization that can hoist out loop independent
>> if-diamonds out of a loop. If the condition and all phi inputs are loop
>> invariant, you could just cut the diamond out of the loop, and paste it
>> before the loop entry.
>>
>>> Shouldn't int min/max be affected the same way?
>>
>> I think we should be able to see the same issue here, actually. Yes. Here a
>> quick benchmark below:
>>
>>
>> java -XX:CompileCommand=compileonly,TestIntMax::test*
>> -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors
>> TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation =
>> true
>> Warmup
>> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
>> 5226 94 3 TestIntMax::test1 (27 bytes)
>> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
>> 5238 96 4 TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
>> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
>> 6329 103 4 TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>>
>> That's a 4x regression on random input data!
>>
>> With:
>>
>> import java.util.Random;
>>
>> public class TestIntMax {
>> private static Random RANDOM = new Random();
>>
>> public static void main(String[] args) {
>> int[] a = new int[64 * 1024];
>> for (int i = 0; i < a.length; i++) {
>>...
>
>> I think we should be able to see the same issue here, actually. Yes. Here a
>> quick benchmark below:
>
> I observe the same:
>
>
> Warmup
> 751 3 b TestIntMax::test1 (27 bytes)
> Run
> Time: 360 550 158
> Warmup
> 1862 15 b TestIntMax::test2 (34 bytes)
> Run
> Time: 92 116 170
>
>
> But then with this:
>
>
> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
> index 8cc4a970bfd..9abda8f4178 100644
> --- a/src/hotspot/cpu/x86/x86_64.ad
> +++ b/src/hotspot/cpu/x86/x86_64.ad
> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src,
> rFlagsReg cr)
> %}
>
>
> -instruct maxI_rReg(rRegI dst, rRegI src)
> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
> %{
> match(Set dst (MaxI dst src));
> + effect(KILL cr);
>
> ins_cost(200);
> - expand %{
> - rFlagsReg cr;
> - compI_rReg(cr, dst, src);
> - cmovI_reg_l(dst, src, cr);
> + ins_encode %{
> + Label done;
> + __ cmpl($src$$Register, $dst$$Register);
> + __ jccb(Assembler::less, done);
> + __ mov($dst$$Register, $src$$Register);
> + __ bind(done);
> %}
> + ins_pipe(pipe_cmov_reg);
> %}
>
> //
> ============================================================================
>
>
> the performance gap narrows:
>
>
> Warmup
> 770 3 b TestIntMax::test1 (27 bytes)
> Run
> Time: 94 951 677
> Warmup
> 1312 15 b TestIntMax::test2 (34 bytes)
> Run
> Time: 70 053 824
>
>
> (the number of test2 fluctuates quite a bit). Does it ever make sense to
> implement `MaxI` with a conditional move then?
@rwestrel @eme64 I think that the data distribution in the `TestIntMax` above
matters (see my explanations in
https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've
enhanced the test to control data distribution in the int[] (see at the bottom).
Here are the results I see on my AVX-512 machine:
Probability: 50%
Warmup
7834 92 % b 3 TestIntMax::test1 @ 5 (27 bytes)
7836 93 b 3 TestIntMax::test1 (27 bytes)
7838 94 % b 4 TestIntMax::test1 @ 5 (27 bytes)
7851 95 b 4 TestIntMax::test1 (27 bytes)
Run
Time: 699 923 014
Warmup
9272 96 % b 3 TestIntMax::test2 @ 5 (34 bytes)
9274 97 b 3 TestIntMax::test2 (34 bytes)
9275 98 % b 4 TestIntMax::test2 @ 5 (34 bytes)
9287 99 b 4 TestIntMax::test2 (34 bytes)
Run
Time: 699 815 792
Probability: 80%
Warmup
7872 92 % b 3 TestIntMax::test1 @ 5 (27 bytes)
7874 93 b 3 TestIntMax::test1 (27 bytes)
7875 94 % b 4 TestIntMax::test1 @ 5 (27 bytes)
7889 95 b 4 TestIntMax::test1 (27 bytes)
Run
Time: 699 947 633
Warmup
9310 96 % b 3 TestIntMax::test2 @ 5 (34 bytes)
9311 97 b 3 TestIntMax::test2 (34 bytes)
9312 98 % b 4 TestIntMax::test2 @ 5 (34 bytes)
9325 99 b 4 TestIntMax::test2 (34 bytes)
Run
Time: 699 827 882
Probability: 100%
Warmup
7884 92 % b 3 TestIntMax::test1 @ 5 (27 bytes)
7886 93 b 3 TestIntMax::test1 (27 bytes)
7888 94 % b 4 TestIntMax::test1 @ 5 (27 bytes)
7901 95 b 4 TestIntMax::test1 (27 bytes)
Run
Time: 699 931 243
Warmup
9322 96 % b 3 TestIntMax::test2 @ 5 (34 bytes)
9323 97 b 3 TestIntMax::test2 (34 bytes)
9324 98 % b 4 TestIntMax::test2 @ 5 (34 bytes)
9336 99 b 4 TestIntMax::test2 (34 bytes)
Run
Time: 1 077 937 282
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
class TestIntMax
{
static final int RANGE = 16 * 1024;
static final int ITER = 100_000;
public static void main(String[] args)
{
final int probability = Integer.parseInt(args[0]);
final DecimalFormatSymbols symbols = new DecimalFormatSymbols();
symbols.setGroupingSeparator(' ');
final DecimalFormat format = new DecimalFormat("#,###", symbols);
System.out.printf("Probability: %d%%%n", probability);
int[] a = new int[64 * 1024];
init(a, probability);
{
System.out.println("Warmup");
for (int i = 0; i < 10_000; i++)
{
test1(a);
}
System.out.println("Run");
long t0 = System.nanoTime();
for (int i = 0; i < 10_000; i++)
{
test1(a);
}
long t1 = System.nanoTime();
System.out.println("Time: " + format.format(t1 - t0));
}
{
System.out.println("Warmup");
for (int i = 0; i < 10_000; i++)
{
test2(a);
}
System.out.println("Run");
long t0 = System.nanoTime();
for (int i = 0; i < 10_000; i++)
{
test2(a);
}
long t1 = System.nanoTime();
System.out.println("Time: " + format.format(t1 - t0));
}
}
public static int test1(int[] a)
{
int x = Integer.MIN_VALUE;
for (int i = 0; i < a.length; i++)
{
x = Math.max(x, a[i]);
}
return x;
}
public static int test2(int[] a)
{
int x = Integer.MIN_VALUE;
for (int i = 0; i < a.length; i++)
{
x = (x >= a[i]) ? x : a[i];
}
return x;
}
public static void init(int[] ints, int probability)
{
int aboveCount, abovePercent;
do
{
int max = ThreadLocalRandom.current().nextInt(10);
ints[0] = max;
aboveCount = 0;
for (int i = 1; i < ints.length; i++)
{
int value;
if (ThreadLocalRandom.current().nextInt(101) <= probability)
{
int increment = ThreadLocalRandom.current().nextInt(10);
value = max + increment;
aboveCount++;
}
else
{
// Decrement by at least 1
int decrement = ThreadLocalRandom.current().nextInt(10) + 1;
value = max - decrement;
}
ints[i] = value;
max = Math.max(max, value);
}
abovePercent = ((aboveCount + 1) * 100) / ints.length;
} while (abovePercent != probability);
}
}
Focusing my comment below on 100% which is where the differences appear:
test2 (100%):
;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09
0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d
;*iaload
{reexecute=0 rethrow=0 return_oop=0}
; -
TestIntMax::test2@14 (line 71)
0x00007f15bcada2ee: cmpl %r11d, %r10d
0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1
{reexecute=0 rethrow=0 return_oop=0}
; -
TestIntMax::test2@25 (line 71)
test1 (100%)
;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64
strip mined) Freq: 1.6744e+09
0x00007f15bcad9a70: movl 0x4c(%rsi, %rdx, 4), %r11d
0x00007f15bcad9a75: movl %r11d, (%rsp)
0x00007f15bcad9a79: movl 0x48(%rsi, %rdx, 4), %r10d
0x00007f15bcad9a7e: movl %r10d, 4(%rsp)
0x00007f15bcad9a83: movl 0x10(%rsi, %rdx, 4), %r11d
0x00007f15bcad9a88: movl 0x14(%rsi, %rdx, 4), %r9d
0x00007f15bcad9a8d: movl 0x44(%rsi, %rdx, 4), %r10d
0x00007f15bcad9a92: movl %r10d, 8(%rsp)
0x00007f15bcad9a97: movl 0x18(%rsi, %rdx, 4), %r8d
0x00007f15bcad9a9c: cmpl %r11d, %eax
0x00007f15bcad9a9f: cmovll %r11d, %eax
0x00007f15bcad9aa3: cmpl %r9d, %eax
0x00007f15bcad9aa6: cmovll %r9d, %eax
0x00007f15bcad9aaa: movl 0x20(%rsi, %rdx, 4), %r10d
0x00007f15bcad9aaf: cmpl %r8d, %eax
0x00007f15bcad9ab2: cmovll %r8d, %eax
0x00007f15bcad9ab6: movl 0x24(%rsi, %rdx, 4), %r8d
0x00007f15bcad9abb: movl 0x28(%rsi, %rdx, 4), %r11d
; {no_reloc}
0x00007f15bcad9ac0: movl 0x2c(%rsi, %rdx, 4), %ecx
0x00007f15bcad9ac4: movl 0x30(%rsi, %rdx, 4), %r9d
0x00007f15bcad9ac9: movl 0x34(%rsi, %rdx, 4), %edi
0x00007f15bcad9acd: movl 0x38(%rsi, %rdx, 4), %ebx
0x00007f15bcad9ad1: movl 0x3c(%rsi, %rdx, 4), %ebp
0x00007f15bcad9ad5: movl 0x40(%rsi, %rdx, 4), %r13d
0x00007f15bcad9ada: movl 0x1c(%rsi, %rdx, 4), %r14d
0x00007f15bcad9adf: cmpl %r14d, %eax
0x00007f15bcad9ae2: cmovll %r14d, %eax
0x00007f15bcad9ae6: cmpl %r10d, %eax
0x00007f15bcad9ae9: cmovll %r10d, %eax
0x00007f15bcad9aed: cmpl %r8d, %eax
0x00007f15bcad9af0: cmovll %r8d, %eax
0x00007f15bcad9af4: cmpl %r11d, %eax
0x00007f15bcad9af7: cmovll %r11d, %eax
0x00007f15bcad9afb: cmpl %ecx, %eax
0x00007f15bcad9afd: cmovll %ecx, %eax
0x00007f15bcad9b00: cmpl %r9d, %eax
0x00007f15bcad9b03: cmovll %r9d, %eax
0x00007f15bcad9b07: cmpl %edi, %eax
0x00007f15bcad9b09: cmovll %edi, %eax
0x00007f15bcad9b0c: cmpl %ebx, %eax
0x00007f15bcad9b0e: cmovll %ebx, %eax
0x00007f15bcad9b11: cmpl %ebp, %eax
0x00007f15bcad9b13: cmovll %ebp, %eax
0x00007f15bcad9b16: cmpl %r13d, %eax
0x00007f15bcad9b19: cmovll %r13d, %eax
0x00007f15bcad9b1d: cmpl 8(%rsp), %eax
0x00007f15bcad9b21: movl 8(%rsp), %r11d
0x00007f15bcad9b26: cmovll %r11d, %eax
0x00007f15bcad9b2a: cmpl 4(%rsp), %eax
0x00007f15bcad9b2e: movl 4(%rsp), %r10d
0x00007f15bcad9b33: cmovll %r10d, %eax
0x00007f15bcad9b37: cmpl (%rsp), %eax
0x00007f15bcad9b3a: movl (%rsp), %r11d
0x00007f15bcad9b3e: cmovll %r11d, %eax ;*invokestatic max
{reexecute=0 rethrow=0 return_oop=0}
; -
TestIntMax::test1@15 (line 61)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050