Issue 178703
Summary Cost and register usage in some partial reductions doesn't match the generated instructions
Labels backend:AArch64, vectorizers
Assignees
Reporter john-brawn-arm
Consider these two functions:
```
int test1(int n, char *a, char *b) {
  int accum = 0;
  for (int i = 0; i < n; i++) {
    accum += a[i] * b[i];
  }
  return accum;
}

int test2(int n, char *a, char *b) {
  int accum = 0;
  for (int i = 0; i < n; i++) {
    accum -= a[i] * b[i];
  }
  return accum;
}
```
Compiling with ``clang --target=aarch64-none-elf -mcpu=neoverse-v1 -O3 -mllvm -force-vector-interleave=1``, the vector loops generated for these are:
```
test1:
.LBB0_7:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldr     q1, [x12], #16
        ldr     q2, [x13], #16
        subs    x14, x14, #16
        udot    v0.4s, v2.16b, v1.16b
        b.ne    .LBB0_7

test2:
.LBB1_7:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldr     q1, [x12], #16
        ldr     q2, [x13], #16
        subs    x14, x14, #16
        umull2  v3.8h, v2.16b, v1.16b
        umull   v1.8h, v2.8b, v1.8b
        usubw   v0.4s, v0.4s, v1.4h
        usubw2  v0.4s, v0.4s, v1.8h
        usubw   v0.4s, v0.4s, v3.4h
        usubw2  v0.4s, v0.4s, v3.8h
        b.ne    .LBB1_7
```
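To make the test1 loop concrete, here is a rough scalar model of what one `udot` step computes, compared against the original scalar loop. It is only a sketch: the helper names (`udot_model`, `test1_scalar`, `test1_vector_model`) are made up for illustration, `char` is written as `uint8_t` on the assumption that plain `char` is unsigned for this target, and the trip count is assumed to be a multiple of 16 so that no scalar tail is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one `udot vd.4s, vn.16b, vm.16b` step: each 32-bit
   accumulator lane gains the sum of four adjacent byte products. */
static void udot_model(uint32_t acc[4], const uint8_t n[16], const uint8_t m[16]) {
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)n[4 * lane + j] * m[4 * lane + j];
}

/* Reference: the scalar loop from test1, with char assumed unsigned. */
static uint32_t test1_scalar(size_t n, const uint8_t *a, const uint8_t *b) {
    uint32_t accum = 0;
    for (size_t i = 0; i < n; i++)
        accum += (uint32_t)a[i] * b[i];
    return accum;
}

/* Model of the vector loop: one udot per 16 bytes, then a final
   horizontal add of the four lanes (hypothetical helper, not LLVM code). */
static uint32_t test1_vector_model(size_t n, const uint8_t *a, const uint8_t *b) {
    assert(n % 16 == 0); /* assumption: no scalar remainder loop */
    uint32_t acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i += 16)
        udot_model(acc, a + i, b + i);
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Calling `test1_vector_model` and `test1_scalar` on the same buffers should give the same value, which is why a single `udot` per 16 bytes is enough for test1.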
Looking at what the vectorizer is doing with ``-mllvm -debug``, it says:
```
LV: Checking a loop in 'test1' from tmp.c
Cost of 1 for VF 16: EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32))
LV(REG): Calculating max register usage:
LV(REG): Scaled down VF from 8 to 2 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32))
LV(REG): Scaled down VF from 16 to 4 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32))

LV: Checking a loop in 'test2' from tmp.c
Cost of 1 for VF 16: EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (sub (0, mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32)))
LV(REG): Calculating max register usage:
LV(REG): Scaled down VF from 8 to 2 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (sub (0, mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32)))
LV(REG): Scaled down VF from 16 to 4 for EXPRESSION vp<%9> = ir<%accum.09> + partial.reduce.add (sub (0, mul nuw nsw (ir<%1> zext to i32), (ir<%0> zext to i32)))
```
It thinks that the cost of both reductions is the same, and that the relevant type in both for calculating register usage is v4i32 (the result type), so the register usage is 1. In test1 the reduction becomes a single udot instruction, so this looks correct. In test2, however, we get a sequence of 6 instructions and use one extra register, because the intermediate v16i16 gets split into two v8i16. So for test2 both the cost and the register usage are wrong.
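For illustration, here is a rough lane-by-lane model of one iteration of the test2 vector body. It is only a sketch, not what LLVM does internally: the names `test2_step_model`, `horizontal_sum` and `q_regs_for` are made up, `char` is again assumed to be unsigned, and the register count is just the vector width divided by 128 bits. It shows the 16-lane i16 product living in two 128-bit halves before it is folded into the v4i32 accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* One iteration of the test2 vector body, modelled lane by lane. */
static void test2_step_model(uint32_t acc[4], const uint8_t n[16], const uint8_t m[16]) {
    uint16_t lo[8], hi[8]; /* the v16i16 product, split into two v8i16 halves */
    for (int i = 0; i < 8; i++) {
        lo[i] = (uint16_t)(n[i] * m[i]);         /* umull  */
        hi[i] = (uint16_t)(n[8 + i] * m[8 + i]); /* umull2 */
    }
    for (int i = 0; i < 4; i++) {
        acc[i] -= lo[i];     /* usubw  with the low half of lo  */
        acc[i] -= lo[4 + i]; /* usubw2 with the high half of lo */
        acc[i] -= hi[i];     /* usubw  with the low half of hi  */
        acc[i] -= hi[4 + i]; /* usubw2 with the high half of hi */
    }
}

/* Final reduction of the four accumulator lanes. */
static uint32_t horizontal_sum(const uint32_t acc[4]) {
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Number of 128-bit registers needed for `lanes` elements of `bits` bits
   (a simplification that ignores everything except the vector width). */
static unsigned q_regs_for(unsigned lanes, unsigned bits) {
    assert(lanes > 0 && bits > 0);
    return (lanes * bits + 127) / 128;
}
```

Under these assumptions `q_regs_for(4, 32)` is 1 while `q_regs_for(16, 16)` is 2, which matches the gap between the v4i32 type the vectorizer uses for its estimate and the intermediate that the generated code has to hold.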
