Issue 176985
Summary [AArch64] basic uint64x2_t shift+add gets scalarized
Labels new issue
Assignees
Reporter dzaima
This code:
```c
uint64x2_t mul_257(uint64x2_t a) {
  return vaddq_u64(a, vshlq_n_u64(a, 8));
}
```
at `-O3` gets scalarized to:
```asm
mul_257:
        fmov    x9, d0
        mov     x8, v0.d[1]
        add     x9, x9, x9, lsl #8
        add     x8, x8, x8, lsl #8
        fmov    d0, x9
        mov     v0.d[1], x8
        ret
```
instead of the simpler, more direct, and faster vector sequence (which also has much lower latency on cores where NEON↔GPR transfers are slow):
```asm
mul_257:
        shl     v31.2d, v0.2d, 8
        add     v0.2d, v31.2d, v0.2d
        ret
```

https://c.godbolt.org/z/rjrbcbvq7

The same applies to a range of similar cases that add or subtract a power-of-two multiple of the input. I'd imagine that even more complex multiply-emulating sequences with many shifts may be worth keeping in SIMD registers rather than scalarizing.
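For reference, here is a minimal sketch of the add and subtract variants, written with the GCC/Clang `vector_size` extension instead of the NEON intrinsics so that it compiles on any target. The names `u64x2`, `mul_257_generic`, and `mul_255_generic` are hypothetical and not part of the original report, and whether this form hits the same scalarization has not been checked here.

```c
#include <assert.h>
#include <stdint.h>

/* Two 64-bit lanes via the GCC/Clang vector extension (an assumption made
   for portability; the report itself uses uint64x2_t from <arm_neon.h>). */
typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* a + (a << 8) equals a * 257 in each lane, modulo 2^64. */
static u64x2 mul_257_generic(u64x2 a) {
  const u64x2 shift = {8, 8};
  return a + (a << shift);
}

/* (a << 8) - a equals a * 255 in each lane, modulo 2^64. */
static u64x2 mul_255_generic(u64x2 a) {
  const u64x2 shift = {8, 8};
  return (a << shift) - a;
}

/* Checks both lanes against a plain scalar multiply. */
static void check_lanes(u64x2 got, u64x2 in, uint64_t factor) {
  assert(got[0] == in[0] * factor);
  assert(got[1] == in[1] * factor);
}
```

Calling `check_lanes(mul_257_generic(a), a, 257)` for some input `a` confirms that the shift+add form and the multiply agree lane by lane, which is the equivalence that lets the compiler choose between the two lowerings.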