| Issue | 176985 |
| Summary | [AArch64] basic uint64x2_t shift+add gets scalarized |
| Labels | new issue |
| Assignees | |
| Reporter | dzaima |

This code:
```c
#include <arm_neon.h>

// Computes a * 257 in each 64-bit lane as a + (a << 8).
uint64x2_t mul_257(uint64x2_t a) {
  return vaddq_u64(a, vshlq_n_u64(a, 8));
}
```
at `-O3` gets scalarized to:
```asm
mul_257:
fmov x9, d0
mov x8, v0.d[1]
add x9, x9, x9, lsl #8
add x8, x8, x8, lsl #8
fmov d0, x9
mov v0.d[1], x8
ret
```
instead of the simple, direct, and faster sequence (which also has much lower latency on cores with high NEON↔GPR transfer latency):
```asm
mul_257:
shl v31.2d, v0.2d, 8
add v0.2d, v31.2d, v0.2d
ret
```
https://c.godbolt.org/z/rjrbcbvq7

The same applies to a bunch of similar cases that add or subtract a power of two. I'd imagine that even more complex multiply-emulating sequences with many shifts may be worth keeping in SIMD instead of being scalarized.
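
To make the arithmetic behind these patterns concrete, here is a minimal portable sketch. It assumes the cases in question have the form `a + (a << k)` and `(a << k) - a`, that is, multiplication by `2^k + 1` and `2^k - 1` modulo 2^64. The type `u64x2_model` and the helpers `shift_add`, `shift_sub`, and `check_identities` are hypothetical names for a scalar model of a two-lane vector; they are not NEON intrinsics and not part of the report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a uint64x2_t: two independent 64-bit lanes. */
typedef struct { uint64_t lane[2]; } u64x2_model;

/* Lane-wise a + (a << k); with k = 8 this is the mul_257 pattern. Needs k < 64. */
u64x2_model shift_add(u64x2_model a, unsigned k) {
    u64x2_model r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = a.lane[i] + (a.lane[i] << k);
    return r;
}

/* Lane-wise (a << k) - a, the assumed subtracting counterpart. Needs k < 64. */
u64x2_model shift_sub(u64x2_model a, unsigned k) {
    u64x2_model r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = (a.lane[i] << k) - a.lane[i];
    return r;
}

/* Asserts that both forms match a plain multiply, with wraparound mod 2^64. */
void check_identities(u64x2_model a, unsigned k) {
    uint64_t pow2 = (uint64_t)1 << k;
    u64x2_model sum = shift_add(a, k);
    u64x2_model diff = shift_sub(a, k);
    for (int i = 0; i < 2; i++) {
        assert(sum.lane[i] == a.lane[i] * (pow2 + 1));
        assert(diff.lane[i] == a.lane[i] * (pow2 - 1));
    }
}
```

Calling `check_identities((u64x2_model){{1, UINT64_MAX}}, 8)` exercises both a small value and a lane that wraps around. Each lane is computed independently of the other, which is why the whole sequence can stay in vector registers.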