| Issue |
169702
|
| Summary |
NeonNonConstStrideOverhead is too high
|
| Labels |
new issue
|
| Assignees |
|
| Reporter |
mwlon
|
I have a small piece of extremely performance-sensitive decompression code ([godbolt with code](https://godbolt.org/z/fof18KM8E)) that should vectorize for optimal performance but does not on aarch64+neon. I have manually vectorized it and gotten substantially improved performance; and with some hacks that I can't apply generally I found a way to make LLVM vectorize it in on case, and also observed improved performance. I believe this code should clearly be vectorized by LLVM.
To debug, I built Rust with LLVM debug on and observed how it was reasoning about the cost of vectorization:
```
LV: Found an estimated cost of 0 for VF 1 For instruction: %22 = phi i64 [ 0, %17 ], [ %23, %21 ]
LV: Found an estimated cost of 1 for VF 1 For instruction: %23 = add nuw nsw i64 %22, 1
LV: Found an estimated cost of 0 for VF 1 For instruction: %24 = getelementptr inbounds [0 x i32], ptr %3, i64 0, i64 %22
LV: Found an estimated cost of 2 for VF 1 For instruction: %25 = load i32, ptr %24, align 4, !noundef !3
LV: Found an estimated cost of 1 for VF 1 For instruction: %26 = add i32 %25, %18
LV: Found an estimated cost of 1 for VF 1 For instruction: %27 = lshr i32 %26, 3
LV: Found an estimated cost of 0 for VF 1 For instruction: %28 = zext nneg i32 %27 to i64
LV: Found an estimated cost of 0 for VF 1 For instruction: %29 = getelementptr inbounds i8, ptr %1, i64 %28
LV: Found an estimated cost of 2 for VF 1 For instruction: %30 = load i64, ptr %29, align 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %31 = and i32 %26, 7
LV: Found an estimated cost of 0 for VF 1 For instruction: %32 = zext nneg i32 %31 to i64
LV: Found an estimated cost of 1 for VF 1 For instruction: %33 = lshr i64 %30, %32
LV: Found an estimated cost of 0 for VF 1 For instruction: %34 = getelementptr inbounds [0 x i32], ptr %5, i64 0, i64 %22
LV: Found an estimated cost of 2 for VF 1 For instruction: %35 = load i32, ptr %34, align 4, !noundef !3
LV: Found an estimated cost of 1 for VF 1 For instruction: %36 = and i32 %35, 63
LV: Found an estimated cost of 0 for VF 1 For instruction: %37 = zext nneg i32 %36 to i64
LV: Found an estimated cost of 1 for VF 1 For instruction: %38 = shl nsw i64 -1, %37
LV: Found an estimated cost of 1 for VF 1 For instruction: %39 = xor i64 %38, -1
LV: Found an estimated cost of 1 for VF 1 For instruction: %40 = and i64 %33, %39
LV: Found an estimated cost of 0 for VF 1 For instruction: %41 = getelementptr inbounds [0 x i64], ptr %7, i64 0, i64 %22
LV: Found an estimated cost of 2 for VF 1 For instruction: %42 = load i64, ptr %41, align 8, !noundef !3
LV: Found an estimated cost of 1 for VF 1 For instruction: %43 = add i64 %40, %42
LV: Found an estimated cost of 2 for VF 1 For instruction: store i64 %43, ptr %41, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %44 = icmp eq i64 %23, 256
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %44, label %20, label %21
LV: Scalar loop costs: 21.
LV: Found an estimated cost of 0 for VF 2 For instruction: %22 = phi i64 [ 0, %17 ], [ %23, %21 ]
LV: Found an estimated cost of 1 for VF 2 For instruction: %23 = add nuw nsw i64 %22, 1
LV: Found an estimated cost of 0 for VF 2 For instruction: %24 = getelementptr inbounds [0 x i32], ptr %3, i64 0, i64 %22
LV: Found an estimated cost of 4 for VF 2 For instruction: %25 = load i32, ptr %24, align 4, !noundef !3
LV: Found an estimated cost of 2 for VF 2 For instruction: %26 = add i32 %25, %18
LV: Found an estimated cost of 2 for VF 2 For instruction: %27 = lshr i32 %26, 3
LV: Found an estimated cost of 0 for VF 2 For instruction: %28 = zext nneg i32 %27 to i64
LV: Found an estimated cost of 0 for VF 2 For instruction: %29 = getelementptr inbounds i8, ptr %1, i64 %28
LV: Found an estimated cost of 26 for VF 2 For instruction: %30 = load i64, ptr %29, align 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %31 = and i32 %26, 7
LV: Found an estimated cost of 0 for VF 2 For instruction: %32 = zext nneg i32 %31 to i64
LV: Found an estimated cost of 1 for VF 2 For instruction: %33 = lshr i64 %30, %32
LV: Found an estimated cost of 0 for VF 2 For instruction: %34 = getelementptr inbounds [0 x i32], ptr %5, i64 0, i64 %22
LV: Found an estimated cost of 1 for VF 2 For instruction: %35 = load i32, ptr %34, align 4, !noundef !3
LV: Found an estimated cost of 1 for VF 2 For instruction: %36 = and i32 %35, 63
LV: Found an estimated cost of 0 for VF 2 For instruction: %37 = zext nneg i32 %36 to i64
LV: Found an estimated cost of 1 for VF 2 For instruction: %38 = shl nsw i64 -1, %37
LV: Found an estimated cost of 1 for VF 2 For instruction: %39 = xor i64 %38, -1
LV: Found an estimated cost of 1 for VF 2 For instruction: %40 = and i64 %33, %39
LV: Found an estimated cost of 0 for VF 2 For instruction: %41 = getelementptr inbounds [0 x i64], ptr %7, i64 0, i64 %22
LV: Found an estimated cost of 1 for VF 2 For instruction: %42 = load i64, ptr %41, align 8, !noundef !3
LV: Found an estimated cost of 1 for VF 2 For instruction: %43 = add i64 %40, %42
LV: Found an estimated cost of 1 for VF 2 For instruction: store i64 %43, ptr %41, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %44 = icmp eq i64 %23, 256
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %44, label %20, label %21
LV: Vector loop of width 2 costs: 23.
...
LV: Found an estimated cost of 52 for VF 4 For instruction: %21 = load i64, ptr %20, align 1
...
LV: Vector loop of width 2 costs: 22.
```
The reason it doesn't vectorize is the enormous load cost for the one random-access lookup. It is true that Neon lacks a vector gather instruction, but the cost shouldn't be nearly as high as 26 (VF=2) or 52 (VF=4); here's what assembly for VF=4 might look like:
```
mov w15, v6.s[2]
mov w14, v6.s[1]
mov w16, v6.s[3]
fmov w17, s6
ldr d16, [x1, w17, uxtw]
ldr d17, [x1, w15, uxtw]
add x14, x1, x14
add x16, x1, x16
ld1 { v16.d }[1], [x14]
ld1 { v17.d }[1], [x16]
```
I think it would be more realistic to model the cost here as 14-16 instead of 52; the `mov`s and `add`s just take a cycle, and the loads go directly into vector registers. I checked the code, and the cost of 52 appears to come mainly from the [NeonNonConstStrideOverhead constant](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp#L53) of 10 (multiplied by VF=4 explains 40 of the cost). I don't see any explanation for why this cost is so high, and empirically it isn't for my program and CPU.
Can we reduce this constant substantially?
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs