Issue 169702
Summary NeonNonConstStrideOverhead is too high
Labels new issue
Assignees
Reporter mwlon
    I have a small piece of extremely performance-sensitive decompression code ([godbolt with code](https://godbolt.org/z/fof18KM8E)) that should vectorize for optimal performance but does not on aarch64+neon. I have manually vectorized it and gotten substantially improved performance; and with some hacks that I can't apply generally I found a way to make LLVM vectorize it in on case, and also observed improved performance. I believe this code should clearly be vectorized by LLVM.

To debug, I built Rust with LLVM debug on and observed how it was reasoning about the cost of vectorization:

```
LV: Found an estimated cost of 0 for VF 1 For instruction:   %22 = phi i64 [ 0, %17 ], [ %23, %21 ]
LV: Found an estimated cost of 1 for VF 1 For instruction:   %23 = add nuw nsw i64 %22, 1
LV: Found an estimated cost of 0 for VF 1 For instruction:   %24 = getelementptr inbounds [0 x i32], ptr %3, i64 0, i64 %22
LV: Found an estimated cost of 2 for VF 1 For instruction:   %25 = load i32, ptr %24, align 4, !noundef !3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %26 = add i32 %25, %18
LV: Found an estimated cost of 1 for VF 1 For instruction:   %27 = lshr i32 %26, 3
LV: Found an estimated cost of 0 for VF 1 For instruction:   %28 = zext nneg i32 %27 to i64
LV: Found an estimated cost of 0 for VF 1 For instruction:   %29 = getelementptr inbounds i8, ptr %1, i64 %28
LV: Found an estimated cost of 2 for VF 1 For instruction:   %30 = load i64, ptr %29, align 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %31 = and i32 %26, 7
LV: Found an estimated cost of 0 for VF 1 For instruction:   %32 = zext nneg i32 %31 to i64
LV: Found an estimated cost of 1 for VF 1 For instruction:   %33 = lshr i64 %30, %32
LV: Found an estimated cost of 0 for VF 1 For instruction:   %34 = getelementptr inbounds [0 x i32], ptr %5, i64 0, i64 %22
LV: Found an estimated cost of 2 for VF 1 For instruction:   %35 = load i32, ptr %34, align 4, !noundef !3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %36 = and i32 %35, 63
LV: Found an estimated cost of 0 for VF 1 For instruction:   %37 = zext nneg i32 %36 to i64
LV: Found an estimated cost of 1 for VF 1 For instruction:   %38 = shl nsw i64 -1, %37
LV: Found an estimated cost of 1 for VF 1 For instruction:   %39 = xor i64 %38, -1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %40 = and i64 %33, %39
LV: Found an estimated cost of 0 for VF 1 For instruction:   %41 = getelementptr inbounds [0 x i64], ptr %7, i64 0, i64 %22
LV: Found an estimated cost of 2 for VF 1 For instruction:   %42 = load i64, ptr %41, align 8, !noundef !3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %43 = add i64 %40, %42
LV: Found an estimated cost of 2 for VF 1 For instruction:   store i64 %43, ptr %41, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %44 = icmp eq i64 %23, 256
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %44, label %20, label %21
LV: Scalar loop costs: 21.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %22 = phi i64 [ 0, %17 ], [ %23, %21 ]
LV: Found an estimated cost of 1 for VF 2 For instruction:   %23 = add nuw nsw i64 %22, 1
LV: Found an estimated cost of 0 for VF 2 For instruction:   %24 = getelementptr inbounds [0 x i32], ptr %3, i64 0, i64 %22
LV: Found an estimated cost of 4 for VF 2 For instruction:   %25 = load i32, ptr %24, align 4, !noundef !3
LV: Found an estimated cost of 2 for VF 2 For instruction:   %26 = add i32 %25, %18
LV: Found an estimated cost of 2 for VF 2 For instruction:   %27 = lshr i32 %26, 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   %28 = zext nneg i32 %27 to i64
LV: Found an estimated cost of 0 for VF 2 For instruction:   %29 = getelementptr inbounds i8, ptr %1, i64 %28
LV: Found an estimated cost of 26 for VF 2 For instruction:   %30 = load i64, ptr %29, align 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %31 = and i32 %26, 7
LV: Found an estimated cost of 0 for VF 2 For instruction:   %32 = zext nneg i32 %31 to i64
LV: Found an estimated cost of 1 for VF 2 For instruction:   %33 = lshr i64 %30, %32
LV: Found an estimated cost of 0 for VF 2 For instruction:   %34 = getelementptr inbounds [0 x i32], ptr %5, i64 0, i64 %22
LV: Found an estimated cost of 1 for VF 2 For instruction:   %35 = load i32, ptr %34, align 4, !noundef !3
LV: Found an estimated cost of 1 for VF 2 For instruction:   %36 = and i32 %35, 63
LV: Found an estimated cost of 0 for VF 2 For instruction:   %37 = zext nneg i32 %36 to i64
LV: Found an estimated cost of 1 for VF 2 For instruction:   %38 = shl nsw i64 -1, %37
LV: Found an estimated cost of 1 for VF 2 For instruction:   %39 = xor i64 %38, -1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %40 = and i64 %33, %39
LV: Found an estimated cost of 0 for VF 2 For instruction:   %41 = getelementptr inbounds [0 x i64], ptr %7, i64 0, i64 %22
LV: Found an estimated cost of 1 for VF 2 For instruction:   %42 = load i64, ptr %41, align 8, !noundef !3
LV: Found an estimated cost of 1 for VF 2 For instruction:   %43 = add i64 %40, %42
LV: Found an estimated cost of 1 for VF 2 For instruction:   store i64 %43, ptr %41, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %44 = icmp eq i64 %23, 256
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %44, label %20, label %21
LV: Vector loop of width 2 costs: 23.
...
LV: Found an estimated cost of 52 for VF 4 For instruction:   %21 = load i64, ptr %20, align 1
...
LV: Vector loop of width 2 costs: 22.
```

The reason it doesn't vectorize is the enormous load cost for the one random-access lookup. It is true that Neon lacks a vector gather instruction, but the cost shouldn't be nearly as high as 26 (VF=2) or 52 (VF=4); here's what assembly for VF=4 might look like:
```
        mov     w15, v6.s[2]
        mov w14, v6.s[1]
        mov     w16, v6.s[3]
        fmov    w17, s6
 ldr     d16, [x1, w17, uxtw]
        ldr     d17, [x1, w15, uxtw]
 add     x14, x1, x14
        add     x16, x1, x16
        ld1     { v16.d }[1], [x14]
        ld1     { v17.d }[1], [x16]
```

I think it would be more realistic to model the cost here as 14-16 instead of 52; the `mov`s and `add`s just take a cycle, and the loads go directly into vector registers. I checked the code, and the cost of 52 appears to come mainly from the [NeonNonConstStrideOverhead constant](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp#L53) of 10 (multiplied by VF=4 explains 40 of the cost). I don't see any explanation for why this cost is so high, and empirically it isn't for my program and CPU.

Can we reduce this constant substantially?
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

Reply via email to