Issue 176652
Summary [X86] manual `avg` optimizes poorly
Labels new issue
Assignees
Reporter folkertdev
In short, LLVM optimizes `src` into `tgt`, but `tgt` generates much worse code:

https://godbolt.org/z/dWaj6cvYe

```llvm
; with +avx2

define noundef i32 @src(<4 x i64> %x, <4 x i64> %y) {
bb2:
  %_4.i = bitcast <4 x i64> %x to <32 x i8>
  %0 = zext <32 x i8> %_4.i to <32 x i16>
  %_6.i = bitcast <4 x i64> %y to <32 x i8>
  %1 = zext <32 x i8> %_6.i to <32 x i16>
  %2 = add nuw nsw <32 x i16> %1, %0
  %3 = add nuw nsw <32 x i16> %2, splat (i16 1)
  %4 = lshr <32 x i16> %3, splat (i16 1)
  %5 = trunc nuw <32 x i16> %4 to <32 x i8>
  %_0.i = bitcast <32 x i8> %5 to <4 x i64>
  %6 = icmp sgt <32 x i8> zeroinitializer, %5
  %7 = bitcast <32 x i1> %6 to i32
  ret i32 %7
}

define noundef i32 @tgt(<4 x i64> %x, <4 x i64> %y) {
start:
  %_4.i = bitcast <4 x i64> %x to <32 x i8>
  %0 = zext <32 x i8> %_4.i to <32 x i16>
  %_6.i = bitcast <4 x i64> %y to <32 x i8>
  %1 = zext <32 x i8> %_6.i to <32 x i16>
  %2 = add nuw nsw <32 x i16> %0, splat (i16 1)
  %3 = add nuw nsw <32 x i16> %2, %1
  %4 = and <32 x i16> %3, splat (i16 256)
  %5 = icmp ne <32 x i16> %4, zeroinitializer
  %6 = bitcast <32 x i1> %5 to i32
  ret i32 %6
}
```

Specifically (the full example and optimization pipeline are at https://rust.godbolt.org/z/crf61YKMj), `InstCombinePass` turns

```llvm
  %4 = lshr <32 x i16> %3, splat (i16 1)
  %5 = trunc nuw <32 x i16> %4 to <32 x i8>
  %_0.i = bitcast <32 x i8> %5 to <4 x i64>
  %6 = icmp sgt <32 x i8> zeroinitializer, %5
```

into

```llvm
  %4 = and <32 x i16> %3, splat (i16 256)
  %5 = icmp ne <32 x i16> %4, zeroinitializer
```
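The rewrite itself is sound: bit 7 of the truncated average is bit 8 of the widened sum `x + y + 1`, so the sign test on the `i8` result is the same as masking the `i16` sum with 256. As a sanity check, here is a minimal scalar sketch of one lane in Rust. The function names `src_lane`, `tgt_lane` and `lanes_agree` are made up for illustration and do not appear in the original reproducer.

```rust
/// One lane of `src`: sign-bit test on the rounded-up byte average.
fn src_lane(x: u8, y: u8) -> bool {
    // Widen to 16 bits so that x + y + 1 cannot overflow (maximum is 511).
    let sum = x as u16 + y as u16 + 1;
    // The shifted value is at most 255, so the truncation loses nothing.
    let avg = (sum >> 1) as u8;
    // Corresponds to `icmp sgt <32 x i8> zeroinitializer, %5`.
    (avg as i8) < 0
}

/// One lane of `tgt`: test bit 8 of the widened sum directly.
fn tgt_lane(x: u8, y: u8) -> bool {
    let sum = x as u16 + y as u16 + 1;
    sum & 256 != 0
}

/// Exhaustively compare both forms over all 65536 input pairs.
fn lanes_agree() -> bool {
    (0..=u8::MAX).all(|x| (0..=u8::MAX).all(|y| src_lane(x, y) == tgt_lane(x, y)))
}
```

Calling `lanes_agree()` confirms that the two forms match for every pair of bytes, so the problem is purely one of codegen quality, not correctness.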

In isolation that does look better, but the `avg` pattern is no longer recognized, and the remaining operations are on the `<32 x i16>` type, which is not legal here, causing terrible codegen:

```asm
src:
        vpavgb  ymm0, ymm1, ymm0
        vpmovmskb       eax, ymm0
        vzeroupper
        ret
tgt:
        vpmovzxbw       ymm2, xmm0
        vextracti128    xmm0, ymm0, 1
        vpmovzxbw       ymm0, xmm0
        vpmovzxbw       ymm3, xmm1
        vpaddw  ymm2, ymm2, ymm3
        vextracti128    xmm1, ymm1, 1
        vpmovzxbw       ymm1, xmm1
        vpaddw  ymm0, ymm0, ymm1
        vpcmpeqd        ymm1, ymm1, ymm1
        vpsubw  ymm2, ymm2, ymm1
        vpsubw  ymm0, ymm0, ymm1
        vpsllw  ymm0, ymm0, 7
        vpsllw  ymm1, ymm2, 7
        vpacksswb       ymm0, ymm1, ymm0
        vpermq  ymm0, ymm0, 216
        vpmovmskb       eax, ymm0
        vzeroupper
        ret
```

This was reported in https://github.com/rust-lang/rust/issues/124216. https://github.com/llvm/llvm-project/issues/132166 is tangentially related.

I'm not sure whether the `avg` can reasonably be recovered, but it should be possible to do better than the current output.
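One reason to think better is possible: the tested bit does not actually need 16-bit lanes. The well-known identity `ceil((x + y) / 2) == (x | y) - ((x ^ y) >> 1)` computes the rounded-up average entirely at byte width. The Rust sketch below checks that identity against the widened form; it is only an illustration, not a proposal for what LLVM should emit, and the names `avg_ceil_u8`, `avg_ceil_wide` and `byte_form_matches` are hypothetical.

```rust
/// Rounded-up average computed entirely in 8 bits.
/// Cannot underflow because (x | y) >= (x ^ y) >= (x ^ y) >> 1.
fn avg_ceil_u8(x: u8, y: u8) -> u8 {
    (x | y) - ((x ^ y) >> 1)
}

/// Reference: the widened form used in `src`.
fn avg_ceil_wide(x: u8, y: u8) -> u8 {
    ((x as u16 + y as u16 + 1) >> 1) as u8
}

/// Exhaustively check the byte-only form, including the sign bit
/// that the final mask extraction looks at.
fn byte_form_matches() -> bool {
    (0..=u8::MAX).all(|x| {
        (0..=u8::MAX).all(|y| {
            let narrow = avg_ceil_u8(x, y);
            let wide = avg_ceil_wide(x, y);
            narrow == wide && (narrow & 0x80 != 0) == (wide & 0x80 != 0)
        })
    })
}
```

`byte_form_matches()` holds for all inputs, so the information `tgt` extracts from bit 8 of the `i16` sum is still available without widening. Whether the backend should recover `vpavgb` or use some other byte-wide sequence is a separate question.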
