Issue 91302
Summary SIMD instructions that write directly to `k` registers on AVX512-enabled machines are slower than equivalent AVX/AVX2 routines
Labels new issue
Assignees
Reporter Validark
I wrote some test code:

```zig
const Chunk = @Vector(32, u8);

export fn foo(vec1: Chunk, vec2: Chunk, vec3: Chunk, vec4: Chunk) Chunk {
    const true_vec = @as(Chunk, @splat(0xFF));
    const false_vec = @as(Chunk, @splat(0));
    return @select(u8, vec1 == vec2, true_vec, false_vec) |
           @select(u8, vec3 == vec4, true_vec, false_vec);
}
```

This compiles to the following LLVM IR:

```llvm
define dso_local <32 x i8> @foo(<32 x i8> %0, <32 x i8> %1, <32 x i8> %2, <32 x i8> %3) local_unnamed_addr {
Entry:
  %4 = icmp eq <32 x i8> %0, %1
  %5 = icmp eq <32 x i8> %2, %3
  %6 = or <32 x i1> %5, %4
  %7 = sext <32 x i1> %6 to <32 x i8>
  ret <32 x i8> %7
}

declare void @llvm.dbg.value(metadata, metadata, metadata) #1
```

For Zen 3, I get the following emit:

```asm
foo:
        vpcmpeqb        ymm0, ymm0, ymm1
        vpcmpeqb        ymm2, ymm2, ymm3
        vpor            ymm0, ymm2, ymm0
        ret
```

For Zen 4, I get the following emit:

```asm
foo:
        vpcmpeqb        k0, ymm0, ymm1
        vpcmpeqb        k1, ymm2, ymm3
        kord    k0, k1, k0
        vpmovm2b        ymm0, k0
        ret
```

[Godbolt link](https://zig.godbolt.org/z/dKGo9n9KW)

On the one hand, I can see how the LLVM IR maps trivially onto the AVX512 way of doing things; on the other hand, I wonder whether this is a step backwards in terms of performance. On Zen 4, [according to uops.info](https://uops.info/table.html?search=vpcmpeqb&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ADLP=on&cb_ZEN4=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_aes=on&cb_avx=on&cb_avx2=on&cb_avx512=on), [VPCMPEQB (YMM, YMM, YMM)](https://uops.info/html-instr/VPCMPEQB_YMM_YMM_YMM.html) has a latency of 1 cycle and a reciprocal throughput of 0.25, whereas [VPCMPEQB_EVEX (K, YMM, YMM)](https://uops.info/html-instr/VPCMPEQB_EVEX_K_YMM_YMM.html) has a latency of 4 cycles and a reciprocal throughput of 0.5. The `vpor` and `kord` instructions both have a latency of 1, and [VPMOVM2B (YMM, K)](https://uops.info/html-instr/VPMOVM2B_YMM_K.html) also has a latency of 1 on Zen 4.

That means that in the Zen 3 code, both `vpcmpeqb` instructions can run in the same cycle and the `vpor` runs in the next cycle, for 2 cycles in total. In the Zen 4 code, both `vpcmpeqb` instructions can likewise run simultaneously, but they take 4 cycles, and the remaining two instructions take 1 cycle each, for a total of 6 cycles. So the AVX/AVX2 code should be ~3 times faster, based on an analysis that considers only latency and throughput. The AVX512 code is also more constrained in terms of port usage.
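To make the comparison concrete, here is a cycle-by-cycle sketch of the two critical paths using the uops.info latencies quoted above (this assumes perfect scheduling and ignores port pressure and throughput limits):

```asm
# Zen 3 / AVX2 lowering: critical path ~2 cycles
#   cycle 1: vpcmpeqb ymm0, ymm0, ymm1    (lat 1)  both compares can issue together
#   cycle 1: vpcmpeqb ymm2, ymm2, ymm3    (lat 1)
#   cycle 2: vpor     ymm0, ymm2, ymm0    (lat 1)

# Zen 4 / AVX512 lowering: critical path ~6 cycles
#   cycles 1-4: vpcmpeqb k0, ymm0, ymm1   (lat 4)  both compares can issue together
#   cycles 1-4: vpcmpeqb k1, ymm2, ymm3   (lat 4)
#   cycle 5:    kord     k0, k1, k0       (lat 1)
#   cycle 6:    vpmovm2b ymm0, k0         (lat 1)
```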