Issue 156206
Summary [Clang][x86] Bad codegen for assume_aligned with structured binding packs
Labels clang
Assignees
Reporter kaimfrai
I've been trying to reduce intrinsic-dependent code for SIMD and rely more on compiler optimizations. With structured binding packs this becomes very terse, easy to read, and generic. That is where I noticed the following: Clang does not emit an aligned load for `std::assume_aligned` when it is used with a pack of indices:

```
#include <array>
#include <memory>
#include <numeric>

template<typename T, int N>
using vec [[gnu::vector_size(sizeof(T) * N)]] = T;

template<int N>
consteval auto Indices()
{
    std::array<int, N> arr;
    std::iota(arr.begin(), arr.end(), 0);
    return arr;
}

template<int A, typename T, int N>
auto load1(T* p) -> vec<T, N>
{
    auto const [...i] = Indices<N>();
    return vec<T, N>{ std::assume_aligned<alignof(T) * A>(p)[i]... };
}

auto a1 = load1<16, float, 16>;
auto u1 = load1<1, float, 16>;
```
I would expect a `vmovaps` and a `vmovups`, respectively, when compiling with `-std=c++26 -O3 -mavx512f -mavx512vl`, but instead it becomes this:
```
        vmovsd          xmm0, qword ptr [rdi]
        vmovsd          xmm1, qword ptr [rdi + 8]
        vmovsd          xmm2, qword ptr [rdi + 16]
        vmovsd          xmm3, qword ptr [rdi + 32]
        vinsertf128     ymm0, ymm0, xmm2, 1
        vbroadcastsd    ymm2, qword ptr [rdi + 24]
        vunpcklpd       ymm0, ymm0, ymm1
        vblendpd        ymm0, ymm0, ymm2, 8
        vinsertf32x4    zmm0, zmm0, xmm3, 2
        vmovsd          xmm1, qword ptr [rdi + 40]
        vmovapd         zmm2, zmmword ptr [rip + .LCPI0_0]
        vpermi2pd       zmm2, zmm0, zmm1
        vmovsd          xmm0, qword ptr [rdi + 48]
        vinsertf32x4    zmm1, zmm2, xmm0, 3
        vmovsd          xmm2, qword ptr [rdi + 56]
        vmovapd         zmm0, zmmword ptr [rip + .LCPI0_1]
        vpermi2pd       zmm0, zmm1, zmm2
        ret
```
Even when written by hand without a pack, the result is the same. In that case you would most likely use a separate variable anyway, which makes the issue go away, so this is mostly about the convenience of writing it in a single expression. GCC optimizes it as expected, and up to Clang 16 in AVX2 mode this also optimizes to a single instruction. However, as soon as `-mavx512vl` is set, even with a vector of only 8 floats, the codegen suddenly becomes much worse:
```
        vmovsd          xmm0, qword ptr [rdi]
        vmovsd          xmm1, qword ptr [rdi + 8]
        vmovsd          xmm2, qword ptr [rdi + 16]
        vmovsd          xmm3, qword ptr [rdi + 24]
        vinsertf128     ymm3, ymm0, xmm3, 1
        vinsertf128     ymm1, ymm0, xmm1, 1
        vperm2f128      ymm1, ymm1, ymm3, 49
        vinsertf128     ymm0, ymm0, xmm2, 1
        vunpcklpd       ymm0, ymm0, ymm1
        ret
```
I also tried casting to an aligned structure (roughly like the sketch below), which I would prefer not to do. While that resulted in a single instruction, it was still a `vmovups` despite the alignment. See https://godbolt.org/z/TdTarrKsK for the full example.
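
What I mean by the cast is roughly the following (a simplified sketch reusing the `vec` and `Indices` helpers from above; `aligned_block` and `load3` are illustrative names, the full version is in the Godbolt link):

```
// Illustrative sketch of the aligned-structure cast.
template<typename T, int N, int A>
struct alignas(alignof(T) * A) aligned_block
{
    T data[N];
};

template<int A, typename T, int N>
auto load3(T* p) -> vec<T, N>
{
    auto const* b = reinterpret_cast<aligned_block<T, N, A> const*>(p);
    auto const [...i] = Indices<N>();
    return vec<T, N>{ b->data[i]... };
}
```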

This isn't a pressing issue, as there is a workaround by using a separate variable (see the sketch below), but the inconsistency was surprising.
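
For completeness, the workaround looks roughly like this (a minimal sketch reusing the `vec` and `Indices` helpers from above; `load2` is just an illustrative name):

```
template<int A, typename T, int N>
auto load2(T* p) -> vec<T, N>
{
    // Binding the result of std::assume_aligned to a named variable first
    // is enough for Clang to emit the expected single aligned load.
    T* ap = std::assume_aligned<alignof(T) * A>(p);
    auto const [...i] = Indices<N>();
    return vec<T, N>{ ap[i]... };
}
```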
