Issue | 156206
Summary | [Clang][x86] Bad codegen for assume_aligned with structured binding packs
Labels | clang
Assignees |
Reporter | kaimfrai
I've been trying to reduce intrinsic-dependent code for SIMD and rely more on compiler optimizations. With structured binding packs this becomes very terse, easy to read, and generic. In doing so I noticed the following: Clang does not emit an aligned load for `std::assume_aligned` when it is used with a pack of indices:
```
#include <array>
#include <memory>   // std::assume_aligned
#include <numeric>  // std::iota

template<typename T, int N>
using vec [[gnu::vector_size(sizeof(T) * N)]] = T;

template<int N>
consteval auto Indices()
{
    std::array<int, N> arr;
    std::iota(arr.begin(), arr.end(), 0);
    return arr;
}

template<int A, typename T, int N>
auto load1(T* p) -> vec<T, N>
{
    auto const [...i] = Indices<N>();
    return vec<T, N>{ std::assume_aligned<alignof(T) * A>(p)[i]... };
}

auto a1 = load1<16, float, 16>;  // aligned:   expect vmovaps
auto u1 = load1<1, float, 16>;   // unaligned: expect vmovups
```
With `-std=c++26 -O3 -mavx512f -mavx512vl` I would expect a `vmovaps` and a `vmovups` respectively; instead it becomes this:
```
vmovsd xmm0, qword ptr [rdi]
vmovsd xmm1, qword ptr [rdi + 8]
vmovsd xmm2, qword ptr [rdi + 16]
vmovsd xmm3, qword ptr [rdi + 32]
vinsertf128 ymm0, ymm0, xmm2, 1
vbroadcastsd ymm2, qword ptr [rdi + 24]
vunpcklpd ymm0, ymm0, ymm1
vblendpd ymm0, ymm0, ymm2, 8
vinsertf32x4 zmm0, zmm0, xmm3, 2
vmovsd xmm1, qword ptr [rdi + 40]
vmovapd zmm2, zmmword ptr [rip + .LCPI0_0]
vpermi2pd zmm2, zmm0, zmm1
vmovsd xmm0, qword ptr [rdi + 48]
vinsertf32x4 zmm1, zmm2, xmm0, 3
vmovsd xmm2, qword ptr [rdi + 56]
vmovapd zmm0, zmmword ptr [rip + .LCPI0_1]
vpermi2pd zmm0, zmm1, zmm2
ret
```
Even when written by hand without a pack, the result is the same. In that case you would most likely use a separate variable anyway, which makes the issue go away, so this is mostly about the convenience of writing it in a single expression. GCC, however, optimizes it as expected. Up to Clang 16 in AVX2 mode this also optimized to a single instruction, but as soon as `-mavx512vl` is set, even with a vector of only 8 floats, the codegen suddenly becomes much worse:
```
vmovsd xmm0, qword ptr [rdi]
vmovsd xmm1, qword ptr [rdi + 8]
vmovsd xmm2, qword ptr [rdi + 16]
vmovsd xmm3, qword ptr [rdi + 24]
vinsertf128 ymm3, ymm0, xmm3, 1
vinsertf128 ymm1, ymm0, xmm1, 1
vperm2f128 ymm1, ymm1, ymm3, 49
vinsertf128 ymm0, ymm0, xmm2, 1
vunpcklpd ymm0, ymm0, ymm1
ret
```
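For reference, this is roughly what the hand-written, single-expression variant looks like (a minimal sketch; `load_manual` is an illustrative name, and it assumes the `vec` alias and includes from the first snippet):
```
// Hand-written 8-float case, still a single expression; per the
// behavior described above, the codegen is the same as with the pack.
auto load_manual(float* p) -> vec<float, 8>
{
    return vec<float, 8>{
        std::assume_aligned<32>(p)[0], std::assume_aligned<32>(p)[1],
        std::assume_aligned<32>(p)[2], std::assume_aligned<32>(p)[3],
        std::assume_aligned<32>(p)[4], std::assume_aligned<32>(p)[5],
        std::assume_aligned<32>(p)[6], std::assume_aligned<32>(p)[7],
    };
}
```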
I also tried casting to an aligned structure, which I would prefer not to do. While that did result in a single instruction, it was still a `vmovups` despite the alignment. See https://godbolt.org/z/TdTarrKsK for the full example.
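The aligned-structure attempt looked roughly like this (a sketch of the idea only; the exact code is in the Godbolt link above, and `aligned_vec`/`load_cast` are illustrative names):
```
// Over-aligned wrapper; the reinterpret_cast is why I would prefer
// to avoid this approach.
template<typename T, int N>
struct alignas(sizeof(T) * N) aligned_vec
{
    vec<T, N> v;
};

template<int A, typename T, int N>
auto load_cast(T* p) -> vec<T, N>
{
    // Single load, but Clang still emits an unaligned vmovups here.
    return reinterpret_cast<aligned_vec<T, N> const*>(p)->v;
}
```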
This isn't a pressing issue, as there is a workaround using a separate variable, but the inconsistency was surprising.
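For completeness, the workaround is to bind the `std::assume_aligned` result to a named variable before indexing (a minimal sketch; `load_workaround` is an illustrative name):
```
template<int A, typename T, int N>
auto load_workaround(T* p) -> vec<T, N>
{
    auto const [...i] = Indices<N>();
    // Naming the aligned pointer first is enough for Clang to emit
    // the expected aligned load.
    T* q = std::assume_aligned<alignof(T) * A>(p);
    return vec<T, N>{ q[i]... };
}
```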