[Bug libstdc++/115454] New: std::experimental::find_last_set is buggy on x86-64-v4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115454

            Bug ID: 115454
           Summary: std::experimental::find_last_set is buggy on x86-64-v4
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

We are using a simd with 4 uint64_t elements (with `deduce_t`) on x86-64-v4.
We are trying to find the last element with the value -1 (i.e. all 1s).

The code is as follows (available at https://godbolt.org/z/3f1nszf8E ), compiled with the options `-O3 -march=x86-64-v4 -std=c++20`.

```c++
#include <experimental/simd>
#include <cstdint>

using deduce_t_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::deduce_t<std::uint64_t, 4>
>;
using fixed_size_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::fixed_size<4>
>;

int f_deduce(deduce_t_element e) {
    return find_last_set(e != -1);
}
int f_fixed_size(fixed_size_element e) {
    return find_last_set(e != -1);
}
```

G++ trunk gives the following assembly (comments added by me).

```asm
f_deduce(std::experimental::parallelism_v2::simd >):
        vpcmpeqd        %ymm1, %ymm1, %ymm1
        movl    $-1, %eax
        vpcmpq  $4, %ymm1, %ymm0, %k0
        kmovb   %k0, %edx
        orb     $-16, %dl       # %dl |= 0xf0
        je      .L1
        movzbl  %dl, %edx
        movl    $31, %eax
        lzcntl  %edx, %edx      # leading zeros is 24
                                # because the next byte is always 0b
        subl    %edx, %eax      # so the result is 31 - 24 = 7
.L1:
        ret
f_fixed_size(std::experimental::parallelism_v2::simd >):
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        movl    $63, %edx
        vpcmpq  $4, (%rdi), %ymm0, %k0
        kmovb   %k0, %eax
        andl    $15, %eax
        lzcntq  %rax, %rax
        subl    %eax, %edx
        movl    %edx, %eax
        vzeroupper
        ret
```

In fact, the first function always returns 7 regardless of its argument, which is more obvious from clang++'s output.
```asm
f_deduce(std::experimental::parallelism_v2::simd>): # @f_deduce(std::experimental::parallelism_v2::simd>)
        movl    $7, %eax
        retq
f_fixed_size(std::experimental::parallelism_v2::simd>): # @f_fixed_size(std::experimental::parallelism_v2::simd>)
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vpcmpeqq        (%rdi), %ymm0, %k0
        kmovd   %k0, %eax
        xorb    $15, %al
        movzbl  %al, %eax
        lzcntq  %rax, %rcx
        movl    $63, %eax
        subl    %ecx, %eax
        vzeroupper
        retq
```

I don't know why, but the compiled result for `fixed_size_simd` seems to be different.
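To make the constant 7 easier to see, here is a minimal scalar sketch of the mask arithmetic in the GCC assembly. The function names are hypothetical and only model the instructions quoted in the comments; they are not libstdc++ code. For brevity, the `f_fixed_size` path is modelled with a 32-bit count instead of the 64-bit `lzcntq`.

```cpp
#include <cstdint>

// Portable stand-in for lzcntl: leading zero bits of a 32-bit value (32 for 0).
constexpr int leading_zeros32(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1) {
        ++n;
    }
    return n;
}

// Hypothetical model of f_deduce as compiled: the 4-lane mask is OR-ed with 0xf0.
constexpr int last_set_as_compiled(std::uint8_t k) {
    const std::uint8_t m = static_cast<std::uint8_t>(k | 0xF0);  // orb $-16, %dl
    if (m == 0) {
        return -1;                                               // je .L1, never taken
    }
    return 31 - leading_zeros32(m);                              // lzcntl + subl
}

// Hypothetical model of the masking in f_fixed_size: keep only the 4 valid lanes.
constexpr int last_set_masked(std::uint8_t k) {
    const std::uint8_t m = static_cast<std::uint8_t>(k & 0x0F);  // andl $15, %eax
    return 31 - leading_zeros32(m);                              // -1 when no lane is set
}
```

With the OR, bit 7 of the mask is always set, so the model returns 7 for every input. Masking with 0x0f instead gives the index of the highest set lane, or -1 when none is set.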
[Bug tree-optimization/114966] fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

--- Comment #1 from Imple Lee ---
This is probably a regression. GCC 13.2 can generate optimal code.
See https://godbolt.org/z/4n8ovr7jr .
[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #8 from Imple Lee ---
I tried another way to permute the register.
Although GCC does generate simd instructions, the generated code is sub-optimal.
I opened PR114966 for that.
[Bug tree-optimization/114966] New: fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

            Bug ID: 114966
           Summary: fails to optimize avx2 in-register permute written with std::experimental::simd
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

This is another attempt to permute a simd register with std::experimental::simd, like PR114908, but written differently.

Below is the same function written both with std::experimental::simd and with the GNU vector extension (available online at https://godbolt.org/z/n3WvqcePo ).
The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template using simd_of = std::experimental::simd>;
using simd_t = simd_of;

// stdx version
simd_t permute_simd(simd_t data) {
    return simd_t([=](auto i) -> data_t {
        constexpr size_t index = i - 1;
        if constexpr (index < data_size) {
            return data[index];
        } else {
            return 0;
        }
    });
}

typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, vector_t{0}, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`.
Although the two functions should be functionally equivalent, the assembly GCC generates for them is very different.
```asm
permute_simd(std::experimental::parallelism_v2::simd >):
        vmovq   %xmm0, %rax
        vpsrldq $8, %xmm0, %xmm1
        vextracti128    $0x1, %ymm0, %xmm0
        vpunpcklqdq     %xmm0, %xmm1, %xmm1
        vpxor   %xmm0, %xmm0, %xmm0
        vpinsrq $1, %rax, %xmm0, %xmm0
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        ret
permute_vector(unsigned long __vector(4)):
        vpxor   %xmm1, %xmm1, %xmm1
        vpermq  $144, %ymm0, %ymm0
        vpblendd        $3, %ymm1, %ymm0, %ymm0
        ret
```

However, Clang can optimize `permute_simd` into the same assembly as `permute_vector`, so I think this is a missed optimization in GCC rather than a bug in std::experimental::simd.

```asm
permute_simd(std::experimental::parallelism_v2::simd >): # @permute_simd(std::experimental::parallelism_v2::simd >)
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
permute_vector(unsigned long __vector(4)): # @permute_vector(unsigned long __vector(4))
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
```
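For readers who want to check the immediates in the short sequence, here is a rough scalar model of `vpermq $144` followed by `vpblendd $3`. The names `lanes64`, `vpermq_model`, `vpblendd_model` and `permute_model` are hypothetical. The functions only follow the documented lane-selection rules of the two instructions and are not how GCC or Clang represent them.

```cpp
#include <array>
#include <cstdint>

using lanes64 = std::array<std::uint64_t, 4>;  // one 256-bit register as 4 qwords

// vpermq: result qword i is the source qword selected by bits [2i+1:2i] of imm8.
constexpr lanes64 vpermq_model(const lanes64& src, unsigned imm8) {
    lanes64 out{};
    for (int i = 0; i < 4; ++i) {
        out[i] = src[(imm8 >> (2 * i)) & 3u];
    }
    return out;
}

// vpblendd: for each of the 8 dwords, bit d of imm8 picks `second` when set, else `first`.
constexpr lanes64 vpblendd_model(const lanes64& first, const lanes64& second, unsigned imm8) {
    lanes64 out{};
    for (int d = 0; d < 8; ++d) {
        const lanes64& from = ((imm8 >> d) & 1u) ? second : first;
        const std::uint64_t dword = (from[d / 2] >> (32 * (d % 2))) & 0xFFFFFFFFu;
        out[d / 2] |= dword << (32 * (d % 2));
    }
    return out;
}

// The three-instruction sequence: zero a register, permute, blend the zero into qword 0.
constexpr lanes64 permute_model(const lanes64& data) {
    return vpblendd_model(vpermq_model(data, 144), lanes64{}, 3);
}
```

The immediate 144 is 0b10'01'00'00, which selects qwords [0, 0, 1, 2]. The blend immediate 3 then replaces the two lowest dwords (the first qword) with zero, giving [0, w, x, y].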
[Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

            Bug ID: 114908
           Summary: fails to optimize avx2 in-register permute written with std::experimental::simd
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

I am trying to write simd code with std::experimental::simd.
Below is the same function written both with std::experimental::simd and with the GNU vector extension (available online at https://godbolt.org/z/dc169rY3o ).
The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template using simd_of = std::experimental::simd>;
using simd_t = simd_of;

template constexpr simd_of zero = {};

// stdx version
simd_t permute_simd(simd_t data) {
    auto [carry, _] = split(data);
    return concat(zero<1>, carry);
}

typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];
constexpr vector_t zero_v = {0};

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, zero_v, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`.
Although the two functions should be functionally equivalent, the assembly GCC generates for them is very different.
```asm
permute_simd(std::experimental::parallelism_v2::simd >):
        pushq   %rbp
        vpxor   %xmm1, %xmm1, %xmm1
        movq    %rsp, %rbp
        andq    $-32, %rsp
        subq    $8, %rsp
        vmovdqa %ymm0, -120(%rsp)
        vmovdqa %ymm1, -56(%rsp)
        movq    -104(%rsp), %rax
        vmovdqa %xmm0, -56(%rsp)
        movq    -48(%rsp), %rdx
        movq    $0, -88(%rsp)
        movq    %rax, -40(%rsp)
        movq    -56(%rsp), %rax
        vmovdqa -56(%rsp), %ymm2
        vmovq   %rax, %xmm0
        vmovdqa %ymm2, -24(%rsp)
        movq    -8(%rsp), %rax
        vpinsrq $1, %rdx, %xmm0, %xmm0
        vmovdqu %xmm0, -80(%rsp)
        movq    %rax, -64(%rsp)
        vmovdqa -88(%rsp), %ymm0
        leave
        ret
permute_vector(unsigned long __vector(4)):
        vpxor   %xmm1, %xmm1, %xmm1
        vpermq  $144, %ymm0, %ymm0
        vpblendd        $3, %ymm1, %ymm0, %ymm0
        ret
```

However, Clang can optimize `permute_simd` into the same assembly as `permute_vector`, so I think this is a missed optimization in GCC rather than a bug in std::experimental::simd.

```asm
permute_simd(std::experimental::parallelism_v2::simd >): # @permute_simd(std::experimental::parallelism_v2::simd >)
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
permute_vector(unsigned long __vector(4)): # @permute_vector(unsigned long __vector(4))
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
```
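As a reference point for checking either version, the intended permutation can be written as a plain scalar loop. This is only a sketch: `shift_in_zero` is a hypothetical helper over `std::array`, not part of std::experimental::simd, and it makes no claim about generated code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference for [w, x, y, z] -> [0, w, x, y]:
// element 0 becomes zero and every other element takes its left neighbour.
template <std::size_t N>
constexpr std::array<std::uint64_t, N>
shift_in_zero(const std::array<std::uint64_t, N>& data) {
    std::array<std::uint64_t, N> out{};  // value-initialised, so out[0] stays 0
    for (std::size_t i = 1; i < N; ++i) {
        out[i] = data[i - 1];
    }
    return out;
}
```

Comparing the output of `permute_simd` and `permute_vector` element by element against such a reference is one way to confirm that the two really compute the same thing before comparing their assembly.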
[Bug libstdc++/114417] std::experimental::simd is not a POD (by ABI definitions) and is always passed by reference instead of by value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #11 from Imple Lee ---
> What you want to use instead is std::experimental::simd_abi::deduce_t.
> That'll give you a not-fixed_size ABI if one exists. And those will likely be
> passed via registers (as long as the psABI allows).

Great! It does work as intended. Thank you for telling me that.
Maybe all I need is just to read the docs on cppref more carefully :|
[Bug libstdc++/114417] std::experimental::simd is not a POD (by ABI definitions) and is always passed by reference instead of by value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #7 from Imple Lee ---
I tried to dig into the source code, and it seems it was designed to be "passed via the stack".
I am not sure whether this is required by the specification (I did not find a relevant requirement, but I am not very familiar with it) or is just an implementation choice.

In the GCC git tree, [libstdc++-v3/include/experimental/bits/simd_fixed_size.h, line 27](https://gcc.gnu.org/git?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/experimental/bits/simd_fixed_size.h;h=408855212979cc32699db0805079ac74f495a8fa;hb=HEAD#l27):

    ...
     * The fixed_size ABI gives the following guarantees:
     *  - simd objects are passed via the stack
    ...
[Bug target/114417] simd parameters are passed by memory on x64 , not using the available sse registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #3 from Imple Lee ---
Oh, I didn't make it clear.
I am describing libstdc++'s std::experimental::simd class.
[Bug target/114417] simd parameters are passed by memory on x64 , not using the available sse registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #2 from Imple Lee ---
(In reply to Andrew Pinski from comment #1)
> I doubt this can change since this is the abi gcc decided on a long time ago.

If we implement the simd class as a wrapper around a vector, the parameter can still be passed in SSE registers, so I think there may be an issue in libstdc++'s implementation of stdx::simd.
https://godbolt.org/z/a6s67zzc7
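To illustrate what "a wrapper around a vector" could look like, here is a minimal sketch using the GNU vector extension. `vec2_t`, `wrapped_vec` and `add` are hypothetical names for illustration, not libstdc++ types, and this is not necessarily the code behind the godbolt link. The assumption is that a trivially copyable struct with a single 16-byte vector member is classified like the bare vector by the x86-64 psABI, so it can travel in an SSE register; the static assertions only check the properties that classification depends on.

```cpp
#include <cstdint>
#include <type_traits>

// 16-byte GNU vector of two 64-bit lanes (fits one SSE register).
typedef std::uint64_t vec2_t [[gnu::vector_size(2 * sizeof(std::uint64_t))]];

// Hypothetical simd-like wrapper: an aggregate holding only the vector.
struct wrapped_vec {
    vec2_t value;

    // Element-wise addition, forwarded to the vector extension.
    friend wrapped_vec operator+(wrapped_vec a, wrapped_vec b) {
        return {a.value + b.value};
    }

    // Read one lane.
    std::uint64_t operator[](int i) const { return value[i]; }
};

// No user-provided copy operations or destructor, and no extra storage.
static_assert(std::is_trivially_copyable<wrapped_vec>::value, "must be trivially copyable");
static_assert(sizeof(wrapped_vec) == sizeof(vec2_t), "wrapper adds no padding");

// Takes and returns the wrapper by value.
wrapped_vec add(wrapped_vec a, wrapped_vec b) { return a + b; }
```

Inspecting the assembly of `add` (for example at `-O2`) is the way to confirm whether the arguments actually arrive in `%xmm0` and `%xmm1` on a given compiler and target.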
[Bug libstdc++/114417] New: simd parameters are passed by memory on x64 , not using the available sse registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

            Bug ID: 114417
           Summary: simd parameters are passed by memory on x64 , not using the available sse registers
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/3GYnadqc1

In the current implementation, SIMD parameters are passed in memory, while the equivalent vector parameters are passed in SSE registers.

If the equivalent vector parameters can be passed in SSE registers, can we use SSE registers for SIMD parameters too?
The performance difference may not be significant, but I would like to keep everything in registers.