https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115454

            Bug ID: 115454
           Summary: std::experimental::find_last_set is buggy on x86-64-v4
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

We are using a simd with 4 uint64_t elements (with `deduce_t`) on x86-64-v4.
We are trying to find the last element whose value is not -1 (i.e. not all 1s).

The code is as follows (available at https://godbolt.org/z/3f1nszf8E ),
compiled with the options `-O3 -march=x86-64-v4 -std=c++20`.

```c++
#include <experimental/simd>
#include <cstdint>

using deduce_t_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::deduce_t<std::uint64_t, 4>
>;

using fixed_size_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::fixed_size<4>
>;

int f_deduce(deduce_t_element e) {
    return find_last_set(e != -1);
}

int f_fixed_size(fixed_size_element e) {
    return find_last_set(e != -1);
}
```

G++ trunk gives the following assembly (comments added by me).

```asm
f_deduce(std::experimental::parallelism_v2::simd<unsigned long, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<32> >):
        vpcmpeqd        %ymm1, %ymm1, %ymm1
        movl    $-1, %eax
        vpcmpq  $4, %ymm1, %ymm0, %k0
        kmovb   %k0, %edx
        orb     $-16, %dl       # %dl |= 0xf0
        je      .L1
        movzbl  %dl, %edx
        movl    $31, %eax
        lzcntl  %edx, %edx      # leading-zero count is always 24,
                                # because the high nibble of %dl is always 0b1111
        subl    %edx, %eax      # so the result is always 31 - 24 = 7
.L1:
        ret
f_fixed_size(std::experimental::parallelism_v2::simd<unsigned long, std::experimental::parallelism_v2::simd_abi::_Fixed<4> >):
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        movl    $63, %edx
        vpcmpq  $4, (%rdi), %ymm0, %k0
        kmovb   %k0, %eax
        andl    $15, %eax
        lzcntq  %rax, %rax
        subl    %eax, %edx
        movl    %edx, %eax
        vzeroupper
        ret
```

In fact, the first function returns 7 regardless of its argument, which is
more obvious from clang++'s output.

```asm
f_deduce(std::experimental::parallelism_v2::simd<unsigned long, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<32>>): # @f_deduce(std::experimental::parallelism_v2::simd<unsigned long, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<32>>)
        movl    $7, %eax
        retq
f_fixed_size(std::experimental::parallelism_v2::simd<unsigned long, std::experimental::parallelism_v2::simd_abi::_Fixed<4>>): # @f_fixed_size(std::experimental::parallelism_v2::simd<unsigned long, std::experimental::parallelism_v2::simd_abi::_Fixed<4>>)
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vpcmpeqq        (%rdi), %ymm0, %k0
        kmovd   %k0, %eax
        xorb    $15, %al
        movzbl  %al, %eax
        lzcntq  %rax, %rcx
        movl    $63, %eax
        subl    %ecx, %eax
        vzeroupper
        retq
```

I don't know why, but the compiled result for `fixed_size_simd` seems to be
different.
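
For reference, the arithmetic of the two G++ sequences can be modelled in scalar code. This is only a sketch of the instruction sequences quoted above, not libstdc++'s implementation: `leading_zeros`, `model_deduce` and `model_fixed_size` are hypothetical names introduced here, and the hand-written leading-zero count stands in for `lzcnt` (assumed to return the operand width for a zero input). `mask` carries one bit per lane in bits 0..3.

```c++
#include <cstdint>

// Hypothetical stand-in for lzcnt: number of leading zero bits in x,
// equal to the full operand width (32 or 64) when x == 0.
template <typename T>
int leading_zeros(T x) {
    const int width = static_cast<int>(sizeof(T) * 8);
    int n = 0;
    for (int bit = width - 1; bit >= 0 && ((x >> bit) & 1u) == 0; --bit) {
        ++n;
    }
    return n;
}

// Models the f_deduce sequence: the 4-bit lane mask is OR-ed with 0xf0
// before the leading-zero count, so bit 7 is set for every input.
int model_deduce(unsigned mask) {
    std::uint8_t dl = static_cast<std::uint8_t>(mask | 0xF0u);  // orb $-16, %dl
    if (dl == 0) return -1;                                     // je .L1 (never taken)
    std::uint32_t edx = dl;                                     // movzbl %dl, %edx
    return 31 - leading_zeros(edx);                             // lzcntl, subl
}

// Models the f_fixed_size sequence: the mask is AND-ed with 0x0f, so only
// the four real lanes take part in the leading-zero count.
int model_fixed_size(unsigned mask) {
    std::uint64_t rax = mask & 0x0Fu;                           // andl $15, %eax
    return 63 - leading_zeros(rax);                             // lzcntq, subl
}
```

Under these assumptions `model_deduce(m)` is 7 for every `m` in 0..15, while `model_fixed_size(m)` gives the index of the highest set lane (for example 2 for `m == 0b0101`) and -1 for `m == 0`, a case that, as far as I understand the TS, is outside the precondition of `find_last_set` anyway. The difference between the two sequences is the `or 0xf0` versus the `and 0x0f` applied to the mask.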