[Bug libstdc++/115454] New: std::experimental::find_last_set is buggy on x86-64-v4

2024-06-11 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115454

Bug ID: 115454
   Summary: std::experimental::find_last_set is buggy on x86-64-v4
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lee.imple at gmail dot com
  Target Milestone: ---

We are using a simd with 4 uint64_t elements (with `deduce_t`) on x86-64-v4.
We are trying to find the last element whose value is not -1 (i.e. not all 1s).

The code is as follows (available at https://godbolt.org/z/3f1nszf8E ),
compiled with the options `-O3 -march=x86-64-v4 -std=c++20`.

```c++
#include <cstdint>
#include <experimental/simd>

// ABI deduced for 4 x uint64_t; passed in a vector register
using deduce_t_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::deduce_t<std::uint64_t, 4>
>;
// fixed_size ABI; passed via memory
using fixed_size_element = std::experimental::simd<
    std::uint64_t,
    std::experimental::simd_abi::fixed_size<4>
>;

int f_deduce(deduce_t_element e) {
    return find_last_set(e != -1);
}

int f_fixed_size(fixed_size_element e) {
    return find_last_set(e != -1);
}
```

G++ trunk gives the following assembly (the comments are mine).

```asm
f_deduce(std::experimental::parallelism_v2::simd >):
  vpcmpeqd %ymm1, %ymm1, %ymm1
  movl $-1, %eax
  vpcmpq $4, %ymm1, %ymm0, %k0
  kmovb %k0, %edx
  orb $-16, %dl      # %dl |= 0xf0
  je .L1
  movzbl %dl, %edx
  movl $31, %eax
  lzcntl %edx, %edx  # the leading-zero count is always 24,
                     # because bits 4-7 of %dl are always set
  subl %edx, %eax    # so the result is always 31 - 24 = 7
.L1:
  ret
f_fixed_size(std::experimental::parallelism_v2::simd >):
  vpcmpeqd %ymm0, %ymm0, %ymm0
  movl $63, %edx
  vpcmpq $4, (%rdi), %ymm0, %k0
  kmovb %k0, %eax
  andl $15, %eax
  lzcntq %rax, %rax
  subl %eax, %edx
  movl %edx, %eax
  vzeroupper
  ret
```

In fact, the first function returns 7 regardless of its argument, which is
more obvious from clang++'s output.

```asm
f_deduce(std::experimental::parallelism_v2::simd>): # @f_deduce(std::experimental::parallelism_v2::simd>)
  movl $7, %eax
  retq
f_fixed_size(std::experimental::parallelism_v2::simd>): # @f_fixed_size(std::experimental::parallelism_v2::simd>)
  vpcmpeqd %ymm0, %ymm0, %ymm0
  vpcmpeqq (%rdi), %ymm0, %k0
  kmovd %k0, %eax
  xorb $15, %al
  movzbl %al, %eax
  lzcntq %rax, %rcx
  movl $63, %eax
  subl %ecx, %eax
  vzeroupper
  retq
```

I don't know why, but the code generated for the `fixed_size` version is
different.
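
To make the difference between the two sequences concrete, here is a minimal scalar sketch of the mask arithmetic. It is not libstdc++ code. The names `lzcnt32`, `last_set_or_f0` and `last_set_and_0f` are made up for illustration, and the 8-bit mask stands in for the value that `kmovb` copies out of `%k0`.

```cpp
#include <cstdint>

// Portable stand-in for lzcnt on a 32-bit value (returns 32 for an input of 0).
constexpr int lzcnt32(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1) {
        ++n;
    }
    return n;
}

// Mirrors the f_deduce sequence: `orb $-16` sets bits 4-7 before counting,
// so bit 7 is always the highest set bit.
constexpr int last_set_or_f0(std::uint8_t mask) {
    std::uint32_t m = static_cast<std::uint8_t>(mask | 0xf0);
    return 31 - lzcnt32(m);
}

// Mirrors the masking in f_fixed_size: `andl $15` clears the unused bits 4-7,
// so only the four real lanes take part in the count.
constexpr int last_set_and_0f(std::uint8_t mask) {
    std::uint32_t m = mask & 0x0fu;
    return 31 - lzcnt32(m);
}
```

With a mask of `0b0101` (lanes 0 and 2 set), the OR variant yields 7 while the AND variant yields 2, which is the index one would expect from `find_last_set` on a 4-element mask.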

[Bug tree-optimization/114966] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-05-06 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

--- Comment #1 from Imple Lee  ---
This is probably a regression.
GCC 13.2 can generate optimal code.

See https://godbolt.org/z/4n8ovr7jr .

[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-05-06 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #8 from Imple Lee  ---
I tried another way to permute the register.
Although GCC does generate simd instructions, the generated code is
sub-optimal.
I opened PR114966 for that.

[Bug tree-optimization/114966] New: fails to optimize avx2 in-register permute written with std::experimental::simd

2024-05-06 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

Bug ID: 114966
   Summary: fails to optimize avx2 in-register permute written
with std::experimental::simd
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lee.imple at gmail dot com
  Target Milestone: ---

This is actually another attempt to permute a simd register with
std::experimental::simd, like PR114908, but written differently.

The following is the same function written with both std::experimental::simd
and the GNU vector extension (available online at
https://godbolt.org/z/n3WvqcePo ).
The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <cstdint>
#include <experimental/simd>
namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template <std::size_t N>
using simd_of = std::experimental::simd<data_t, std::experimental::simd_abi::deduce_t<data_t, N>>;
using simd_t = simd_of<data_size>;

// stdx version
simd_t permute_simd(simd_t data) {
    return simd_t([=](auto i) -> data_t {
        // For i == 0 the index wraps around and fails the bounds check below.
        constexpr size_t index = i - 1;
        if constexpr (index < data_size) {
            return data[index];
        } else {
            return 0;
        }
    });
}

typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, vector_t{0}, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`.
Although the two functions should behave identically, the assembly GCC
generates for them is very different.

```asm
permute_simd(std::experimental::parallelism_v2::simd >):
  vmovq %xmm0, %rax
  vpsrldq $8, %xmm0, %xmm1
  vextracti128 $0x1, %ymm0, %xmm0
  vpunpcklqdq %xmm0, %xmm1, %xmm1
  vpxor %xmm0, %xmm0, %xmm0
  vpinsrq $1, %rax, %xmm0, %xmm0
  vinserti128 $0x1, %xmm1, %ymm0, %ymm0
  ret
permute_vector(unsigned long __vector(4)):
  vpxor %xmm1, %xmm1, %xmm1
  vpermq $144, %ymm0, %ymm0
  vpblendd $3, %ymm1, %ymm0, %ymm0
  ret
```

However, Clang optimizes `permute_simd` into the same assembly as
`permute_vector`, so I think this is a missed optimization in GCC rather than
a bug in std::experimental::simd.

```asm
permute_simd(std::experimental::parallelism_v2::simd >): # @permute_simd(std::experimental::parallelism_v2::simd >)
  vpermpd $144, %ymm0, %ymm0        # ymm0 = ymm0[0,0,1,2]
  vxorps %xmm1, %xmm1, %xmm1
  vblendps $3, %ymm1, %ymm0, %ymm0  # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
  retq
permute_vector(unsigned long __vector(4)): # @permute_vector(unsigned long __vector(4))
  vpermpd $144, %ymm0, %ymm0        # ymm0 = ymm0[0,0,1,2]
  vxorps %xmm1, %xmm1, %xmm1
  vblendps $3, %ymm1, %ymm0, %ymm0  # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
  retq
```
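
For readers decoding the immediates in the three-instruction sequence, here is a rough scalar model of `vpermq $144` followed by `vpblendd $3` with a zero vector. It is only a sketch for checking the lane arithmetic. The names `vec4`, `permq`, `blendd` and `permute_model` are invented for this example and are not compiler or library APIs.

```cpp
#include <cstdint>

// Four 64-bit lanes, lane[0] being the lowest element of the register.
struct vec4 {
    std::uint64_t lane[4];
};

// Model of vpermq: result lane i takes source lane (imm >> 2*i) & 3.
constexpr vec4 permq(const vec4& src, unsigned imm) {
    vec4 r{};
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = src.lane[(imm >> (2 * i)) & 3];
    }
    return r;
}

// Model of vpblendd: each of the eight 32-bit halves comes from b when the
// corresponding bit of imm is set, and from a otherwise.
constexpr vec4 blendd(const vec4& a, const vec4& b, unsigned imm) {
    vec4 r{};
    for (int i = 0; i < 4; ++i) {
        std::uint64_t lo = ((imm >> (2 * i)) & 1) ? b.lane[i] : a.lane[i];
        std::uint64_t hi = ((imm >> (2 * i + 1)) & 1) ? b.lane[i] : a.lane[i];
        r.lane[i] = (lo & 0xffffffffu) | (hi & ~std::uint64_t{0xffffffffu});
    }
    return r;
}

// [w, x, y, z] -> [0, w, x, y]: 144 == 0b10010000 selects lanes [0, 0, 1, 2],
// and the blend mask 3 replaces the lowest 64-bit lane with zero.
constexpr vec4 permute_model(const vec4& data) {
    return blendd(permq(data, 144), vec4{}, 3);
}
```

Feeding `{1, 2, 3, 4}` through `permute_model` gives `{0, 1, 2, 3}`, matching the intended permutation.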

[Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd

2024-05-01 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

Bug ID: 114908
   Summary: fails to optimize avx2 in-register permute written
with std::experimental::simd
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lee.imple at gmail dot com
  Target Milestone: ---

I am trying to write simd code with std::experimental::simd.
Here is the same function written with both std::experimental::simd and the
GNU vector extension (available online at https://godbolt.org/z/dc169rY3o ).
The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <cstdint>
#include <experimental/simd>
namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template <std::size_t N>
using simd_of = std::experimental::simd<data_t, std::experimental::simd_abi::deduce_t<data_t, N>>;
using simd_t = simd_of<data_size>;

template <std::size_t N>
constexpr simd_of<N> zero = {};

// stdx version
simd_t permute_simd(simd_t data) {
    // carry holds the first three elements; the last element is dropped
    auto [carry, _] = split<data_size - 1, 1>(data);
    return concat(zero<1>, carry);
}

typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];
constexpr vector_t zero_v = {0};

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, zero_v, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`.
Although the two functions should behave identically, the assembly GCC
generates for them is very different.

```asm
permute_simd(std::experimental::parallelism_v2::simd >):
  pushq %rbp
  vpxor %xmm1, %xmm1, %xmm1
  movq %rsp, %rbp
  andq $-32, %rsp
  subq $8, %rsp
  vmovdqa %ymm0, -120(%rsp)
  vmovdqa %ymm1, -56(%rsp)
  movq -104(%rsp), %rax
  vmovdqa %xmm0, -56(%rsp)
  movq -48(%rsp), %rdx
  movq $0, -88(%rsp)
  movq %rax, -40(%rsp)
  movq -56(%rsp), %rax
  vmovdqa -56(%rsp), %ymm2
  vmovq %rax, %xmm0
  vmovdqa %ymm2, -24(%rsp)
  movq -8(%rsp), %rax
  vpinsrq $1, %rdx, %xmm0, %xmm0
  vmovdqu %xmm0, -80(%rsp)
  movq %rax, -64(%rsp)
  vmovdqa -88(%rsp), %ymm0
  leave
  ret
permute_vector(unsigned long __vector(4)):
  vpxor %xmm1, %xmm1, %xmm1
  vpermq $144, %ymm0, %ymm0
  vpblendd $3, %ymm1, %ymm0, %ymm0
  ret
```

However, Clang optimizes `permute_simd` into the same assembly as
`permute_vector`, so I think this is a missed optimization in GCC rather than
a bug in std::experimental::simd.

```asm
permute_simd(std::experimental::parallelism_v2::simd >): # @permute_simd(std::experimental::parallelism_v2::simd >)
  vpermpd $144, %ymm0, %ymm0        # ymm0 = ymm0[0,0,1,2]
  vxorps %xmm1, %xmm1, %xmm1
  vblendps $3, %ymm1, %ymm0, %ymm0  # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
  retq
permute_vector(unsigned long __vector(4)): # @permute_vector(unsigned long __vector(4))
  vpermpd $144, %ymm0, %ymm0        # ymm0 = ymm0[0,0,1,2]
  vxorps %xmm1, %xmm1, %xmm1
  vblendps $3, %ymm1, %ymm0, %ymm0  # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
  retq
```
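
As a plain reference for what the split/concat formulation is meant to compute, the following scalar sketch drops the last lane and prepends a zero lane. It does not use std::experimental::simd at all. `lanes`, `first_lanes`, `concat_lanes` and `permute_reference` are hypothetical helpers written only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// N 64-bit lanes stored in a plain array.
template <std::size_t N>
struct lanes {
    std::uint64_t v[N];
};

// Scalar stand-in for the `carry` half of the split: the first K lanes of x.
template <std::size_t K, std::size_t N>
constexpr lanes<K> first_lanes(const lanes<N>& x) {
    static_assert(K <= N, "cannot take more lanes than exist");
    lanes<K> r{};
    for (std::size_t i = 0; i < K; ++i) {
        r.v[i] = x.v[i];
    }
    return r;
}

// Scalar stand-in for concat: the lanes of a followed by the lanes of b.
template <std::size_t A, std::size_t B>
constexpr lanes<A + B> concat_lanes(const lanes<A>& a, const lanes<B>& b) {
    lanes<A + B> r{};
    for (std::size_t i = 0; i < A; ++i) {
        r.v[i] = a.v[i];
    }
    for (std::size_t i = 0; i < B; ++i) {
        r.v[A + i] = b.v[i];
    }
    return r;
}

// [w, x, y, z] -> [0, w, x, y]
constexpr lanes<4> permute_reference(const lanes<4>& data) {
    return concat_lanes(lanes<1>{}, first_lanes<3>(data));
}
```

A reference like this can be handy for checking any vectorized variant element by element.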

[Bug libstdc++/114417] std::experimental::simd is not a POD (by ABI definitions) and is always passed by reference instead of by value

2024-04-22 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #11 from Imple Lee  ---
> What you want to use instead is std::experimental::simd_abi::deduce_t. 
> That'll give you a not-fixed_size ABI if one exists. And those will likely be 
> passed via registers (as long as the psABI allows).

Great! It does work as intended. Thank you for telling me that.
Maybe all I need is just to read the docs on cppref more carefully :|

[Bug libstdc++/114417] std::experimental::simd is not a POD (by ABI definitions) and is always passed by reference instead of by value

2024-03-21 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #7 from Imple Lee  ---
I dug into the source code, and it seems that it was designed to be
"passed via the stack". I am not sure whether this is required by the
specification (I did not find any relevant requirement, but I am not very
familiar with it) or is just an implementation choice.

In GCC git tree [libstdc++-v3/include/experimental/bits/simd_fixed_size.h, line
27](https://gcc.gnu.org/git?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/experimental/bits/simd_fixed_size.h;h=408855212979cc32699db0805079ac74f495a8fa;hb=HEAD#l27):

...
  * The fixed_size ABI gives the following guarantees:
  *  - simd objects are passed via the stack
...

[Bug target/114417] simd parameters are passed by memory on x64 , not using the available sse registers

2024-03-21 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #3 from Imple Lee  ---
Oh, I didn't make it clear. I am describing libstdc++'s std::experimental::simd
class.

[Bug target/114417] simd parameters are passed by memory on x64 , not using the available sse registers

2024-03-21 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

--- Comment #2 from Imple Lee  ---
(In reply to Andrew Pinski from comment #1)
> I doubt this can change since this is the abi gcc decided on a long time ago.

If we implement the simd class as a wrapper around a vector, the parameter can
still be passed in SSE registers, so I think there may be an issue in
libstdc++'s implementation of stdx::simd.

https://godbolt.org/z/a6s67zzc7
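
The wrapper idea can be sketched roughly as follows. This is not the code behind the Godbolt link. `f32x4`, `wrapped_f32x4` and `add` are names chosen for this example, and it relies on the GNU `vector_size` extension (GCC and Clang). The assumption is that a trivially copyable struct whose only member is a 16-byte vector is eligible for register passing under the x86-64 psABI. The static assertions only check the type properties, not the calling convention itself.

```cpp
#include <type_traits>

// GNU vector extension: four floats in one 16-byte vector.
typedef float f32x4 __attribute__((vector_size(16)));

// Hypothetical thin wrapper with the vector as its only data member.
struct wrapped_f32x4 {
    f32x4 v;
};

// The wrapper stays trivially copyable and adds no padding around the vector.
static_assert(std::is_trivially_copyable<wrapped_f32x4>::value,
              "wrapper must be trivially copyable");
static_assert(sizeof(wrapped_f32x4) == sizeof(f32x4),
              "wrapper must not add padding");

// Takes and returns the wrapper by value; inspect the generated assembly to
// see whether the arguments arrive in %xmm0 and %xmm1.
inline wrapped_f32x4 add(wrapped_f32x4 a, wrapped_f32x4 b) {
    return {a.v + b.v};
}
```

Compiling this with `-O2` and looking at `add` is a quick way to compare against a function that takes a stdx::simd argument.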

[Bug libstdc++/114417] New: simd parameters are passed by memory on x64 , not using the available sse registers

2024-03-21 Thread lee.imple at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114417

Bug ID: 114417
   Summary: simd parameters are passed by memory on x64 , not
using the available sse registers
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lee.imple at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/3GYnadqc1

In the current implementation, SIMD parameters are passed in memory, while the
equivalent vector parameters are passed in SSE registers. If the equivalent
vector parameters can be passed in SSE registers, can we use SSE registers for
SIMD parameters as well?

The performance difference may not be significant, but I would like to keep
everything in registers.