https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918
--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Cory Fields from comment #11)
> Confirmed seeing this as well, specifically in a vectorized ChaCha20
> implementation that performs left-rotates of uint32_t values.
>
> Targeting AVX2, Clang optimizes the 8-bit/16-bit shifts to a vpshufb, which
> performs significantly better than vpsrld+vpslld on my hardware.
>
> Minimal reproducer:
>
> using vec256 = unsigned __attribute__((__vector_size__(32)));
>
> template <unsigned BITS>
> void vec_rotl(vec256& vec)
> {
>     vec = (vec << BITS) | (vec >> (32 - BITS));
> }
>
> template void vec_rotl<16>(vec256&);
> template void vec_rotl<8>(vec256&);
> template void vec_rotl<7>(vec256&);
>
> godbolt: https://godbolt.org/z/85j544EEf
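For reference, a minimal sketch of the vpshufb form Clang emits for the
BITS == 16 instantiation (the function name rotl16_shufb is mine, not from
the testcase; the mask bytes follow from the little-endian byte order of
each 32-bit element):

#include <immintrin.h>

/* Rotate each 32-bit element left by 16 via a byte permutation: bytes
   {b0,b1,b2,b3} of each element become {b2,b3,b0,b1}.  vpshufb indexes
   within each 128-bit half, so the same 16-byte pattern is repeated for
   both halves.  */
__m256i rotl16_shufb (__m256i v)
{
  const __m256i mask = _mm256_setr_epi8 (2, 3, 0, 1, 6, 7, 4, 5,
                                         10, 11, 8, 9, 14, 15, 12, 13,
                                         2, 3, 0, 1, 6, 7, 4, 5,
                                         10, 11, 8, 9, 14, 15, 12, 13);
  return _mm256_shuffle_epi8 (v, mask);
}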
Note, if you really want the permutation rather than the shifts, you can
always use __builtin_shuffle for the BITS % 8 == 0 case as a workaround.
But it needs casts to unsigned char __attribute__((__vector_size__(32)))
and 32 indices; since __builtin_shuffle indices refer to the whole vector,
for each 4-byte group g (0 through 7) the four indices are
4 * g + ((4 - BITS / 8) & 3), 4 * g + ((5 - BITS / 8) & 3),
4 * g + ((6 - BITS / 8) & 3) and 4 * g + ((7 - BITS / 8) & 3).
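A minimal sketch of that workaround, assuming the vec256 typedef from the
reproducer above (the bvec256 alias, the vec_rotl_shuffle name and the LANE
helper macro are mine; only __builtin_shuffle and the index formula are from
this comment):

using bvec256 = unsigned char __attribute__((__vector_size__(32)));

template <unsigned BITS>
void vec_rotl_shuffle (vec256& vec)
{
  static_assert (BITS % 8 == 0, "byte permutation needs whole-byte rotates");
  constexpr unsigned K = BITS / 8;
  /* Destination byte i of 4-byte group g reads source byte
     4 * g + ((4 + i - K) & 3) of the same group.  */
#define LANE(G) (unsigned char) (4 * (G) + ((4 + 0 - K) & 3)), \
                (unsigned char) (4 * (G) + ((4 + 1 - K) & 3)), \
                (unsigned char) (4 * (G) + ((4 + 2 - K) & 3)), \
                (unsigned char) (4 * (G) + ((4 + 3 - K) & 3))
  const bvec256 idx = { LANE (0), LANE (1), LANE (2), LANE (3),
                        LANE (4), LANE (5), LANE (6), LANE (7) };
#undef LANE
  vec = (vec256) __builtin_shuffle ((bvec256) vec, idx);
}

With a constant mask GCC should be able to fold this to a single vpshufb per
rotate when targeting AVX2.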
I'm not sure it is a good idea to advertise availability of rotlv8si3 and
similar patterns when there is no HW instruction for them: through
permutations we can only handle a subset of the cases (constant shift
counts divisible by 8), and even within that subset not all the needed
permutations are supported on all ISAs.
Anyway, not a regression, so this will need to wait for GCC 17.