https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918
--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Cory Fields from comment #11)
> Confirmed seeing this as well, specifically in a vectorized ChaCha20
> implementation that performs left-rotates of uint32_t values.
>
> Targeting AVX2, Clang optimizes the 8-bit/16-bit shifts to a vpshufb, which
> performs significantly better than vpsrld+vpslld on my hardware.
>
> Minimal reproducer:
>
> using vec256 = unsigned __attribute__((__vector_size__(32)));
>
> template <unsigned BITS>
> void vec_rotl(vec256& vec)
> {
>     vec = (vec << BITS) | (vec >> (32 - BITS));
> }
>
> template void vec_rotl<16>(vec256&);
> template void vec_rotl<8>(vec256&);
> template void vec_rotl<7>(vec256&);
>
> godbolt: https://godbolt.org/z/85j544EEf
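For reference, a minimal sketch of the vpshufb form Clang emits for the
BITS == 16 instantiation (the function name rotl16_shufb is mine, not from
the testcase; the mask bytes follow from the little-endian byte order of
each 32-bit element):

#include <immintrin.h>

/* Rotate each 32-bit element left by 16 via a byte permutation: bytes
   {b0,b1,b2,b3} of each element become {b2,b3,b0,b1}.  vpshufb indexes
   within each 128-bit half, so the same 16-byte pattern is repeated for
   both halves.  */
__m256i rotl16_shufb (__m256i v)
{
  const __m256i mask = _mm256_setr_epi8 (2, 3, 0, 1, 6, 7, 4, 5,
                                         10, 11, 8, 9, 14, 15, 12, 13,
                                         2, 3, 0, 1, 6, 7, 4, 5,
                                         10, 11, 8, 9, 14, 15, 12, 13);
  return _mm256_shuffle_epi8 (v, mask);
}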
Note, if you really want the permutation rather than the shifts, you can
always use __builtin_shuffle for the BITS % 8 == 0 case as a workaround.
But it needs casts to unsigned char __attribute__((__vector_size__(32)))
and 32 indices; since __builtin_shuffle indices refer to the whole vector,
for each 4-byte group g (0 through 7) the four indices are
4 * g + ((4 - BITS / 8) & 3), 4 * g + ((5 - BITS / 8) & 3),
4 * g + ((6 - BITS / 8) & 3) and 4 * g + ((7 - BITS / 8) & 3).
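A minimal sketch of that workaround, assuming the vec256 typedef from the
reproducer above (the bvec256 alias, the vec_rotl_shuffle name and the LANE
helper macro are mine; only __builtin_shuffle and the index formula are from
this comment):

using bvec256 = unsigned char __attribute__((__vector_size__(32)));

template <unsigned BITS>
void vec_rotl_shuffle (vec256& vec)
{
  static_assert (BITS % 8 == 0, "byte permutation needs whole-byte rotates");
  constexpr unsigned K = BITS / 8;
  /* Destination byte i of 4-byte group g reads source byte
     4 * g + ((4 + i - K) & 3) of the same group.  */
#define LANE(G) (unsigned char) (4 * (G) + ((4 + 0 - K) & 3)), \
                (unsigned char) (4 * (G) + ((4 + 1 - K) & 3)), \
                (unsigned char) (4 * (G) + ((4 + 2 - K) & 3)), \
                (unsigned char) (4 * (G) + ((4 + 3 - K) & 3))
  const bvec256 idx = { LANE (0), LANE (1), LANE (2), LANE (3),
                        LANE (4), LANE (5), LANE (6), LANE (7) };
#undef LANE
  vec = (vec256) __builtin_shuffle ((bvec256) vec, idx);
}

With a constant mask GCC should be able to fold this to a single vpshufb per
rotate when targeting AVX2.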
I'm not sure it is a good idea to advertise availability of rotlv8si3 and
similar patterns when there is no HW instruction for them: through
permutations we can only handle a subset of the cases (constant shift
counts divisible by 8), and even within that subset not all the needed
permutations are supported on all ISAs.
Anyway, not a regression, so this will need to wait for GCC 17.