https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918
--- Comment #14 from Cory Fields <lists at coryfields dot com> ---
(In reply to Jakub Jelinek from comment #13)
> (In reply to Cory Fields from comment #11)
> Note, if you really want the permutation rather than shifts, you can always
> use __builtin_shuffle for the BITS % 8 == 0 case as a workaround. But it
> will need to be one with casts to unsigned char
> __attribute__((__vector_size__(32))) and 32 indices, always (4 - BITS / 8) &
> 3, (5 - BITS / 8) & 3, (6 - BITS / 8) & 3, (7 - BITS / 8) & 3 (repeated 8
> times).
> I'm not sure it is a good idea to announce availability of rotlv8si3 and
> similar patterns when there is no HW instruction for those and through
> permutations we can only handle a subset of those (constant shift count
> divisible by 8), and in fact only a subset of those and not all the
> permutations are supported for all the ISAs.
> Anyway, not a regression, so this will need to wait for GCC 17.
Yes, definitely not a regression.
The above does indeed work for avx2, thanks. Essentially:
if constexpr (BITS == 16) {
    temp.u8 = __builtin_shufflevector(temp.u8, temp.u8,
         2, 3, 0, 1, 6, 7, 4, 5,10,11, 8, 9,14,15,12,13,
        18,19,16,17,22,23,20,21,26,27,24,25,30,31,28,29);
} else if constexpr (BITS == 8) {
    temp.u8 = __builtin_shufflevector(temp.u8, temp.u8,
         3, 0, 1, 2, 7, 4, 5, 6,11, 8, 9,10,15,12,13,14,
        19,16,17,18,23,20,21,22,27,24,25,26,31,28,29,30);
}
But sadly, it's not generic. Without avx2 enabled, gcc emits a slow
byte-for-byte shuffle that doesn't use sse2.
What I'm hoping to achieve is an optimal 8-bit and 16-bit rotate for any
architecture. For what it's worth, clang is capable of this already. Whether
building for generic x86-64, avx, avx2, or avx512vl, it is able to generate the
optimal shifts from:
(vec << BITS) | (vec >> (32 - BITS))
It also supports a more explicit:
vec = __builtin_elementwise_fshl(vec, vec, vec256{BITS, BITS, BITS, BITS, BITS,
                                                  BITS, BITS, BITS});
which results in the same output.
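
As a concrete test case, the shift-and-or form can be wrapped up roughly like this. Again vec256 and rotl_shift are illustrative names rather than anything from a real header, and the sketch relies on the GNU vector extension that broadcasts a scalar shift count across the lanes:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name for illustration: 8 lanes of 32-bit unsigned integers.
typedef uint32_t vec256 __attribute__((__vector_size__(32)));

// Generic per-lane rotate left written as two shifts and an OR.
// The scalar shift counts are broadcast to every lane by the vector
// extension. BITS must be in 1..31 so that neither shift is by 0 or 32.
template <unsigned BITS>
vec256 rotl_shift(vec256 v)
{
    static_assert(BITS > 0 && BITS < 32, "rotate count must be in 1..31");
    return (v << BITS) | (v >> (32 - BITS));
}
```

This version is correct on any target. The open question is only whether the compiler recognizes the idiom and picks the best instruction sequence for BITS == 8 and BITS == 16.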
Is there any path to gcc gaining generic vectorized rotation, with rotate
detection that matches clang's? This is the last roadblock that prevents
an ideal generic vectorized ChaCha20 implementation.