https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
--- Comment #15 from Steven Munroe <munroesj at gcc dot gnu.org> ---
Found that vec_splat_u32 constant shift counts are handled
differently across the various shift/rotate intrinsics.
Even for the 5-bit shift counts (the easy case) the behavior of the various
shift/rotate intrinsics is inconsistent. The compiler pays way too much
attention to how the shift count is generated, and does so differently for
shift left word versus shift right word, and differently again for rotate
left word.
Any reasonable person would assume that using vec_splat_u32() for any shift
value 1 to 31 (splat immediates -16 to 15, since only the low 5 bits of the
count are used) will generate efficient code. And it does for vec_vslw(),
which generates two instructions (vspltisw v0,-16; vslw v2,v2,v0).
But the compiler behaves differently for vec_vsrw() and vec_vsraw():
- for values 1-15 it generates:
    vspltisw v0,15; vsrw v2,v2,v0
- for even values between 16 and 30 it generates:
    vspltisw v0,8; vadduwm v0,v0,v0; vsrw v2,v2,v0
- for odd values between 17 and 31 it generates a load from .rodata
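
To make the count arithmetic above concrete, here is a minimal scalar sketch in C. It assumes only that the word shift instructions use the low 5 bits of each count element. The helper names (effective_shift_count, splat_immediate_for, doubled_splat_count, check_splat_counts) are made up for illustration, and test_slwi_16 is a hypothetical reconstruction of the kind of test case discussed here, not the actual source.

```c
#include <assert.h>
#include <stdint.h>

/* Count actually used by vslw/vsrw/vsraw: the low 5 bits of the element. */
unsigned effective_shift_count(int32_t element)
{
    return (uint32_t)element & 31u;
}

/* Hypothetical helper: the 5-bit signed splat immediate (-16..15)
   whose low 5 bits equal the wanted count (0..31). */
int splat_immediate_for(unsigned count)
{
    return count < 16 ? (int)count : (int)count - 32;
}

/* Scalar model of "vspltisw v0,imm; vadduwm v0,v0,v0". */
unsigned doubled_splat_count(int imm)
{
    return effective_shift_count(imm + imm);
}

/* Every count 0..31 is reachable with a single splat immediate. */
void check_splat_counts(void)
{
    for (unsigned count = 0; count < 32; count++) {
        int imm = splat_immediate_for(count);
        assert(imm >= -16 && imm <= 15);
        assert(effective_shift_count(imm) == count);
    }
    assert(doubled_splat_count(8) == 16); /* the even-value sequence above */
}

#if defined(__ALTIVEC__)
#include <altivec.h>
/* Hypothetical test case (name assumed): shift each word left by 16. */
vector unsigned int test_slwi_16(vector unsigned int v)
{
    return vec_vslw(v, vec_splat_u32(-16));
}
#endif
```

Under that assumption a single vspltisw covers every count from 0 to 31, so neither the vadduwm doubling nor the .rodata load should be needed for the right-shift forms.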
And positively strange for vec_vrlw():
- for values 1-15 it generates:
    vspltisw v0,15; vrlw v2,v2,v0
- but for any value between 16 and 31 it gets strange:
0000000000001200 <test_rlwi_16>:
1200: 30 00 20 39 li r9,48
1204: 8c 03 00 10 vspltisw v0,0
1208: 67 01 29 7c mtvrd v1,r9
120c: 93 0a 21 f0 xxspltw vs33,vs33,1
1210: 80 0c 00 10 vsubuwm v0,v0,v1
1214: 84 00 42 10 vrlw v2,v2,v0
1218: 20 00 80 4e blr
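
As a sanity check on the listing above, here is a scalar sketch in C of what the sequence appears to compute per word element: it splats 48, subtracts it from zero, and rotates by the result. It assumes vrlw uses only the low 5 bits of each count element. The function names are made up for illustration.

```c
#include <stdint.h>

/* Scalar model of one vrlw element: rotate left by the low 5 bits of count. */
uint32_t rotl32_model(uint32_t value, uint32_t count)
{
    unsigned n = count & 31u;
    return n ? (value << n) | (value >> (32u - n)) : value;
}

/* Count element built by the test_rlwi_16 sequence. */
uint32_t rlwi_16_count_element(void)
{
    uint32_t zero = 0;      /* vspltisw v0,0 */
    uint32_t splat48 = 48;  /* li r9,48; mtvrd v1,r9; xxspltw vs33,vs33,1 */
    return zero - splat48;  /* vsubuwm v0,v0,v1: -48 modulo 2^32 */
}
```

The low 5 bits of -48 are 16, so the result is the expected rotate by 16, but it takes six instructions where vspltisw v0,-16; vrlw v2,v2,v0 would give the same count.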