https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.  Possibly
> constructing a V4SImode vector and then packing to V4HImode would be
> better?

void
v4hi_contruct(signed short *t, signed short tt, short tt1)
{
  t[0] = tt;
  t[1] = tt1;
  t[2] = tt1;
  t[3] = tt1;
}

void
v4si_contruct(int *t, int tt, int tt2)
{
  t[0] = tt;
  t[1] = tt2;
  t[2] = tt2;
  t[3] = tt2;
}

v4hi_contruct(short*, short, short):
        movzx   eax, dx
        movzx   esi, si
        mov     rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rsi
        mov     QWORD PTR [rdi], rdx
        ret

v4si_contruct(int*, int, int):
        vmovd   xmm2, edx
        vmovd   xmm3, esi
        vpinsrd xmm1, xmm2, edx, 1
        vpinsrd xmm0, xmm3, edx, 1
        vpunpcklqdq     xmm0, xmm0, xmm1
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret

Both vmovd and vpinsrd are expensive, and v4hi_contruct is not necessarily
worse than v4si_contruct, but v4hi_contruct can be optimized to be a little
more parallel via GPRs.
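
To illustrate the last remark, here is a rough C sketch of what "a little more parallel via GPRs" could look like. It is not taken from the bug report or from any GCC patch. The names pack_v4hi_serial, pack_v4hi_parallel and v4hi_contruct_parallel are hypothetical, and the code assumes a little-endian target such as x86-64. The serial variant mirrors the shift/or chain in the assembly above, where every step depends on the previous one. The second variant builds the low and high 32-bit halves independently and joins them with a single shift and OR, which shortens the dependency chain.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: mirrors the serial shift/or chain seen in the
   generated code for v4hi_contruct.  Each step depends on the last.  */
static uint64_t
pack_v4hi_serial (int16_t tt, int16_t tt1)
{
  uint64_t a = (uint16_t) tt;   /* zero-extend, like movzx */
  uint64_t b = (uint16_t) tt1;
  uint64_t r = b;
  r = (r << 16) | b;
  r = (r << 16) | b;
  r = (r << 16) | a;
  return r;
}

/* Hypothetical helper: the two 32-bit halves do not depend on each
   other, so they can be computed side by side before the final join.  */
static uint64_t
pack_v4hi_parallel (int16_t tt, int16_t tt1)
{
  uint32_t a = (uint16_t) tt;
  uint32_t b = (uint16_t) tt1;
  uint32_t lo = a | (b << 16);  /* elements 0 and 1 */
  uint32_t hi = b | (b << 16);  /* elements 2 and 3 */
  return (uint64_t) lo | ((uint64_t) hi << 32);
}

/* Same effect as v4hi_contruct on a little-endian target (assumption):
   one 64-bit store of the packed value.  */
void
v4hi_contruct_parallel (int16_t *t, int16_t tt, int16_t tt1)
{
  uint64_t packed = pack_v4hi_parallel (tt, tt1);
  memcpy (t, &packed, sizeof packed);
}
```

Whether this is actually faster depends on the target's shift/or latencies and on what the surrounding code needs, so it would have to be measured rather than assumed.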