https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599
--- Comment #6 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Alexander Monakov from comment #5)
> I think we should use punpcklqdq here rather than movddup, because (at least
> on Intel) it has same latency, and same-or-better throughput. It may be ok
> to use movddup when broadcasting from a memory source, but for reg-to-reg
> broadcasting we really should prefer punpcklqdq.
>
> Why isn't IRA using the first alternative? If I tweak the testcase like this
> I get the expected code, so why isn't it working properly without the asm?
>
> typedef long T __attribute__((vector_size(16)));
> T f(long v)
> {
>   asm("# %0" :: "x"(v));
>   return (T){v, v};
> }
>
> gcc -O2 -mtune=intel -msse3
>
> f:
>         movq    %rdi, %xmm0
> #APP
>         # %xmm0
> #NO_APP
>         punpcklqdq      %xmm0, %xmm0
>         ret

When SSE3 is enabled, memory source has lower cost since the SSE3 alternative
doesn't allow register source.
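
For anyone who wants to compare the two cases discussed above, here is a minimal sketch of a testcase that puts a register-source broadcast next to a memory-source broadcast. The function names broadcast_reg and broadcast_mem are hypothetical and are not part of the original report. Which instruction each function ends up using depends on the compiler version and flags, so inspect the generated assembly rather than assuming a particular result.

```c
/* Sketch only: compile with e.g.  gcc -O2 -mtune=intel -msse3 -S  and
   compare the code generated for the two functions.  Assumes an LP64
   target (x86-64), where long is 8 bytes and T holds two elements. */
typedef long T __attribute__((vector_size(16)));

/* Register source: v arrives in a general-purpose register. */
T broadcast_reg(long v)
{
  return (T){v, v};
}

/* Memory source: the scalar is loaded through a pointer. */
T broadcast_mem(const long *p)
{
  return (T){*p, *p};
}
```

Both functions return a vector whose two lanes hold the same value; only the location of the scalar operand differs, which is the distinction the cost remark above is about.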