https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80381
--- Comment #9 from Uroš Bizjak <ubizjak at gmail dot com> ---
I was looking at generated code (with -mtune=intel):

        vpbroadcastd    %edi, %zmm0             # 9     *avx512f_vec_dup_gprv16si/2     [length = 6]
        movl            %edi, %edi              # 12    *zero_extendsidi2/4             [length = 2]
        vmovq           %rdi, %xmm1             # 26    *movdi_internal/20              [length = 6]
        vpsrad          %xmm1, %zmm0, %zmm0     # 17    ashrv16si3/1                    [length = 6]
        ret                                     # 29    simple_return_internal          [length = 1]

(insn 12) and (insn 26) could be merged to

        vmovd           %edx, %xmm0             # 13    *zero_extendsidi2/10            [length = 6]

Register allocator somehow avoids zero-extension to SSE reg in (insn 12) and
generates input reload (insn 26) for (insn 17):

      Inserting insn reload before:
   26: r107:DI=r103:DI
...
            Choosing alt 19 in insn 26:  (0) ?*Yi  (1) r {*movdi_internal}

RA could choose the same (?*Yi, r) alternative in the (insn 12).

REE pass also doesn't merge (insn 12) and (insn 26).
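
For readers who want to experiment with this kind of sequence, here is a minimal sketch of source that broadcasts a scalar into sixteen 32-bit lanes and arithmetic-shifts every lane right by a scalar count. It is not the testcase from this PR: the function name broadcast_sra, the separate value and count parameters, and the use of GNU C vector extensions instead of intrinsics are all assumptions made for illustration. The listing in the comment uses %edi for both the broadcast value and the count; the sketch keeps them apart only so that the result is easy to check. Whether a given compiler version and option set produces the same instructions is not guaranteed.

```c
#include <stdint.h>

/* 16 x 32-bit signed lanes, using the GNU C vector extension. */
typedef int32_t v16si __attribute__ ((vector_size (64)));

/* Hypothetical example, not taken from the PR.
   Broadcast VALUE to all 16 lanes, then shift each lane right
   (arithmetically) by COUNT.  COUNT must be in the range 0..31,
   otherwise the shift is undefined in C.  The result is stored through
   OUT so that no 64-byte vector is passed or returned by value.  */
void
broadcast_sra (int32_t value, int32_t count, v16si *out)
{
  v16si v = { value, value, value, value, value, value, value, value,
              value, value, value, value, value, value, value, value };

  /* Scalar shift count applied to every lane.  */
  *out = v >> count;
}
```

Compiling something like this with -O2 -mavx512f -mtune=intel and -fverbose-asm (or -dp, which appears to be the source of the pattern names and lengths in the listing above) should make it possible to see how the 32-bit count gets zero-extended and moved into an SSE register before the vector shift.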