https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63525
            Bug ID: 63525
           Summary: unnecessary reloads generated in loop
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wmi at google dot com
                CC: vmakarov at gcc dot gnu.org

Created attachment 33700
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33700&action=edit
testcase 1.cxx

For the attached testcase 1.cxx, trunk (r214579) generates an addpd with a
mem operand and one extra reload insn in the kernel loop. g++ before r204274
generates fewer insns in the kernel loop.

~/workarea/gcc-r214579/build/install/bin/g++ -O2 -S 1.cxx -o 1.s

kernel loop:
.L3:
        pxor      %xmm0, %xmm0
        cvtsi2sd  %eax, %xmm0
        addl      $1, %eax
        cmpl      %edx, %eax
        unpcklpd  %xmm0, %xmm0
        addpd     -24(%rsp), %xmm0      ===> mem operand used
        movaps    %xmm0, -24(%rsp)      ===> reload
        jne       .L3

~/workarea/gcc-r199418/build/install/bin/g++ -O2 -S 1.cxx -o 2.s

kernel loop:
.L3:
        xorpd     %xmm1, %xmm1
        cvtsi2sd  %eax, %xmm1
        addl      $1, %eax
        unpcklpd  %xmm1, %xmm1
        addpd     %xmm1, %xmm0
        cmpl      %edx, %eax
        jne       .L3

The reload insns in trunk are generated because of the following steps.

With r204274, the IR after expand looks like this:

Loop:
...
(insn 15 14 16 5 (set (reg/v:V2DF 83 [ v ])
        (plus:V2DF (reg/v:V2DF 83 [ v ])
            (reg:V2DF 92 [ D.5005 ]))) 1.cxx:14 -1
     (nil))
...
end Loop.
(insn 23 22 24 7 (set (reg/v:TI 90 [ tmp ])
        (subreg:TI (reg/v:V2DF 83 [ v ]) 0)) /usr/local/google/home/wmi/workarea/gcc-r212442/build/install/lib/gcc/x86_64-unknown-linux-gnu/4.10.0/include/emmintrin.h:157 -1
     (nil))
(insn 24 23 25 7 (set (mem/c:DF (symbol_ref:DI ("x") [flags 0x2] <var_decl 0x7ffff5c6d5a0 x>) [2 x+0 S8 A64])
        (subreg:DF (reg/v:TI 90 [ tmp ]) 0)) 1.cxx:17 -1
     (nil))
(insn 25 24 0 7 (set (mem/c:DF (symbol_ref:DI ("y") [flags 0x2] <var_decl 0x7ffff5c6d630 y>) [2 y+0 S8 A64])
        (subreg:DF (reg/v:TI 90 [ tmp ]) 8)) 1.cxx:18 -1
     (nil))

Forward propagation will propagate reg 90 from insn 23 into insn 24 and
insn 25 and remove the subreg:TI, so the IR before IRA looks like this:

Loop:
...
(insn 15 14 16 4 (set (reg/v:V2DF 83 [ v ])
        (plus:V2DF (reg/v:V2DF 83 [ v ])
            (reg:V2DF 92 [ D.5005 ]))) 1.cxx:14 1263 {*addv2df3}
     (expr_list:REG_DEAD (reg:V2DF 92 [ D.5005 ])
        (nil)))
...
end Loop.
(insn 24 22 25 5 (set (mem/c:DF (symbol_ref:DI ("x") [flags 0x2] <var_decl 0x7ffff5c6d5a0 x>) [2 x+0 S8 A64])
        (subreg:DF (reg/v:V2DF 83 [ v ]) 0)) 1.cxx:17 128 {*movdf_internal}
     (nil))
(insn 25 24 0 5 (set (mem/c:DF (symbol_ref:DI ("y") [flags 0x2] <var_decl 0x7ffff5c6d630 y>) [2 y+0 S8 A64])
        (subreg:DF (reg/v:V2DF 83 [ v ]) 8)) 1.cxx:18 128 {*movdf_internal}
     (expr_list:REG_DEAD (reg/v:V2DF 83 [ v ])
        (nil)))

ix86_cannot_change_mode_class doesn't allow a subreg such as
"(subreg:DF (reg/v:V2DF 83 [ v ]) 8)" in insn 25, so reg 83 is added to
invalid_mode_changes by record_subregs_of_mode and is allocated the NO_REGS
regclass. Because reg 83 has the NO_REGS regclass while plus:V2DF in insn 15
requires its target operand to be an xmm register, reload insns are needed.
The kernel loop has low register pressure and does not form a separate IRA
region, so live range splitting on the region border doesn't kick in here.

Without r204274, the IR after expand looks like this:

Loop:
...
(insn 15 14 16 5 (set (reg/v:V2DF 61 [ v ])
        (plus:V2DF (reg/v:V2DF 61 [ v ])
            (reg:V2DF 68 [ D.4966 ]))) 1.cxx:14 -1
     (nil))
...
End Loop.
(insn 25 24 26 7 (set (subreg:V2DF (reg/v:TI 66 [ tmp ]) 0)
        (reg/v:V2DF 61 [ v ])) /usr/local/google/home/wmi/workarea/gcc-r199418/build/install/lib/gcc/x86_64-unknown-linux-gnu/4.9.0/include/emmintrin.h:147 -1
     (nil))
(insn 26 25 27 7 (set (mem/c:DF (symbol_ref:DI ("x") [flags 0x2] <var_decl 0x7ffff5e80be0 x>) [2 x+0 S8 A64])
        (subreg:DF (reg/v:TI 66 [ tmp ]) 0)) 1.cxx:17 -1
     (nil))
(insn 27 26 0 7 (set (mem/c:DF (symbol_ref:DI ("y") [flags 0x2] <var_decl 0x7ffff5e80c78 y>) [2 y+0 S8 A64])
        (subreg:DF (reg/v:TI 66 [ tmp ]) 8)) 1.cxx:18 -1
     (nil))

Because the subreg is on the left-hand side of insn 25, forward propagation
cannot merge insn 25 into insn 26 and insn 27.
reg 61 will therefore not have a reference like
"(subreg:DF (reg/v:V2DF 61 [ v ]) 8)", so it gets an SSE regclass and does
not introduce extra reload insns in the kernel loop. r204274 just enables
more forward propagation and exposes the problem here.
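
For readers without access to the attachment, the following is a rough sketch of source code with the shape the RTL above suggests. It is NOT the attached 1.cxx: the function name accumulate_lanes, the use of _mm_set1_pd and _mm_store_pd, and the local buffer tmp are assumptions inferred from the assembly (cvtsi2sd, unpcklpd, addpd) and from the emmintrin.h locations in the dump. It may or may not reproduce the extra reload on a given compiler revision.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; x86/x86-64 only

// Globals named after the symbol_refs "x" and "y" in the RTL dump.
double x, y;

// Hypothetical reconstruction: accumulate a V2DF vector in a loop, then
// store its low and high lanes to two separate doubles after the loop.
void accumulate_lanes(int n) {
    __m128d v = _mm_setzero_pd();
    for (int i = 0; i < n; ++i) {
        // Convert i to double, broadcast it to both lanes, add to v
        // (compare cvtsi2sd / unpcklpd / addpd in the kernel loop).
        v = _mm_add_pd(v, _mm_set1_pd(static_cast<double>(i)));
    }
    // Spill the vector into a 16-byte temporary and read each lane.
    // The reads at byte offsets 0 and 8 correspond to the two DF stores
    // in the dump; the offset-8 access is the one that becomes
    // (subreg:DF (reg:V2DF ...) 8) after forward propagation.
    alignas(16) double tmp[2];
    _mm_store_pd(tmp, v);
    x = tmp[0];
    y = tmp[1];
}
```

Compiling such a file with "g++ -O2 -S" on both revisions and comparing the loop bodies is one way to check whether the addpd takes a memory operand; calling accumulate_lanes(4) should leave both x and y equal to 0+1+2+3.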