https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |jamborm at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-08

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The umul5 case in #c0 is worse because of SRA.
With -O2 -fno-tree-sra optimized dump looks like:
  _3 = a_4(D) w* b_5(D);
  D.2396 = VIEW_CONVERT_EXPR<struct u64x2_t>(_3);
  D.2383 = D.2396;
  return D.2383;
and
  _3 = a_4(D) w* b_5(D);
  D.2389 = VIEW_CONVERT_EXPR<struct u64x2_t>(_3);
  return D.2389;
for the two functions and even when there is the superfluous copying we emit
the same assembly.
But with SRA the former becomes:
  _3 = a_4(D) w* b_5(D);
  D.2396 = VIEW_CONVERT_EXPR<struct u64x2_t>(_3);
  SR.6_12 = D.2396.low;
  SR.7_13 = D.2396.high;
  D.2383.low = SR.6_12;
  D.2383.high = SR.7_13;
  return D.2383;

In the -fno-tree-sra case the IL just contains one extra TImode pseudo ->
pseudo assignment which is shortly optimized away, so we have just:
(insn 7 4 13 2 (parallel [
            (set (reg:TI 87)
                (mult:TI (zero_extend:TI (reg:DI 89))
                    (zero_extend:TI (reg:DI 90))))
            (clobber (reg:CC 17 flags))
        ]) "pr99434.C":23:66 426 {*umulditi3_1}
     (expr_list:REG_DEAD (reg:DI 90)
        (expr_list:REG_DEAD (reg:DI 89)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 13 7 14 2 (set (reg/i:TI 0 ax)
        (reg:TI 87)) "pr99434.C":24:1 73 {*movti_internal}
     (expr_list:REG_DEAD (reg:TI 87)
        (nil)))
(insn 14 13 0 2 (use (reg/i:TI 0 ax)) "pr99434.C":24:1 -1
     (nil))
before reload, while with SRA we have:
(insn 7 4 19 2 (parallel [
            (set (reg:TI 90)
                (mult:TI (zero_extend:TI (reg:DI 98))
                    (zero_extend:TI (reg:DI 99))))
            (clobber (reg:CC 17 flags))
        ]) "pr99434.C":18:57 426 {*umulditi3_1}
     (expr_list:REG_DEAD (reg:DI 99)
        (expr_list:REG_DEAD (reg:DI 98)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 19 7 20 2 (set (reg:DI 92 [ D.2396 ])
        (subreg:DI (reg:TI 90) 0)) "pr99434.C":5:40 74 {*movdi_internal}
     (nil))
(insn 20 19 23 2 (set (reg:DI 93 [ D.2396+8 ])
        (subreg:DI (reg:TI 90) 8)) "pr99434.C":5:40 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:TI 90)
        (nil)))
(insn 23 20 24 2 (set (reg:DI 0 ax)
        (reg:DI 92 [ D.2396 ])) "pr99434.C":19:1 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 92 [ D.2396 ])
        (nil)))
(insn 24 23 17 2 (set (reg:DI 1 dx [+8 ])
        (reg:DI 93 [ D.2396+8 ])) "pr99434.C":19:1 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 93 [ D.2396+8 ])
        (nil)))
(insn 17 24 0 2 (use (reg/i:TI 0 ax)) "pr99434.C":19:1 -1
     (nil))

While in both cases we get the same (right) RA decisions about the
umulditi3_1, and the IRA decisions seems to be good too:
      Popping a2(r90,l0)  -- assign reg 0
      Popping a4(r98,l0)  -- assign reg 5
      Popping a0(r93,l0)  -- assign reg 1
      Popping a1(r92,l0)  -- assign reg 0
      Popping a3(r99,l0)  -- assign reg 4
for some reason LRA then decides to use different registers...
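
For readers without the attachment from #c0 at hand, the shape of code that would give GIMPLE like the two dumps above can be sketched roughly as follows. This is not the testcase from the PR: the function names, the helper, and the use of `__builtin_bit_cast` are assumptions made for illustration, and only `struct u64x2_t` with its `low` and `high` fields comes from the dumps. The idea is that one function returns the struct through an inlined helper (leaving an aggregate copy that SRA may split into two DImode pieces), while the other returns the converted value directly.

```cpp
#include <cstdint>

// Field names taken from the dumps; everything else below is hypothetical.
struct u64x2_t {
    std::uint64_t low;
    std::uint64_t high;
};

// Hypothetical helper: reinterpret a 128-bit product as two 64-bit halves.
// __builtin_bit_cast is available in GCC 11+ and recent Clang.
static inline u64x2_t to_u64x2(unsigned __int128 v) {
    return __builtin_bit_cast(u64x2_t, v);
}

// Variant with an extra aggregate copy once the helper is inlined.
u64x2_t umul_via_helper(std::uint64_t a, std::uint64_t b) {
    return to_u64x2(static_cast<unsigned __int128>(a) * b);
}

// Variant that returns the converted value directly.
u64x2_t umul_direct(std::uint64_t a, std::uint64_t b) {
    return __builtin_bit_cast(u64x2_t, static_cast<unsigned __int128>(a) * b);
}
```

Assuming an x86-64 target, compiling this with -O2 and again with -O2 -fno-tree-sra, adding -fdump-tree-optimized and -fdump-rtl-ira, should make it possible to compare dumps of the kind quoted above. Whether this sketch reproduces the exact difference described in the comment has not been checked.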