https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108041
Bug ID: 108041
Summary: ivopts results in extra instruction in simple loop
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at gcc dot gnu.org
CC: rzinsly at ventanamicro dot com
Target Milestone: ---

ivopts seems to make a bit of a mess out of this code resulting in the loop
having an unnecessary instruction.

Compile with rv64 -O2:

typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;

void marc_arcs(network_t* net)
{
  while (net->full_groups < 0) {
    net->full_groups = net->nr_group + net->full_groups;
    net->max_elems--;
  }
}

After slp1 we have this loop:

;;   basic block 3, loop depth 0
;;    pred:       2
  _1 = net_8(D)->nr_group;
  net__max_elems_lsm.4_16 = net_8(D)->max_elems;
;;    succ:       4

;;   basic block 4, loop depth 1
;;    pred:       7
;;                3
  # _13 = PHI <_2(7), _11(3)>
  # net__max_elems_lsm.4_5 = PHI <_4(7), net__max_elems_lsm.4_16(3)>
  _2 = _1 + _13;
  _4 = net__max_elems_lsm.4_5 + -1;
  if (_2 < 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 5>; [11.00%]
;;    succ:       7
;;                5

;;   basic block 7, loop depth 1
;;    pred:       4
  goto <bb 4>; [100.00%]
;;    succ:       4

;;   basic block 5, loop depth 0
;;    pred:       4
  # _12 = PHI <_2(4)>
  # _17 = PHI <_4(4)>
  net_8(D)->full_groups = _12;
  net_8(D)->max_elems = _17;
;;    succ:       6

Of particular interest is the max_elems computation into _4.  We accumulate
it in the loop, then do the final store after the loop (thank you LSM!).
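As a rough C-level reading of the slp1 dump (a sketch, not GCC output): the
loads of nr_group and max_elems are hoisted into the preheader, both values
are carried in locals, and the stores are sunk to the loop exit.  The name
marc_arcs_after_lsm is hypothetical, and the entry guard is an assumption
about basic block 2, which the dump does not show.

```c
typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;

/* Original source from the report. */
void marc_arcs(network_t *net)
{
  while (net->full_groups < 0) {
    net->full_groups = net->nr_group + net->full_groups;
    net->max_elems--;
  }
}

/* Hypothetical hand-written equivalent of the post-slp1 GIMPLE. */
void marc_arcs_after_lsm(network_t *net)
{
  long full = net->full_groups;   /* _11; assumed to be loaded in bb 2 */
  if (full < 0) {                 /* assumed loop-entry guard in bb 2 */
    long nr = net->nr_group;      /* _1, loaded once in bb 3 */
    long max = net->max_elems;    /* net__max_elems_lsm.4_16 */
    do {
      full = nr + full;           /* _2 = _1 + _13 */
      max = max + -1;             /* _4 = net__max_elems_lsm.4_5 + -1 */
    } while (full < 0);
    net->full_groups = full;      /* stores happen once, in bb 5 */
    net->max_elems = max;
  }
}
```

In this form the loop body touches no memory; only the add, the decrement,
and the branch remain inside it.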
After ivopts we have:

;;   basic block 3, loop depth 0
;;    pred:       2
  _1 = net_8(D)->nr_group;
  net__max_elems_lsm.4_16 = net_8(D)->max_elems;
  _22 = net__max_elems_lsm.4_16 + -1;
  ivtmp.10_21 = (unsigned long) _22;
;;    succ:       4

;;   basic block 4, loop depth 1
;;    pred:       7
;;                3
  # _13 = PHI <_2(7), _11(3)>
  # ivtmp.10_3 = PHI <ivtmp.10_18(7), ivtmp.10_21(3)>
  _2 = _1 + _13;
  _4 = (long int) ivtmp.10_3;
  ivtmp.10_18 = ivtmp.10_3 - 1;
  if (_2 < 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 5>; [11.00%]
;;    succ:       7
;;                5

;;   basic block 7, loop depth 1
;;    pred:       4
  goto <bb 4>; [100.00%]
;;    succ:       4

;;   basic block 5, loop depth 0
;;    pred:       4
  # _12 = PHI <_2(4)>
  # _17 = PHI <_4(4)>
  net_8(D)->full_groups = _12;
  net_8(D)->max_elems = _17;
;;    succ:       6

Note the introduction of the IV and its relationship to _4.  Essentially we
compute both in the loop even though _4 is always one greater than the IV.
Worse yet, the IV is only used to compute _4!  And since they differ by 1, we
actually compute both and keep them alive, resulting in this final code for
rv64:

.L3:
        add     a5,a5,a2
        mv      a3,a4
        addi    a4,a4,-1
        blt     a5,zero,.L3
        sd      a5,8(a0)
        sd      a3,16(a0)

Note how we had to "stash away" the value of a4 before the decrement so that
we could store it after the loop.

The induction variable doesn't really buy us anything in this loop -- it's
actively harmful.  Not using the IV would probably be best.  Second best would
be to realize that _4 (aka a3) can be derived from the IV (a4) after the loop
by adding 1.
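The "second best" option can be illustrated at the C level.  This is a sketch
under assumptions, not GCC output: loop_ivopts mirrors the post-ivopts GIMPLE,
while loop_derived drops the in-loop copy and recovers the stored value as
IV + 1 after the exit.  Both function names are hypothetical, and the entry
guard stands in for basic block 2, which the dumps do not show.

```c
typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;

/* Shape of the loop after ivopts: _4 and the IV are both live in the loop. */
void loop_ivopts(network_t *net)
{
  long full = net->full_groups;
  if (full < 0) {                  /* assumed loop-entry guard */
    long nr = net->nr_group;
    unsigned long iv = (unsigned long) (net->max_elems - 1); /* ivtmp.10_21 */
    long max;
    do {
      full = nr + full;            /* add  a5,a5,a2 */
      max = (long) iv;             /* mv   a3,a4   (the stashed copy) */
      iv = iv - 1;                 /* addi a4,a4,-1 */
    } while (full < 0);
    net->full_groups = full;
    net->max_elems = max;
  }
}

/* Alternative: keep only the IV in the loop and add 1 once after the exit. */
void loop_derived(network_t *net)
{
  long full = net->full_groups;
  if (full < 0) {                  /* assumed loop-entry guard */
    long nr = net->nr_group;
    unsigned long iv = (unsigned long) (net->max_elems - 1);
    do {
      full = nr + full;
      iv = iv - 1;                 /* no copy of the pre-decrement value */
    } while (full < 0);
    net->full_groups = full;
    net->max_elems = (long) (iv + 1);  /* _4 is always IV + 1 at the exit */
  }
}
```

With the derived form the loop body should need only the add, the decrement,
and the branch; the extra addition runs once, outside the loop.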