https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108041

            Bug ID: 108041
           Summary: ivopts results in extra instruction in simple loop
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at gcc dot gnu.org
                CC: rzinsly at ventanamicro dot com
  Target Milestone: ---

ivopts seems to make a bit of a mess out of this code resulting in the loop
having an unnecessary instruction.  Compile with rv64 -O2:

typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;
void marc_arcs(network_t* net)
{
  while (net->full_groups < 0) {
    net->full_groups = net->nr_group + net->full_groups;
    net->max_elems--;
  }
}

After slp1 we have this loop:
;;   basic block 3, loop depth 0
;;    pred:       2
  _1 = net_8(D)->nr_group;
  net__max_elems_lsm.4_16 = net_8(D)->max_elems;
;;    succ:       4

;;   basic block 4, loop depth 1
;;    pred:       7
;;                3
  # _13 = PHI <_2(7), _11(3)>
  # net__max_elems_lsm.4_5 = PHI <_4(7), net__max_elems_lsm.4_16(3)>
  _2 = _1 + _13;
  _4 = net__max_elems_lsm.4_5 + -1;
  if (_2 < 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 5>; [11.00%]
;;    succ:       7
;;                5

;;   basic block 7, loop depth 1
;;    pred:       4
  goto <bb 4>; [100.00%]
;;    succ:       4

;;   basic block 5, loop depth 0
;;    pred:       4
  # _12 = PHI <_2(4)>
  # _17 = PHI <_4(4)>
  net_8(D)->full_groups = _12;
  net_8(D)->max_elems = _17;
;;    succ:       6


Of particular interest is the max_elems computation into _4.  We accumulate it
in the loop, then do the final store after the loop (thank you LSM!).  After
ivopts we have:


;;   basic block 3, loop depth 0
;;    pred:       2
  _1 = net_8(D)->nr_group;
  net__max_elems_lsm.4_16 = net_8(D)->max_elems;
  _22 = net__max_elems_lsm.4_16 + -1;
  ivtmp.10_21 = (unsigned long) _22;
;;    succ:       4

;;   basic block 4, loop depth 1
;;    pred:       7
;;                3
  # _13 = PHI <_2(7), _11(3)>
  # ivtmp.10_3 = PHI <ivtmp.10_18(7), ivtmp.10_21(3)>
  _2 = _1 + _13;
  _4 = (long int) ivtmp.10_3;
  ivtmp.10_18 = ivtmp.10_3 - 1;
  if (_2 < 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 5>; [11.00%]
;;    succ:       7
;;                5

;;   basic block 7, loop depth 1
;;    pred:       4 
  goto <bb 4>; [100.00%]
;;    succ:       4

;;   basic block 5, loop depth 0
;;    pred:       4
  # _12 = PHI <_2(4)>
  # _17 = PHI <_4(4)>
  net_8(D)->full_groups = _12;
  net_8(D)->max_elems = _17;
;;    succ:       6

Note the introduction of the IV and its relationship to _4.  Essentially we
compute both in the loop even though _4 is always one greater than the IV.
Worse yet, the IV is only used to compute _4!  And since they differ by 1, we
actually compute both and keep both live, resulting in this final code for rv64:

.L3:
        add     a5,a5,a2
        mv      a3,a4
        addi    a4,a4,-1
        blt     a5,zero,.L3
        sd      a5,8(a0)
        sd      a3,16(a0)


Note how we had to "stash away" the value of a4 before the decrement so that we
could store it after the loop.  The induction variable doesn't really buy us
anything in this loop -- it's actively harmful.  Not using the IV would
probably be best.  Second best would be to realize that _4 (aka a3) can be
derived from the IV (a4) after the loop by adding 1.
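
For illustration, here is a rough C rendering of the three shapes discussed
above.  This is only a sketch read off the GIMPLE dumps, not what GCC emits or
how a fix would be implemented.  The function names and the check_same helper
are made up for this example, and the (long) casts of out-of-range unsigned
values assume GCC's wraparound conversion.

```c
#include <assert.h>

typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;

/* Shape after slp1: no separate IV, max_elems is simply accumulated.  */
static void marc_arcs_no_iv (network_t *net)
{
  long full = net->full_groups;
  if (full >= 0)
    return;
  long nr = net->nr_group;
  long elems = net->max_elems;
  do
    {
      full = nr + full;
      elems = elems + -1;        /* _4 */
    }
  while (full < 0);
  net->full_groups = full;
  net->max_elems = elems;
}

/* Shape after ivopts: the IV and _4 are both live in the loop.  */
static void marc_arcs_ivopts (network_t *net)
{
  long full = net->full_groups;
  if (full >= 0)
    return;
  long nr = net->nr_group;
  unsigned long iv = (unsigned long) net->max_elems - 1;
  long v4;
  do
    {
      full = nr + full;
      v4 = (long) iv;            /* the "mv a3,a4" copy */
      iv = iv - 1;
    }
  while (full < 0);
  net->full_groups = full;
  net->max_elems = v4;
}

/* Second-best shape: keep the IV, derive _4 after the loop as IV + 1.  */
static void marc_arcs_derived (network_t *net)
{
  long full = net->full_groups;
  if (full >= 0)
    return;
  long nr = net->nr_group;
  unsigned long iv = (unsigned long) net->max_elems - 1;
  do
    {
      full = nr + full;
      iv = iv - 1;
    }
  while (full < 0);
  net->full_groups = full;
  net->max_elems = (long) (iv + 1);
}

/* Hypothetical helper: all three variants must agree on the same input.  */
static void check_same (network_t in)
{
  network_t a = in, b = in, c = in;
  marc_arcs_no_iv (&a);
  marc_arcs_ivopts (&b);
  marc_arcs_derived (&c);
  assert (a.full_groups == b.full_groups && b.full_groups == c.full_groups);
  assert (a.max_elems == b.max_elems && b.max_elems == c.max_elems);
}
```

In both the first and third variants the loop body carries only two updates,
which is the point of the report: the copy in the ivopts shape exists solely
to preserve the pre-decrement value for the store after the loop.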
