On 2021/9/2 18:37, Richard Biener wrote:
On Thu, 2 Sep 2021, Xionghu Luo wrote:



On 2021/9/2 16:50, Richard Biener wrote:
On Thu, 2 Sep 2021, Richard Biener wrote:

On Thu, 2 Sep 2021, Xionghu Luo wrote:



On 2021/9/1 17:58, Richard Biener wrote:
This fixes the CFG walk order of fill_always_executed_in to use
RPO order rather than the dominator based order computed by
get_loop_body_in_dom_order.  That fixes correctness issues with
unordered dominator children.

The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
its for-iteration mode is a good match for the algorithm.

Xionghu, I've tried to only fix the CFG walk order issue and not
change anything else with this so we have a more correct base
to work against.  The code still walks inner loop bodies
up to loop depth times and thus is quadratic in the loop depth.

Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
have any comments I plan to push this and then revisit what we
were circling around.

LGTM, thanks.

I pushed it, then thought again while attempting to build a testcase
and concluded I was wrong about the apparent mishandling of
contains_call - get_loop_body_in_dom_order seems to be exactly
correct for this specific case.  So I reverted the commit again.

And I figured what the

                /* In a loop that is always entered we may proceed anyway.
                   But record that we entered it and stop once we leave it.  */

comment was about.  The code was present before the fix for PR78185
and it was supposed to catch the case where the entered inner loop
is not finite.  As the testcase from PR78185 shows, the stopping
was done too late, when the exit block was already marked
as to be always executed.  A simpler fix for PR78185 would have been
to move

            if (!flow_bb_inside_loop_p (inn_loop, bb))
              break;

before the setting of last = bb.  In fact the installed fix was more
pessimistic than that, given it terminated already when entering
a possibly infinite loop.  So we can improve that by doing
something like the following, which should also improve the situation
for some of the cases you were looking at?

What remains is that we continue to stop when entering a
not always executed loop:

            if (bb->loop_father->header == bb)
              {
                if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
                  break;

Yes.  This causes blocks after the inner loop to miss being checked
for whether they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is
inevitable here...

Yes.  What we can try is pre-computing whether a loop
has a call or an inner loop that might not terminate, and then,
when that's true for the loop to be entered, continue to break;
but when not, skip processing that loop's blocks (though we still
fill the blocks array, and we do need to do this in the order
for the loop we're processing ...).

So what I was thinking was to somehow embed the dominator
walk of get_loop_body_in_dom_order and instead of pre-recording
the above info (call, infinite loop) for loops, pre-record
it on the dominator tree so that we can ask "in any of our
dominator children, is there a call or an infinite loop" and
thus cut the dominator walk at loop header blocks that are
not dominating the outer loop latch ...

Of course the simplistic solution might be to simply do

    if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
        && ((loop_depth (bb->loop_father) - loop_depth (loop))
            > param_max_lim_loop_depth_lookahead))
      break;

and thus limit the processing of conditionally executed inner
loops by relative depth ... as you say the actual processing
is unlikely to be the bottleneck for the degenerate cases
of a very deep nest of conditionally executed loops.

But still, for this case get_loop_body_in_dom_order is
doing quadratic processing, so we can also say that
another linear walk over the produced array does not
increase complexity.

volatile int flag, bar;
double foo (double *valp)
{
    double sum = 0;
    for (int i = 0; i < 256; ++i)
      {
        for (int j = 0; j < 256; ++j)
          bar = flag;
        if (flag)
          sum += 1.;
        sum += *valp;
      }
    return sum;
}

The patch still fails to handle cases like this:


struct X { int i; int j; int k;};
volatile int m;

void bar (struct X *x, int n, int l, int k)
{
  for (int i = 0; i < l; i++)
    {
      if (k)
        for (int j = 0; j < l; j++)
          {
            if (m)
              x->i = m;
            else
              x->i = 1 - m;

            int *r = &x->k;
            int tem2 = *r;
            x->k += tem2 * j;
          }

      x->i = m;
    }
}

x->i is still not marked ALWAYS_EXECUTED for the outer loop.


Collected data when building GCC stage1 and bootstrap.  There are still
about 9% of bbs that miss being marked ALWAYS_EXECUTED.  Execution time
of fill_always_executed_in_1 increases noticeably in stage1, but it is at
the millisecond level, while the bootstrap build doesn't show a big time
change.

base vs. incoming_edge_counter_patch:

1. For stage1: 2214/2216 functions handled in both lim2 and lim4 with:
    - bbs found "always executed", base vs. patched:  17666 vs. 19278.
    - total cost time of fill_always_executed_in_1 (usec):  30574 vs. 51238.
2. bootstrap: 33910/33951 functions handled in both lim2 and lim4 with:
    - bbs found "always executed", base vs. patched:  201171 vs. 222942.
    - total cost time of fill_always_executed_in_1 (usec):  987054 vs. 1061821.5.

Do we still plan to replace get_loop_body_in_dom_order with a
linear search, or just "limit the processing of conditionally executed inner
loops by relative depth"? ;)



ALWAYS_EXECUTED_IN is not computed completely for nested loops.
The current design exits if an inner loop doesn't dominate the outer
loop's latch, or stops after exiting from an inner loop, which
causes an early return from the outer loop; ALWAYS_EXECUTED blocks
after inner loops are then skipped.  Another problem is that it has to
call get_loop_body_in_dom_order in each recursive call, which is also
inefficient.

This patch does a greedy visit of the loop header's successors,
processing further blocks once all their entry edges (ignoring
backedges) have been processed, and setting ALWAYS_EXECUTED_IN if the
block dominates the loop's latch.  When the worklist is empty, proceed
to the inner loops.  For bookkeeping, simply keep a
to-visit-incoming-edges counter per BB, and pass it through to inner
loops.

Detailed discussion:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html

gcc/ChangeLog:

        * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
        inn_loop check.  Greedily visit loop header successors and
        update edge counts.
        (fill_always_executed_in): Compute bb_incoming_edges and pass
        down.
        (loop_invariant_motion_in_fun): Compute dfs back edge.

gcc/testsuite/ChangeLog:

        * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
        * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 113 +++++++++++----------
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  31 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  30 ++++++
 3 files changed, 122 insertions(+), 52 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b0582c047a1..e9ce18f822f 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3192,79 +3192,65 @@ do_store_motion (void)
    blocks that contain a nonpure call.  */
static void
-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi)
 {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL;
   edge e;
-  class loop *inn_loop = loop;
   if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
     {
-      bbs = get_loop_body_in_dom_order (loop);
+      auto_bitmap work_set;
+      bitmap_set_bit (work_set, loop->header->index);
+      unsigned bb_index;
-      for (i = 0; i < loop->num_nodes; i++)
-       {
-         edge_iterator ei;
-         bb = bbs[i];
+      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+      int *bbd = XCNEWVEC (int, array_size);
+      bbd = XDUPVEC (int, bbi, array_size);
-         if (!flow_bb_inside_loop_p (inn_loop, bb))
+      while (!bitmap_empty_p (work_set))
+       {
+         bb_index = bitmap_first_set_bit (work_set);
+         bitmap_clear_bit (work_set, bb_index);
+         bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
+         if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
            {
-             /* When we are leaving a possibly infinite inner loop
-                we have to stop processing.  */
-             if (!finite_loop_p (inn_loop))
-               break;
-             /* If the loop was finite we can continue with processing
-                the loop we exited to.  */
-             inn_loop = bb->loop_father;
+             SET_ALWAYS_EXECUTED_IN (bb, loop);
+             if (dump_enabled_p ())
+               dump_printf (MSG_NOTE, "BB %d is always executed in loop %d\n",
+                   bb->index, loop->num);
            }
-
-         if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-           last = bb;
-
          if (bitmap_bit_p (contains_call, bb->index))
            break;
-         /* If LOOP exits from this BB stop processing.  */
-         FOR_EACH_EDGE (e, ei, bb->succs)
-           if (!flow_bb_inside_loop_p (loop, e->dest))
-             break;
-         if (e)
-           break;
-
          /* A loop might be infinite (TODO use simple loop analysis
             to disprove this if possible).  */
          if (bb->flags & BB_IRREDUCIBLE_LOOP)
            break;
-         if (bb->loop_father->header == bb)
+         edge_iterator ei;
+         FOR_EACH_EDGE (e, ei, bb->succs)
            {
-             if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+             if (e->dest == loop->latch)
+               continue;
+             /* If there is an exit from this BB.  */
+             if (!flow_bb_inside_loop_p (loop, e->dest))
+               break;
+             /* Or we enter a possibly non-finite loop.  */
+             if (flow_loop_nested_p (bb->loop_father,
+                                     e->dest->loop_father)
+                 && ! finite_loop_p (e->dest->loop_father))
                break;
-             /* In a loop that is always entered we may proceed anyway.
-                But record that we entered it and stop once we leave it
-                since it might not be finite.  */
-             inn_loop = bb->loop_father;
+             bbd[e->dest->index]--;
+             if (bbd[e->dest->index] == 0)
+               bitmap_set_bit (work_set, e->dest->index);
            }
-       }
-
-      while (1)
-       {
-         if (dump_enabled_p ())
-           dump_printf (MSG_NOTE, "BB %d is always executed in loop %d\n",
-                        last->index, loop->num);
-         SET_ALWAYS_EXECUTED_IN (last, loop);
-         if (last == loop->header)
+         if (e)
            break;
-         last = get_immediate_dominator (CDI_DOMINATORS, last);
        }
-
-      free (bbs);
+      free (bbd);
     }
-
-  for (loop = loop->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
 }
/* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
@@ -3275,12 +3261,22 @@ static void
 fill_always_executed_in (void)
 {
   basic_block bb;
-  class loop *loop;
+  edge_iterator ei;
+  edge e;
+
+  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+  int *bb_incoming_edges = XCNEWVEC (int, array_size);
   auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
   bitmap_clear (contains_call);
   FOR_EACH_BB_FN (bb, cfun)
     {
+      int len = bb->preds->length ();
+      FOR_EACH_EDGE (e, ei, bb->preds)
+       if (e->flags & EDGE_DFS_BACK)
+         len--;
+      bb_incoming_edges[bb->index] = len;
+
       gimple_stmt_iterator gsi;
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
        {
@@ -3292,8 +3288,19 @@ fill_always_executed_in (void)
        bitmap_set_bit (contains_call, bb->index);
     }
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+
+  for (auto loop : loops_list (cfun, 0))
+    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
 }
@@ -3396,6 +3403,8 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
   /* Gathers information about memory accesses in the loops.  */
   analyze_memory_references (store_motion);
+  mark_dfs_back_edges ();
+
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..3d5fe6f0314
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,31 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l, int m)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+       {
+         if (m)
+           x->j++;
+         else
+           x->j = m+n+l;
+
+         int *p = &x->j;
+         int tem = *p;
+         x->j += tem * i;
+       }
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..6e180de528e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,30 @@
+/* PR/101293 */
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
+{
+  while (--n)
+    {
+      do
+       {
+         *q += 1;
+         if (f)
+           goto out;
+       }
+      while (--m);
+      *p += 1;
+    }
+out:;
+}
+
+int main()
+{
+    int i = 0;
+    foo (10, 10, 1, (void *) 0, &i);
+    if (i != 1)
+      __builtin_abort ();
+    return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
--
2.27.0.90.geebb51ba8c
