On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote:
I have designed and implemented path splitting for loops with a conditional
IF-THEN-ELSE. The implementation has gone through bootstrap for the Microblaze
target along with the DejaGnu regression tests, and through runs of the
MiBench/EEMBC benchmarks. No regression is seen in the DejaGnu tests, and the
DejaGnu C++ test results are better with the path splitting changes (fewer
failures). The MiBench/EEMBC benchmarks show no performance degradation; a
performance improvement of 9.15% is observed in telcom_gsm (MiBench) and of
9.4% in rgbcmy01_lite (EEMBC).

An earlier proposal on path splitting along the lines below was made, and the
reference is attached.
The first thing I notice is you haven't included any tests for the new optimization. We're trying really hard these days to have tests in the suite for this kind of new optimization/feature work.

I don't offhand know if any of the benchmarks you cite above are free-enough to derive a testcase from. But one trick many of us use is to instrument the pass and compile some known free software (often gcc itself) to find triggering code and use that to generate tests for the new transformation.
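For instance, a reduced testcase in the usual DejaGnu style might look like the following. This is only a sketch: the option name and the dump-scan string are hypothetical, since the pass (and any flag or dump file for it) doesn't exist yet.

```c
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-split-paths-details" } */
/* NOTE: the option and dump names above are hypothetical placeholders.  */

/* A loop with an IF-THEN-ELSE whose join block is the source of the
   loop latch -- the shape this transformation targets.  The join block
   carries a real statement (s += t) so there is something to sink.  */
int
sum_even_odd (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    {
      int t;
      if (a[i] & 1)
        t = a[i] * 3;
      else
        t = a[i];
      s += t;   /* join block, feeding the loop latch */
    }
  return s;
}

/* { dg-final { scan-tree-dump "duplicating join block" "split-paths" } } */
```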



Design changes.

1. Given a block, say BB1, ending in a conditional IF statement, the blocks
dominated by BB1 are found; a block that is dominated by BB1 but is not a
successor of BB1 is taken as the join node of the IF-THEN-ELSE inside the
loop.
This isn't necessarily my area of strength, but it doesn't seem right to me. I guess it would work if there aren't any more nodes in the loop after the join block, except for the loop latch, but it doesn't seem right in the general case. It might work if you also verify that the IF block is the immediate dominator of the potential join node.




2. The join node is the same as the source of the loop's latch edge.
OK.  This is why your initial algorithm in #1 works.


3. If the conditions in (1) and (2) hold, the join node (i.e. the source of
the loop latch edge) is moved into its predecessors, and the join node is
left as an empty statement block containing only the phi nodes.
Hopefully you do this by duplicating the join node and manipulating the CFG so that each of the original predecessors of the join goes to its own copy of the join node. The copied join nodes then unconditionally pass control to the original join node.



4. In the process of moving the join node into its predecessors, the result
of each phi node in the join block is replaced with the corresponding phi
argument, chosen according to which predecessor the copied statements are
moved into.
Right. Note there's a phi-only cprop pass which is designed to do this propagation and in theory should be very fast.

5. The dominator info is updated after performing steps (1) through (4).
Right. Getting the sequencing of the various updates can be tricky. Even more so if you try to do it as a part of jump threading because you have to (for example) deal with unreachable blocks in the CFG.



6. The join node, which is the same as the source of the loop latch edge, is
made empty, containing only its phi nodes.
Right.


7. The implementation is done in the jump threading phase of the
machine-independent optimizations on the tree-based representation. Path
splitting is invoked after the jump threading optimization has been
performed.
The following files are modified.

     a) tree-vrp.c
     b) tree-ssa-threadedge.c
     c) tree-cfg.c
     d) tree-ssa-threadedge.h
     e) tree-cfg.h
     f) cfghooks.c

The diff is attached along with this mail and pasted below.
I would recommend this as a separate pass. One of the goals for GCC 6 is to break out the threader into its own pass, if you're piggybacking on the threader it just makes this task harder.

Also by implementing this as a distinct pass you have more freedom to put it where it really belongs in the optimization pipeline. Given the duplication may end up exposing partial redundancies, dead code & propagation opportunities, it's probably best placed before DOM. That also eliminates the need for running phi-only cprop because DOM will take care of it for you.

When jump threading is broken out into its own distinct pass, it'll probably end up just before or just after your path splitting pass.



Please share your thoughts and feedback on the above optimization, the design,
and the coding and implementation done for it.
Some general notes. Every function should have a block comment which describes what the function does, its arguments & return value. There are many examples throughout the code you can use to understand what we're looking for in those function comments.

Those comments are particularly useful during initial review and later maintenance, so in the future, please include them, even on RFCs.

You'll need to write a ChangeLog entry. It's not particularly exciting or fun to do, but I regularly find that writing the ChangeLog forces me to look at each hunk in the patch and re-justify its purpose. It's common that in doing so I rework the patch because I see something wrong or a cleaner/better way to solve a particular problem.
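For reference, an entry in the standard GNU ChangeLog format for this patch would look roughly like the following. The file and function names come from the diff below; the one-line descriptions are my guesses and would need to match the final patch.

```
YYYY-MM-DD  Your Name  <your@email.example>

	* cfghooks.c (delete_basic_block): Guard loop access against a
	NULL loop.
	* tree-cfg.c (gimple_threaded_merge_blocks): New function.
	(remove_bb): Guard loop access against a NULL loop.
	(gimple_duplicate_bb): Make extern.
	(add_phi_args_after_copy_edge): Likewise.
	* tree-cfg.h: Declare them.
	* tree-ssa-threadedge.c (replace_threaded_uses)
	(is_def_stmt_assert_expr, process_path_splitting): New functions.
	* tree-ssa-threadedge.h (process_path_splitting): Declare.
	* tree-vrp.c (identify_jump_threads): Call process_path_splitting.
```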



diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c
index 9faa339..559ca96 100644
--- a/gcc/cfghooks.c
+++ b/gcc/cfghooks.c
@@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)

        /* If we remove the header or the latch of a loop, mark the loop for
         removal.  */
-      if (loop->latch == bb
+      if (loop && loop->latch == bb
          || loop->header == bb)
        mark_loop_for_removal (loop);
So what caused you to add this additional test? In general loop structures are supposed to always be available. The change here implies that the loop structures were gone at some point. That seems at first glance a mistake.



diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index aed5254..b25e409 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
      }
  }

+void
+gimple_threaded_merge_blocks (basic_block a, basic_block b)
If we keep this function it will need a block comment.

I say "if" for a couple reasons. First we already have support routines that know how to merge blocks. If you really need to merge blocks you should try to use them.

Second, I'm not sure that you really need to worry about block merging in this pass. Just create the duplicates, wire them into the CFG and let the existing block merging support handle this problem.




  static void
@@ -2061,7 +2119,7 @@ remove_bb (basic_block bb)

        /* If a loop gets removed, clean up the information associated
         with it.  */
-      if (loop->latch == bb
+      if (loop && loop->latch == bb
Same concern here as earlier, why don't you have the loop structures available when this routine gets called?


          || loop->header == bb)
        free_numbers_of_iterations_estimates_loop (loop);
      }
@@ -5779,7 +5837,7 @@ gimple_can_duplicate_bb_p (const_basic_block bb ATTRIBUTE_UNUSED)
  /* Create a duplicate of the basic block BB.  NOTE: This does not
     preserve SSA form.  */

-static basic_block
+basic_block
  gimple_duplicate_bb (basic_block bb)
There's already interfaces for duplicating blocks that you can/should be using. In particular "duplicate_block" or "gimple_duplicate_sese_region".

  {
    basic_block new_bb;
@@ -5858,7 +5916,7 @@ gimple_duplicate_bb (basic_block bb)

  /* Adds phi node arguments for edge E_COPY after basic block duplication.  */

-static void
+void
  add_phi_args_after_copy_edge (edge e_copy)
I don't think you'll need this if you use the existing interfaces for block duplication.


diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 4303a18..2c7d36d 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -1359,6 +1359,322 @@ thread_through_normal_block (edge e,
    return 0;
  }

+static void
+replace_threaded_uses (basic_block a,basic_block b)
If you keep this function, then it'll need a function comment.

It looks like this is just doing const/copy propagation. I think a better structure is to implement your optimization as a distinct pass, then rely on existing passes such as update_ssa, DOM, CCP to handle updating the SSA graph and propagation opportunities exposed by your transformation.


Similarly for the other replace_ functions.

}
+
+static bool
+is_def_stmt_assert_expr (basic_block src1, basic_block bb,
+                        int pred_index)
As a distinct pass I don't think you'd need to handle ASSERT_EXPRs, as they're removed after VRP has completed. The only reason you're running into them is because VRP runs the jump threader as a subroutine.

And it's odd that you also check for MIN_EXPR; I don't see how that's useful. And yet you don't check for MAX_EXPR, so something definitely seems odd here.

+
+void
+process_path_splitting (basic_block bb)
Function comment.  This seems to be the meat of the transformation.



+{
+  vec<basic_block> bbs;
+  basic_block bb1;
+  unsigned int i;
+  edge_iterator ei;
+  edge e1;
+  bool found = false ,found1;
+
+  bbs = get_all_dominated_blocks (CDI_DOMINATORS,
+                                  bb );
+  FOR_EACH_VEC_ELT (bbs, i, bb1)
+  {
+    found1 = false;
+    FOR_EACH_EDGE (e1, ei, bb->succs)
+    {
+      if (bb1 == e1->dest)
+        {
+          found = true;
+          found1 = true;
+        }
+    }
+    if (!found1 && found)
+      {
+        unsigned int num_preds = 0;
+        found = false;
+
+        FOR_EACH_EDGE (e1, ei, bb1->succs)
+        {
+          if (e1->flags & (EDGE_DFS_BACK))
+            found = true;
+        }
+
+        if (found && EDGE_COUNT(bb1->preds) == 2)
+          {
+            basic_block temp_bb = bb1;
+            basic_block src1,src2;
+            unsigned int k = 0, num_stmt = 0, num_phis = 0;
+            bool is_src1 = false, is_src2 = false;
+            gimple_stmt_iterator  psi,gsi;
+
+            FOR_EACH_EDGE (e1, ei, bb1->preds)
+            {
+              if ((e1->flags & (EDGE_DFS_BACK)))
+                continue;
+              if (k == 1)
+                {
+                  if (single_succ_p(e1->src) &&
+                      single_succ_edge (e1->src)->flags & EDGE_FALLTHRU)
+                    {
+                      src2 = e1->src;
+                      is_src2 = true;
+                    }
+                }
+                else
+                  {
+                    if (single_succ_p(e1->src) &&
+                        single_succ_edge (e1->src)->flags & EDGE_FALLTHRU)
+                      {
+                        is_src1 = true;
+                        src1 = e1->src;
+                      }
+                  }
+                k++;
+           }
+          for (gsi = gsi_start_bb (bb1); !gsi_end_p (gsi); gsi_next (&gsi))
+            num_stmt++;
+
+          if ((num_stmt > 1) && is_src1 && is_src2)
+            {
+              bool found_non_virtual_result = false;
+
+              for (psi = gsi_start_phis (bb1); !gsi_end_p (psi); )
+                 {
+                   gimple stmt = gsi_stmt (psi);
+
+                   if (!virtual_operand_p (gimple_phi_result (stmt)))
+                     num_phis++;
+                   else
+                     found_non_virtual_result = true;
+
+                   gsi_next(&psi);
+                }
+
+              if ((num_phis >1) || found_non_virtual_result)
+                return;
+
+              if (!(is_def_stmt_assert_expr (src1, bb1, 1) &&
+                  is_def_stmt_assert_expr (src2, bb1, 2)))
+                return;
So up to this point is just identification of candidates. Seems like it ought to factor into its own function.

And from here to the end of the function is the actual transformation, which feels like it should factor into its own function as well.

+
+              temp_bb = gimple_duplicate_bb (bb1);
+
+              replace_threaded_uses (bb1, temp_bb);
+
+              make_edge (src1, temp_bb, EDGE_FALLTHRU);
+
+              make_edge (src2,temp_bb,EDGE_FALLTHRU);
+
+              FOR_EACH_EDGE (e1, ei, temp_bb->preds)
+                add_phi_args_after_copy_edge (e1);
+
+              replace_phis(temp_bb, bb1);
+
+              replace_uses_phi(bb1, 1);
+
+              gimple_threaded_merge_blocks (src1, bb1);
+
+              replace_uses_phi(temp_bb, 2);
+
+              gimple_threaded_merge_blocks (src2, temp_bb);

+
+              while (EDGE_COUNT (temp_bb->preds) != 0)
+                remove_edge (EDGE_PRED (temp_bb, 0));
+
+              while (EDGE_COUNT (temp_bb->succs) != 0)
+                remove_edge (EDGE_SUCC (temp_bb, 0));
+
+              if (dom_info_available_p (CDI_DOMINATORS))
+                delete_from_dominance_info (CDI_DOMINATORS, temp_bb);
+
+              if (dom_info_available_p (CDI_POST_DOMINATORS))
+                delete_from_dominance_info (CDI_POST_DOMINATORS, temp_bb);
+
+              remove_phi_nodes (temp_bb);
+              expunge_block (temp_bb);
+
+              return;
+            }
+          }
+      }
+  }
+}
+


diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c
index fbecdf7..c0db184 100644
--- a/gcc/tree-vrp.c
+++ b/gcc/tree-vrp.c
@@ -10141,6 +10141,10 @@ identify_jump_threads (void)
              thread_across_edge (dummy, e, true, &equiv_stack,
                                  simplify_stmt_for_jump_threading);
            }
+          if (gimple_code(last) == GIMPLE_COND)
+            {
+              process_path_splitting(bb);
So instead of running this as a subroutine of vrp, make it a first class pass. Then find the right place to put the pass. I think that'll ultimately be cleaner and simpler as you won't be trying to duplicate things like const/copy propagation.


Overall I like the transformation you're trying to make, but I would like to see this as a separate pass; if it can't be a separate pass, please indicate why. Second, you should be looking to re-use existing block duplication routines, SSA updating, etc. as much as possible.

jeff
