On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote:
I have designed and implemented path splitting for loops with a conditional
IF-THEN-ELSE. The implementation has gone through bootstrap for the Microblaze
target along with the DejaGnu regression tests, and through runs of the
MiBench/EEMBC benchmarks. No regression is seen in the DejaGnu tests, and the
DejaGnu C++ test results are better with the path splitting changes (fewer
failures). The MiBench/EEMBC benchmarks show no performance degradation; a
performance improvement of 9.15% is observed in telcom_gsm (MiBench) and of
9.4% in rgbcmy01_lite (EEMBC).

An earlier proposal on path splitting along the lines below was made, and the
reference is attached.
The first thing I notice is you haven't included any tests for the new optimization. We're trying really hard these days to have tests in the suite for this kind of new optimization/feature work.

I don't offhand know if any of the benchmarks you cite above are free-enough to derive a testcase from. But one trick many of us use is to instrument the pass and compile some known free software (often gcc itself) to find triggering code and use that to generate tests for the new transformation.
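For instance, a reduced testcase in the usual DejaGnu style might look like the following. This is only a sketch: the option name and the dump-scan string are hypothetical, since the pass (and any flag or dump file for it) doesn't exist yet.

```c
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-split-paths-details" } */
/* NOTE: the option and dump names above are hypothetical placeholders.  */

/* A loop with an IF-THEN-ELSE whose join block is the source of the
   loop latch -- the shape this transformation targets.  The join block
   carries a real statement (s += t) so there is something to sink.  */
int
sum_even_odd (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    {
      int t;
      if (a[i] & 1)
        t = a[i] * 3;
      else
        t = a[i];
      s += t;   /* join block, feeding the loop latch */
    }
  return s;
}

/* { dg-final { scan-tree-dump "duplicating join block" "split-paths" } } */
```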



Design changes.

1. Given a block, say BB1, ending in a conditional IF statement, the blocks
dominated by BB1 are found; a block that is dominated by BB1 but is not a
successor of BB1 is taken as the join node of the IF-THEN-ELSE inside the
loop.
This isn't necessarily my area of strength, but it doesn't seem right to me. I guess it would work if there aren't any more nodes in the loop after the join block, except for the loop latch, but it doesn't seem right in the general case. It might work if you also verify that the IF block is the immediate dominator of the potential join node.




2. The join node is the same as the source of the loop's latch edge.
OK.  This is why your initial algorithm in #1 works.


3. If the conditions in (1) and (2) hold, the join node (i.e. the source of
the loop latch edge) is moved into its predecessors, and the join node is
left as an empty statement block containing only the phi nodes.
Hopefully you do this by duplicating the join node and manipulating the CFG so that each of the original predecessors of the join goes to its own copy of the join node. The copied join nodes then unconditionally pass control to the original join node.



4. In the process of moving the join node into its predecessors, the result
of each phi node in the join block is replaced with the corresponding phi
argument, chosen according to which predecessor the copied statements are
moved into.
Right. Note there's a phi-only cprop pass which is designed to do this propagation and in theory should be very fast.

5. The dominator info is updated after performing steps (1) through (4).
Right. Getting the sequencing of the various updates can be tricky. Even more so if you try to do it as a part of jump threading because you have to (for example) deal with unreachable blocks in the CFG.



6. The join node, which is the same as the source of the loop latch edge, is
made empty, containing only its phi nodes.
Right.


7. The implementation is done in the jump threading phase of the
machine-independent optimizations on the tree-based representation. Path
splitting is invoked after the jump threading optimization has been
performed.
The following files are modified.

     a) tree-vrp.c
     b) tree-ssa-threadedge.c
     c) tree-cfg.c
     d) tree-ssa-threadedge.h
     e) tree-cfg.h
     f) cfghooks.c

The diff is attached along with this mail and pasted below.
I would recommend this as a separate pass. One of the goals for GCC 6 is to break out the threader into its own pass, if you're piggybacking on the threader it just makes this task harder.

Also by implementing this as a distinct pass you have more freedom to put it where it really belongs in the optimization pipeline. Given the duplication may end up exposing partial redundancies, dead code & propagation opportunities, it's probably best placed before DOM. That also eliminates the need for running phi-only cprop because DOM will take care of it for you.

When jump threading is broken out into its own distinct pass, it'll probably end up just before or just after your path splitting pass.



Please share your thoughts and feedback on the above optimization, the design,
and the coding and implementation done for it.
Some general notes. Every function should have a block comment which describes what the function does, its arguments & return value. There are many examples throughout the code you can use to understand what we're looking for in those function comments.

Those comments are particularly useful during initial review and later maintenance, so in the future, please include them, even on RFCs.

You'll need to write a ChangeLog entry. It's not particularly exciting or fun to do, but I regularly find that writing the ChangeLog forces me to look at each hunk in the patch and re-justify its purpose. It's common that in doing so I rework the patch because I see something wrong or a cleaner/better way to solve a particular problem.
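For reference, an entry in the standard GNU ChangeLog format for this patch would look roughly like the following. The file and function names come from the diff below; the one-line descriptions are my guesses and would need to match the final patch.

```
YYYY-MM-DD  Your Name  <your@email.example>

	* cfghooks.c (delete_basic_block): Guard loop access against a
	NULL loop.
	* tree-cfg.c (gimple_threaded_merge_blocks): New function.
	(remove_bb): Guard loop access against a NULL loop.
	(gimple_duplicate_bb): Make extern.
	(add_phi_args_after_copy_edge): Likewise.
	* tree-cfg.h: Declare them.
	* tree-ssa-threadedge.c (replace_threaded_uses)
	(is_def_stmt_assert_expr, process_path_splitting): New functions.
	* tree-ssa-threadedge.h (process_path_splitting): Declare.
	* tree-vrp.c (identify_jump_threads): Call process_path_splitting.
```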



diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c
index 9faa339..559ca96 100644
--- a/gcc/cfghooks.c
+++ b/gcc/cfghooks.c
@@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)

        /* If we remove the header or the latch of a loop, mark the loop for
         removal.  */
-      if (loop->latch == bb
+      if (loop && loop->latch == bb
          || loop->header == bb)
        mark_loop_for_removal (loop);
So what caused you to add this additional test? In general loop structures are supposed to always be available. The change here implies that the loop structures were gone at some point. That seems at first glance a mistake.



diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index aed5254..b25e409 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
      }
  }

+void
+gimple_threaded_merge_blocks (basic_block a, basic_block b)
If we keep this function it will need a block comment.

I say "if" for a couple reasons. First we already have support routines that know how to merge blocks. If you really need to merge blocks you should try to use them.

Second, I'm not sure that you really need to worry about block merging in this pass. Just create the duplicates, wire them into the CFG and let the existing block merging support handle this problem.




  static void
@@ -2061,7 +2119,7 @@ remove_bb (basic_block bb)

        /* If a loop gets removed, clean up the information associated
         with it.  */
-      if (loop->latch == bb
+      if (loop && loop->latch == bb
Same concern here as earlier, why don't you have the loop structures available when this routine gets called?


          || loop->header == bb)
        free_numbers_of_iterations_estimates_loop (loop);
      }
@@ -5779,7 +5837,7 @@ gimple_can_duplicate_bb_p (const_basic_block bb ATTRIBUTE_UNUSED)
  /* Create a duplicate of the basic block BB.  NOTE: This does not
     preserve SSA form.  */

-static basic_block
+basic_block
  gimple_duplicate_bb (basic_block bb)
There's already interfaces for duplicating blocks that you can/should be using. In particular "duplicate_block" or "gimple_duplicate_sese_region".

  {
    basic_block new_bb;
@@ -5858,7 +5916,7 @@ gimple_duplicate_bb (basic_block bb)

  /* Adds phi node arguments for edge E_COPY after basic block duplication.  */

-static void
+void
  add_phi_args_after_copy_edge (edge e_copy)
I don't think you'll need this if you use the existing interfaces for block duplication.


diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 4303a18..2c7d36d 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -1359,6 +1359,322 @@ thread_through_normal_block (edge e,
    return 0;
  }

+static void
+replace_threaded_uses (basic_block a,basic_block b)
If you keep this function, then it'll need a function comment.

It looks like this is just doing const/copy propagation. I think a better structure is to implement your optimization as a distinct pass, then rely on existing passes such as update_ssa, DOM, CCP to handle updating the SSA graph and propagation opportunities exposed by your transformation.


Similarly for the other replace_ functions.

}
+
+static bool
+is_def_stmt_assert_expr (basic_block src1, basic_block bb,
+                        int pred_index)
As a distinct pass I don't think you'd need to handle ASSERT_EXPRs, as they're removed after VRP has completed. The only reason you're running into them is because VRP runs the jump threader as a subroutine.

And it's odd that you also check for MIN_EXPR; I don't see how that's useful. And yet you don't check for MAX_EXPR, so something definitely seems odd here.

+
+void
+process_path_splitting (basic_block bb)
Function comment.  This seems to be the meat of the transformation.



+{
+  vec<basic_block> bbs;
+  basic_block bb1;
+  unsigned int i;
+  edge_iterator ei;
+  edge e1;
+  bool found = false ,found1;
+
+  bbs = get_all_dominated_blocks (CDI_DOMINATORS,
+                                  bb );
+  FOR_EACH_VEC_ELT (bbs, i, bb1)
+  {
+    found1 = false;
+    FOR_EACH_EDGE (e1, ei, bb->succs)
+    {
+      if (bb1 == e1->dest)
+        {
+          found = true;
+          found1 = true;
+        }
+    }
+    if (!found1 && found)
+      {
+        unsigned int num_preds = 0;
+        found = false;
+
+        FOR_EACH_EDGE (e1, ei, bb1->succs)
+        {
+          if (e1->flags & (EDGE_DFS_BACK))
+            found = true;
+        }
+
+        if (found && EDGE_COUNT(bb1->preds) == 2)
+          {
+            basic_block temp_bb = bb1;
+            basic_block src1,src2;
+            unsigned int k = 0, num_stmt = 0, num_phis = 0;
+            bool is_src1 = false, is_src2 = false;
+            gimple_stmt_iterator  psi,gsi;
+
+            FOR_EACH_EDGE (e1, ei, bb1->preds)
+            {
+              if ((e1->flags & (EDGE_DFS_BACK)))
+                continue;
+              if (k == 1)
+                {
+                  if (single_succ_p(e1->src) &&
+                      single_succ_edge (e1->src)->flags & EDGE_FALLTHRU)
+                    {
+                      src2 = e1->src;
+                      is_src2 = true;
+                    }
+                }
+                else
+                  {
+                    if (single_succ_p(e1->src) &&
+                        single_succ_edge (e1->src)->flags & EDGE_FALLTHRU)
+                      {
+                        is_src1 = true;
+                        src1 = e1->src;
+                      }
+                  }
+                k++;
+           }
+          for (gsi = gsi_start_bb (bb1); !gsi_end_p (gsi); gsi_next (&gsi))
+            num_stmt++;
+
+          if ((num_stmt > 1) && is_src1 && is_src2)
+            {
+              bool found_non_virtual_result = false;
+
+              for (psi = gsi_start_phis (bb1); !gsi_end_p (psi); )
+                 {
+                   gimple stmt = gsi_stmt (psi);
+
+                   if (!virtual_operand_p (gimple_phi_result (stmt)))
+                     num_phis++;
+                   else
+                     found_non_virtual_result = true;
+
+                   gsi_next(&psi);
+                }
+
+              if ((num_phis >1) || found_non_virtual_result)
+                return;
+
+              if (!(is_def_stmt_assert_expr (src1, bb1, 1) &&
+                  is_def_stmt_assert_expr (src2, bb1, 2)))
+                return;
So up to this point is just identification of candidates. Seems like it ought to factor into its own function.

And from here to the end of the function is the actual transformation, which feels like it should factor into its own function as well.

+
+              temp_bb = gimple_duplicate_bb (bb1);
+
+              replace_threaded_uses (bb1, temp_bb);
+
+              make_edge (src1, temp_bb, EDGE_FALLTHRU);
+
+              make_edge (src2,temp_bb,EDGE_FALLTHRU);
+
+              FOR_EACH_EDGE (e1, ei, temp_bb->preds)
+                add_phi_args_after_copy_edge (e1);
+
+              replace_phis(temp_bb, bb1);
+
+              replace_uses_phi(bb1, 1);
+
+              gimple_threaded_merge_blocks (src1, bb1);
+
+              replace_uses_phi(temp_bb, 2);
+
+              gimple_threaded_merge_blocks (src2, temp_bb);

+
+              while (EDGE_COUNT (temp_bb->preds) != 0)
+                remove_edge (EDGE_PRED (temp_bb, 0));
+
+              while (EDGE_COUNT (temp_bb->succs) != 0)
+                remove_edge (EDGE_SUCC (temp_bb, 0));
+
+              if (dom_info_available_p (CDI_DOMINATORS))
+                delete_from_dominance_info (CDI_DOMINATORS, temp_bb);
+
+              if (dom_info_available_p (CDI_POST_DOMINATORS))
+                delete_from_dominance_info (CDI_POST_DOMINATORS, temp_bb);
+
+              remove_phi_nodes (temp_bb);
+              expunge_block (temp_bb);
+
+              return;
+            }
+          }
+      }
+  }
+}
+


diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c
index fbecdf7..c0db184 100644
--- a/gcc/tree-vrp.c
+++ b/gcc/tree-vrp.c
@@ -10141,6 +10141,10 @@ identify_jump_threads (void)
              thread_across_edge (dummy, e, true, &equiv_stack,
                                  simplify_stmt_for_jump_threading);
            }
+          if (gimple_code(last) == GIMPLE_COND)
+            {
+              process_path_splitting(bb);
So instead of running this as a subroutine of vrp, make it a first class pass. Then find the right place to put the pass. I think that'll ultimately be cleaner and simpler as you won't be trying to duplicate things like const/copy propagation.


Overall I like the transformation you're trying to make, but I would like to see this as a separate pass; if it can't be a separate pass, please indicate why. Second, you should be looking to re-use existing block duplication routines, SSA updating, etc. as much as possible.

jeff
