[RFC] Adding unroller and DCE to early passes

Jan Hubicka Thu, 23 Apr 2015 21:54:37 -0700

Hi,
I have been looking into reordering the optimization queue of the early passes.
This is motivated by PR57249 and by the fact that I ran into some super silly
loops while looking at Firefox dumps.  The reordering makes a lot of sense to
me: for code dealing with short arrays it enables more SRA/FRE/DSE.
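
As a made-up illustration (not taken from the Firefox dumps or from PR57249),
the kind of loop in question might look like the sketch below.  The vec3 type
and both function names are hypothetical.  The second function shows by hand
roughly what complete unrolling would leave behind: only constant indices,
which is the form SRA and FRE can work with.

```cpp
// Hypothetical example; vec3 and the function names are made up.
struct vec3 { int v[3]; };

// A loop with a small constant trip count over a tiny local aggregate.
// While the loop is present, t.v[] is read through a variable index,
// which generally prevents splitting the array into scalars.
int sum_with_loop (int a, int b, int c)
{
  vec3 t;
  t.v[0] = a;
  t.v[1] = b;
  t.v[2] = c;
  int sum = 0;
  for (int i = 0; i < 3; i++)
    sum += t.v[i];
  return sum;
}

// Hand-written equivalent of the completely unrolled loop.  Every access
// now uses a constant index, so the aggregate can in principle be
// scalarized and the stores forwarded to the loads.
int sum_unrolled (int a, int b, int c)
{
  vec3 t;
  t.v[0] = a;
  t.v[1] = b;
  t.v[2] = c;
  return t.v[0] + t.v[1] + t.v[2];
}
```

Both functions compute the same value; the point is only that the second
form no longer needs the array to live in memory.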


I added a cunrolle pass that differs from cunrolli by not allowing code size
growth even at -O3 (because we do not know yet which loops are hot).
We currently unroll tiny loops with 2 calls, which I think needs to be tamed
down; I can do that if the patch seems to make sense.

I tried several options and ended up adding cunrolle before FRE and reordering
FRE and SRA: SRA needs constant propagation to happen after unrolling to work,
and I think value numbering works pretty well on non-SRAed data structures.
I also added DCE just before unrolling.  This increases the number of unrolls
by about 60% on both tramp3d and eon (basically we want DCE and cprop done so
that the unroller metrics work reasonably well).
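
To make the point about the metrics concrete, here is a toy model of a
"no code size growth" check.  It is only a sketch: the real estimate in
tree-ssa-loop-ivcanon.c is more detailed, and the loop_summary struct, its
fields and the function names below are all hypothetical.  It shows how dead
statements left in the body can make an otherwise profitable loop look too
big to unroll.

```cpp
// Toy model only; not GCC's actual size estimate.  All names are made up.
struct loop_summary
{
  unsigned body_insns;      // statements in one iteration of the body
  unsigned dead_insns;      // of those, statements DCE would delete
  unsigned folded_insns;    // live statements assumed to fold away once
                            // the induction variable is a constant
  unsigned overhead_insns;  // IV increment, exit test and branch
  unsigned niter;           // known constant iteration count
};
// Precondition: dead_insns + folded_insns <= body_insns.

// Size of the loop if it is left rolled.
unsigned rolled_size (const loop_summary &l, bool after_dce)
{
  unsigned body = after_dce ? l.body_insns - l.dead_insns : l.body_insns;
  return body + l.overhead_insns;
}

// Size after complete unrolling: niter copies of the body, each without
// the statements that fold, and with no loop overhead left.
unsigned unrolled_size (const loop_summary &l, bool after_dce)
{
  unsigned body = after_dce ? l.body_insns - l.dead_insns : l.body_insns;
  return l.niter * (body - l.folded_insns);
}

// Unroll only when the estimate says the code does not grow.
bool unroll_without_growth (const loop_summary &l, bool after_dce)
{
  return unrolled_size (l, after_dce) <= rolled_size (l, after_dce);
}
```

With a body of 8 statements (4 of them dead, 2 foldable), 3 overhead
statements and 3 iterations, this model rejects the loop before DCE
(18 vs. 11) and accepts it after DCE (6 vs. 7).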

On tramp3d there is no great code quality improvement (which is expected), but
we get stronger early opts.  In particular the number of basic blocks at
release_ssa time is 8% down and the inline size estimate at IPA time is about
2% down.

We do 124 unrollings early, 136 at cunrolli time and 132 in cunroll
compared to 228 at cunrolli and 130 at cunroll without the patch.

The new early DCE pass does 8005 deletions, early cddce does 9307, and the
first late dce does 2888.
Without the patch early cddce does 9510 (so the patch nearly doubles the
number of statements we get rid of early) and the first late dce does 8587
(almost 3 times as much).
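
For reference, the ratios behind those two remarks can be recomputed from
the tramp3d counts quoted above.  The snippet is just arithmetic; the
constant and function names are made up for illustration.

```cpp
// tramp3d statement-deletion counts as quoted in this mail.
const int patched_early_dce   = 8005;  // new early DCE pass
const int patched_early_cddce = 9307;  // early cddce with the patch
const int patched_late_dce    = 2888;  // first late dce with the patch
const int base_early_cddce    = 9510;  // early cddce without the patch
const int base_late_dce       = 8587;  // first late dce without the patch

// How many times more statements are removed early with the patch.
double early_removal_ratio ()
{
  return static_cast<double> (patched_early_dce + patched_early_cddce)
         / base_early_cddce;
}

// How many times more statements the first late dce removes without
// the patch.
double late_removal_ratio ()
{
  return static_cast<double> (base_late_dce) / patched_late_dce;
}
```

That is 17312 vs. 9510 statements removed early (about 1.8x) and 8587 vs.
2888 removed by the first late dce (just under 3x).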

This seems like a significant decrease in the garbage pushed through the IPA
pipeline (which in turn confuses inline metrics).


On DealII we unroll 477 loops early (out of 11k), 421 at cunrolli and 122 at
cunroll.  Without the patch we unroll 1428 loops in cunrolli and 127 loops in
cunroll.

Early dce1 removes 24133 statements and cddce 15599; late dce removes 7698.
Without the patch cddce1 removes 33007 statements and the first late dce
removes 8717.  That is a 20% increase in the number of statements we get rid
of early and a 13% decrease in late DCE.

Number of basic blocks at release_ssa time drops from 270859 to 260485, by 21%
number of statements by 4%.

If this looks reasonable, I would suggest doing one change at a time, that
is first adding the extra dce pass, then reordering SRA and finally adding the
cunrolle pass (after implementing the logic to not unroll calls).

Honza

Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c     (revision 222391)
+++ tree-ssa-loop-ivcanon.c     (working copy)
@@ -1571,4 +1571,59 @@ make_pass_complete_unrolli (gcc::context
   return new pass_complete_unrolli (ctxt);
 }
 
+/* Early complete unrolling pass; do only those internal loops where code
+   size gets reduced.  */
 
+namespace {
+
+const pass_data pass_data_complete_unrolle =
+{
+  GIMPLE_PASS, /* type */
+  "cunrolle", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_COMPLETE_UNROLL, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+class pass_complete_unrolle : public gimple_opt_pass
+{
+public:
+  pass_complete_unrolle (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_complete_unrolle, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return optimize >= 2; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_complete_unrolle
+
+unsigned int
+pass_complete_unrolle::execute (function *fun)
+{
+  unsigned ret = 0;
+
+  loop_optimizer_init (LOOPS_NORMAL
+                      | LOOPS_HAVE_RECORDED_EXITS);
+  if (number_of_loops (fun) > 1)
+    {
+      scev_initialize ();
+      ret = tree_unroll_loops_completely (false, false);
+      free_numbers_of_iterations_estimates ();
+      scev_finalize ();
+    }
+  loop_optimizer_finalize ();
+
+  return ret;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_complete_unrolle (gcc::context *ctxt)
+{
+  return new pass_complete_unrolle (ctxt);
+}
Index: tree-pass.h
===================================================================
--- tree-pass.h (revision 222391)
+++ tree-pass.h (working copy)
@@ -375,6 +375,7 @@ extern gimple_opt_pass *make_pass_vector
 extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_complete_unrolle (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
Index: passes.def
===================================================================
--- passes.def  (revision 222391)
+++ passes.def  (working copy)
@@ -83,11 +83,17 @@ along with GCC; see the file COPYING3.
          /* After CCP we rewrite no longer addressed locals into SSA
             form if possible.  */
          NEXT_PASS (pass_forwprop);
-         NEXT_PASS (pass_sra_early);
+         /* DCE enables considerably more complete unrolling.  */
+          NEXT_PASS (pass_dce);
+         /* Unroll small loops to enable SRA of tiny arrays tripped by them.
+            It is important to propagate constants in FRE before sra_early
+            happens. */
+          NEXT_PASS (pass_complete_unrolle);
          /* pass_build_ealias is a dummy pass that ensures that we
             execute TODO_rebuild_alias at this point.  */
          NEXT_PASS (pass_build_ealias);
          NEXT_PASS (pass_fre);
+         NEXT_PASS (pass_sra_early);
          NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
          NEXT_PASS (pass_cd_dce);
