On Fri, 24 Apr 2015, Jan Hubicka wrote:
Hi,
I was looking into reordering optimization queue of early passes. This is
motivated by PR57249 and fact that I run into some super sily loops while
looking into firefox dumps. It indeed makes a lot of sense for me as for
code
dealing with short arrays this enables more SRA/FRE/DSE.
I added cunrolle pass that differ from cunrolli by not allowing code size
growth even at -O3 (because we do not know what loops are hot yet).
We currently unroll tiny loop with 2 calls that I think needs to be tammed
down, I can do that if the patch seems to make sense.
Please.
I tried several options and ended up adding cunrolle before FRE and reordering
FRE and SRA: SRA needs constant propagation to happen after unrolling to work
and I think value numbering does work pretty well on non-SRAed datastructures.
I also added DCE just before unrolling. This increases number of unrolls by
about 60% on both tramp3d and eon. (basically we want to have DCE and cprop
done to make unroller metrics go resonably well)
I've re-ordered SRA and ealias for GCC5 because ealias benefits from SRA
while SRA doesn't need PTA. You undo that improvement.
We should improve the unroller instead of requiring DCE before it.
On tramp3d there is not great code quality improvement (which is expected),
but
we get stronger early opts. In particular the number of basic blocks at
release_ssa time is 8% down, the inline size estimate at IPA time about 2%.
We do 124 unrollings early, 136 at cunrolli time and 132 in cunroll
compared to 228 at cunrolli and 130 at cunroll without the patch.
New early DCE pass does 8005 deletions, early cddce does 9307, first late dce
does 2888.
Without patch early cddce does 9510 (so the patch basically doubles statement
count
we get rid of), first late dce does 8587 (almost 3 times as much ).
This seems like a significant decrease of garbage pushed through IPA pipeline
(which in turn confuses inline metrics).
On DealII we early unroll 477 loops (out of 11k), 421 at cunrolli and 122 at
cunroll
without patch we unroll loops 1428 in cunrolli and 127 loop in cunroll
early dce1 removes 24133 and cddce 15599, late dce does 7698 statements.
without patch we cddce1 33007 statements and first late dce does 8717
statements.
20% increase of # of statements we get rid of early and 13% decrease in late
DCE.
Number of basic blocks at release_ssa time drops from 270859 to 260485, by 21%
number of statements by 4%.
If this looks resonable, I would suggest doing one change at a time, that
is first adding extra dce pass, then reordering SRA and finally adding the
cunrolle pass (after implementing the logic to not unroll calls)
I think that adding more passes to the early pipeline is bad. Sure, it
will help weird C++ code-bases. But it will slow down the compile for
all the rest.
As we want early unrolling to get the secondary effects by performing
better FRE/DCE/DSE I think trying to do a better job in value-numbering
for the kind of loops we are talking about would be better. For tramp3d
we are talking about cases like
for (i=0; i3; ++i)
dim[i] = i;
or similar code which ends up storing to a single array. There is
the vn_reference_lookup_3 function in tree-ssa-sccvn.c which is
the canonical place for value-numbering tricks. It simply
needs to be taught how to look through loops.
I'd like to net get into the game of huge pass order dependences in
early-opts - that just shows our scalar cleanups are too weak.
Ideally we'd just have a single pass in early-opts...
Richard.
Honza
Index: tree-ssa-loop-ivcanon.c
===
--- tree-ssa-loop-ivcanon.c (revision 222391)
+++ tree-ssa-loop-ivcanon.c (working copy)
@@ -1571,4 +1571,59 @@ make_pass_complete_unrolli (gcc::context
return new pass_complete_unrolli (ctxt);
}
+/* Early complete unrolling pass; do only those internal loops where code
+ size gets reduced. */
+namespace {
+
+const pass_data pass_data_complete_unrolle =
+{
+ GIMPLE_PASS, /* type */
+ cunrolle, /* name */
+ OPTGROUP_LOOP, /* optinfo_flags */
+ TV_COMPLETE_UNROLL, /* tv_id */
+ ( PROP_cfg | PROP_ssa ), /* properties_required */
+ 0, /* properties_provided */
+ 0, /* properties_destroyed */
+ 0, /* todo_flags_start */
+ 0, /* todo_flags_finish */
+};
+class pass_complete_unrolle : public gimple_opt_pass
+{
+public:
+ pass_complete_unrolle (gcc::context *ctxt)
+: gimple_opt_pass (pass_data_complete_unrolle, ctxt)
+ {}
+
+ /* opt_pass methods: */
+ virtual bool gate (function *) { return optimize = 2; }
+ virtual unsigned int execute (function *);
+
+}; // class pass_complete_unrolle
+
+unsigned int
+pass_complete_unrolle::execute (function *fun)
+{
+ unsigned ret = 0;
+
+ loop_optimizer_init (LOOPS_NORMAL
+| LOOPS_HAVE_RECORDED_EXITS);
+ if