Re: [RFC] Adding unroller and DCE to early passes

2015-04-24 Thread Richard Biener
On Fri, 24 Apr 2015, Jan Hubicka wrote:

 Hi,
 I was looking into reordering optimization queue of early passes. This is
 motivated by PR57249 and fact that I run into some super sily loops while
 looking into firefox dumps.  It  indeed makes a lot of sense for me as for 
 code
 dealing with short arrays this enables more SRA/FRE/DSE.
 
 I added cunrolle pass that differ from cunrolli by not allowing code size
 growth even at -O3 (because we do not know what loops are hot yet).
 We currently unroll tiny loop with 2 calls that I think needs to be tammed
 down, I can do that if the patch seems to make sense.

Please.

 I tried several options and ended up adding cunrolle before FRE and reordering
 FRE and SRA: SRA needs constant propagation to happen after unrolling to work
 and I think value numbering does work pretty well on non-SRAed datastructures.
 I also added DCE just before unrolling. This increases number of unrolls by
 about 60% on both tramp3d and eon. (basically we want to have DCE and cprop
 done to make unroller metrics go resonably well)

I've re-ordered SRA and ealias for GCC5 because ealias benefits from SRA
while SRA doesn't need PTA.  You undo that improvement.

We should improve the unroller instead of requiring DCE before it.

 On tramp3d there is not great code quality improvement (which is expected), 
 but
 we get stronger early opts. In particular the number of basic blocks at
 release_ssa time is 8% down, the inline size estimate at IPA time about 2%.
 
 We do 124 unrollings early, 136 at cunrolli time and 132 in cunroll
 compared to 228 at cunrolli and 130 at cunroll without the patch.

 New early DCE pass does 8005 deletions, early cddce does 9307, first late dce
 does 2888.
 Without patch early cddce does 9510 (so the patch basically doubles statement 
 count
 we get rid of), first late dce does 8587 (almost 3 times as much ).
 
 This seems like a significant decrease of garbage pushed through IPA pipeline
 (which in turn confuses inline metrics).
 
 
 On DealII we early unroll 477 loops (out of 11k), 421 at cunrolli and 122 at 
 cunroll
 without patch we unroll loops 1428 in cunrolli and 127 loop in cunroll
 
 early dce1 removes 24133 and cddce 15599, late dce does 7698 statements.
 without patch we cddce1 33007 statements and first late dce does 8717 
 statements.
 20% increase of # of statements we get rid of early and 13% decrease in late 
 DCE.
 
 Number of basic blocks at release_ssa time drops from 270859 to 260485, by 21%
 number of statements by 4%.
 
 If this looks resonable, I would suggest doing one change at a time, that
 is first adding extra dce pass, then reordering SRA and finally adding the
 cunrolle pass (after implementing the logic to not unroll calls)

I think that adding more passes to the early pipeline is bad.  Sure, it
will help weird C++ code-bases.  But it will slow down the compile for
all the rest.

As we want early unrolling to get the secondary effects by performing
better FRE/DCE/DSE I think trying to do a better job in value-numbering
for the kind of loops we are talking about would be better.  For tramp3d
we are talking about cases like

 for (i=0; i3; ++i)
  dim[i] = i;

or similar code which ends up storing to a single array.  There is
the vn_reference_lookup_3 function in tree-ssa-sccvn.c which is
the canonical place for value-numbering tricks.  It simply
needs to be taught how to look through loops.

I'd like to net get into the game of huge pass order dependences in
early-opts - that just shows our scalar cleanups are too weak.
Ideally we'd just have a single pass in early-opts...

Richard.
 
 Honza
 
 Index: tree-ssa-loop-ivcanon.c
 ===
 --- tree-ssa-loop-ivcanon.c   (revision 222391)
 +++ tree-ssa-loop-ivcanon.c   (working copy)
 @@ -1571,4 +1571,59 @@ make_pass_complete_unrolli (gcc::context
return new pass_complete_unrolli (ctxt);
  }
  
 +/* Early complete unrolling pass; do only those internal loops where code
 +   size gets reduced.  */
  
 +namespace {
 +
 +const pass_data pass_data_complete_unrolle =
 +{
 +  GIMPLE_PASS, /* type */
 +  cunrolle, /* name */
 +  OPTGROUP_LOOP, /* optinfo_flags */
 +  TV_COMPLETE_UNROLL, /* tv_id */
 +  ( PROP_cfg | PROP_ssa ), /* properties_required */
 +  0, /* properties_provided */
 +  0, /* properties_destroyed */
 +  0, /* todo_flags_start */
 +  0, /* todo_flags_finish */
 +};
 +class pass_complete_unrolle : public gimple_opt_pass
 +{
 +public:
 +  pass_complete_unrolle (gcc::context *ctxt)
 +: gimple_opt_pass (pass_data_complete_unrolle, ctxt)
 +  {}
 +
 +  /* opt_pass methods: */
 +  virtual bool gate (function *) { return optimize = 2; }
 +  virtual unsigned int execute (function *);
 +
 +}; // class pass_complete_unrolle
 +
 +unsigned int
 +pass_complete_unrolle::execute (function *fun)
 +{
 +  unsigned ret = 0;
 +
 +  loop_optimizer_init (LOOPS_NORMAL
 +| LOOPS_HAVE_RECORDED_EXITS);
 +  if 

Re: [RFC] Adding unroller and DCE to early passes

2015-04-24 Thread Jan Hubicka
  I added cunrolle pass that differ from cunrolli by not allowing code size
  growth even at -O3 (because we do not know what loops are hot yet).
  We currently unroll tiny loop with 2 calls that I think needs to be tammed
  down, I can do that if the patch seems to make sense.
 
 Please.

OK, will do that.
 
  I tried several options and ended up adding cunrolle before FRE and 
  reordering
  FRE and SRA: SRA needs constant propagation to happen after unrolling to 
  work
  and I think value numbering does work pretty well on non-SRAed 
  datastructures.
  I also added DCE just before unrolling. This increases number of unrolls by
  about 60% on both tramp3d and eon. (basically we want to have DCE and cprop
  done to make unroller metrics go resonably well)
 
 I've re-ordered SRA and ealias for GCC5 because ealias benefits from SRA
 while SRA doesn't need PTA.  You undo that improvement.

OK, I see. I assumed that the PTA solutions are updated when SRA introduce new 
scalars.
We could simply do limited CCP when unrolling happened lifting the need for FRE 
in between cunrolle and SRA.  I dimly remember we even have code for that?
 
 We should improve the unroller instead of requiring DCE before it.

If I get loop with dead code in it, because of einline or gimple production or 
whatever,
what unroller should do short of  doing its own DCE pass on the whole function 
body (well the mark, ignoring sweep)?

Honza
 
 I think that adding more passes to the early pipeline is bad.  Sure, it
 will help weird C++ code-bases.  But it will slow down the compile for
 all the rest.

I am not 100% convinced about this especially with LTO, where we need to pickle
all garbage we did not eliminated.
 
 As we want early unrolling to get the secondary effects by performing
 better FRE/DCE/DSE I think trying to do a better job in value-numbering
 for the kind of loops we are talking about would be better.  For tramp3d
 we are talking about cases like
 
  for (i=0; i3; ++i)
   dim[i] = i;
 
 or similar code which ends up storing to a single array.  There is
 the vn_reference_lookup_3 function in tree-ssa-sccvn.c which is
 the canonical place for value-numbering tricks.  It simply
 needs to be taught how to look through loops.
 
 I'd like to net get into the game of huge pass order dependences in
 early-opts - that just shows our scalar cleanups are too weak.
 Ideally we'd just have a single pass in early-opts...

Other motivation for getting rid of loops is to make controldata flow more
explicit for IPA analysers.
If you have code o nvectors of size 3 that does series of operations, it is very
hard to reorder/fuse/merge loops as needed for the scalar optimizations...

Honza