> -----Original Message-----
> From: Bin.Cheng [mailto:[email protected]]
> Sent: 20 June 2014 06:25
> To: Bingfeng Mei
> Cc: [email protected]
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
>
> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei <[email protected]> wrote:
> > Hi,
> > I am looking at a performance regression in our code. A big loop produces
> > and uses a lot of temporary variables inside the loop body. The problem
> > appears that IVOPTS pass creates even more induction variables (from
> > original 2 to 27). It causes a lot of register spilling later and performance
> Do you have a simplified case which can be posted here? I guess it
> affects some other targets too.
>
> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> > estimate_reg_pressure_cost function to take # of registers into
> > consideration. The second parameter passed as data->regs_used is supposed
> > to represent old register usage before IVOPTS.
> >
> >   return size + estimate_reg_pressure_cost (size, data->regs_used,
> >                                             data->speed,
> >                                             data->body_includes_call);
> >
> > In this case, it is mere 2 by following calculation. Essentially, it only
> > counts all loop invariant registers, ignoring all registers produced/used
> > inside the loop.
> There are two kinds of registers produced/used inside the loop. One
> is induction variable irrelevant, it includes non-linear uses as
> mentioned by Richard. The other kind relates to induction variable
> rewrite, and one issue with this kind is expression generated during
> iv use rewriting is not reflecting the estimated one in ivopt very
> well.
>
As a short-term solution, I tried some simple non-linear functions, as Richard
suggested, to penalize using too many IVs. For example, the following cost in
ivopts_global_cost_for_size fixed my regression and actually improves
performance slightly over a set of benchmarks we usually use.

  return size * (1 + size * 0.2)
         + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
                                       data->body_includes_call);
The trouble is that the choice of this non-linear function could be highly
target dependent (# of registers?). I don't have a setup to prove a
performance gain for other targets.
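
To make the difference between the two cost shapes concrete, here is a toy
sketch in C. It is only an illustration: toy_reg_pressure_cost and its
constants (16 registers, cost 4 per new register) are made up for this
example and are not GCC's estimate_reg_pressure_cost. The 0.2 factor and
regs_used = 2 are the values from the case above.

```c
/* Hypothetical stand-in for the register pressure estimate: free while
   everything fits, a flat per-IV penalty once it does not.  Both constants
   are assumptions chosen for illustration only.  */
static unsigned
toy_reg_pressure_cost (unsigned n_new, unsigned n_old)
{
  const unsigned avail_regs = 16;  /* assumed number of usable registers */
  const unsigned spill_cost = 4;   /* assumed cost per new register */

  if (n_new + n_old <= avail_regs)
    return 0;
  return spill_cost * n_new;
}

/* Cost shape currently used: linear in the number of IVs.  */
static unsigned
linear_cost (unsigned size, unsigned regs_used)
{
  return size + toy_reg_pressure_cost (size, regs_used);
}

/* Experimental shape: quadratic penalty on the number of IVs.  The
   floating-point result is truncated when converted back to unsigned.  */
static unsigned
nonlinear_cost (unsigned size, unsigned regs_used)
{
  return (unsigned) (size * (1 + size * 0.2))
         + toy_reg_pressure_cost (size, regs_used);
}
```

With regs_used = 2 in this toy model, both shapes agree for 2 IVs, but the
non-linear one already penalizes 14 IVs heavily even though the pressure
term is still zero there. That is the effect I was after, since regs_used is
underestimated in my case.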
I also tried counting all SSA names and dividing the count by a factor. It
does not seem to work so well.
Long term, if we had infrastructure to analyze the maximal number of live
variables in a loop at the tree level, that would be great for many loop
optimizations.
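
As a rough idea of what such an analysis would compute, here is a minimal
sketch in C. It assumes a straight-line loop body in which every value has a
single definition point and a known last use; struct live_range and max_live
are hypothetical names, not existing GCC infrastructure. Control flow and
loop-carried values are ignored, so a real implementation would need a
proper liveness dataflow.

```c
#include <stddef.h>

/* Hypothetical live range: a value defined at statement index DEF and
   last used at statement index LAST_USE (both inclusive).  */
struct live_range
{
  unsigned def;
  unsigned last_use;
};

/* Return the maximum number of ranges that are live at the same statement
   in a straight-line body of N_STMTS statements.  Quadratic, which is
   fine for a sketch.  */
static unsigned
max_live (const struct live_range *ranges, size_t n_ranges, unsigned n_stmts)
{
  unsigned best = 0;

  for (unsigned s = 0; s < n_stmts; s++)
    {
      unsigned live = 0;

      for (size_t i = 0; i < n_ranges; i++)
        if (ranges[i].def <= s && s <= ranges[i].last_use)
          live++;

      if (live > best)
        best = live;
    }

  return best;
}
```

A number like this, computed before IVOPTS, could replace or complement
regs_used as the "old" register usage passed to the cost function.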
Thanks,
Bingfeng