[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-03-01 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #31 from rguenther at suse dot de  ---
On Wed, 28 Feb 2018, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
> 
> --- Comment #27 from Jeffrey A. Law  ---
> WRT c#21.  There is a good paper from Click on an integrated GVN/GCM/reassoc.
> 
> "Combining Analyses, Combining Optimizations"  It was published in PLDI. I've
> probably got a copy here if you want it.

I've googled the thesis but cannot find the paper - so yes, if you have
a copy please send it my way.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Jeffrey A. Law  changed:

   What             |Removed |Added
   Target Milestone |6.5     |9.0

--- Comment #30 from Jeffrey A. Law  ---
OK.  Moving to gcc-9.  Note I think there are multiple instances of this issue
that would likely be helped if we can get IVOPTS to DTRT here.  I'll try to
link them via "See Also" the next time I come across them.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #29 from amker at gcc dot gnu.org ---
(In reply to Jeffrey A. Law from comment #28)
> BTW, ISTM that we need Bin to chime in on the complexity of improving this
> in IVOPTS -- ie, is it gcc-8 or gcc-9 material.  If the latter, then we
> should adjust the target milestone.

Yes, it's more like GCC 9 material.
For the record, I think this could be generalized to cover the addressing mode
issue revealed by PR84037.  That is, not only non-IV addresses but also
addresses after iv_rewrite could be refined for the best selection of
addressing mode.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #28 from Jeffrey A. Law  ---
BTW, ISTM that we need Bin to chime in on the complexity of improving this in
IVOPTS -- ie, is it gcc-8 or gcc-9 material.  If the latter, then we should
adjust the target milestone.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #27 from Jeffrey A. Law  ---
WRT c#21.  There is a good paper from Click on an integrated GVN/GCM/reassoc.

"Combining Analyses, Combining Optimizations"  It was published in PLDI. I've
probably got a copy here if you want it.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #26 from amker at gcc dot gnu.org ---
Note the testcase:

void timer_stop();

volatile long keepgoing = 0;
double hand_benchmark_cache_ronly (double *x, long limit, long *oloops,
                                   double *ous)
{
  long index = 0, loops = 0;
  double sum = (double)0;
  double sum2 = (double)0;
again:
  sum += x[index] + x[index+1] + x[index+2] + x[index+3];
  sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
  if ((index += 8) < limit) goto again;
  else if (keepgoing) {
    index = 0;
    goto again;
  }
  timer_stop();
  x[0] = (double)sum + (double)sum2;
  x[1] = (double)index;
}

is equivalent to:

{
  long index = 0, loops = 0;
  double sum = (double)0;
  double sum2 = (double)0;
  do {
    index = 0;
    do {
      sum += x[index] + x[index+1] + x[index+2] + x[index+3];
      sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
    } while ((index += 8) < limit);
  } while (keepgoing);

  timer_stop();
  x[0] = (double)sum + (double)sum2;
  x[1] = (double)index;
}

Thus the inner loop can be analyzed properly.

The interesting thing is that we need threading (or something similar) to
canonicalize the loop here -- a transformation that has often corrupted loop
structure before.
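
For reference, here is a hand-written sketch (not compiler output;
inner_loop_sketch is a made-up helper) of the single-pointer-IV shape one
would hope the canonicalized inner loop could be strength-reduced to:

static double
inner_loop_sketch (const double *x, long limit)
{
  const double *p = x;
  const double *end = x + limit;         /* index < limit  <=>  p < x + limit */
  double sum = 0.0, sum2 = 0.0;
  do
    {
      sum  += p[0] + p[1] + p[2] + p[3];   /* constant offsets off one base */
      sum2 += p[4] + p[5] + p[6] + p[7];
      p += 8;                              /* single pointer induction variable */
    }
  while (p < end);
  return sum + sum2;
}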

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #25 from amker at gcc dot gnu.org ---
(In reply to Aldy Hernandez from comment #23)
> (In reply to amker from comment #22)
> 
> > > So my suggestion would be to see if you can make SLSR generate
> > > TARGET_MEM_REFs
> > > based on some common infrastructure with IVOPTs.
> > Yes, I also believe strength reduction infrastructure in IVOPTs could be
> > (easily?) extended to cover this case.  Either export interface from IVOPTs
> > and do it in SLSR etc., or simply handle such loops specially in IVOPTs it
> > self.  Then it would be something like vect vs slp.  I have time now and can
> > play with this idea.
> 
> Bin, do you want to take over the PR?

Yes, I can study the IVOPTs part, though I'm not sure whether this PR also
reveals opportunities in other passes.  Thanks

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #24 from rguenther at suse dot de  ---
On Wed, 28 Feb 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
> 
> --- Comment #22 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #21)
> > One major flaw here is that IVOPTs does nothing on this loop because it
> > doesn't find any induction variables.  So basically this is highlighting the
> > fact that
> > we leave addressing mode computation to the RTL combiner.  SLSR was supposed
> > to be driving the way to do addressing mode computation on non-loopy code
> > and what it does is make using autoinc/dec easier (but as you saw later
> > forwprop passes wreck somewhat with this).  It would be nice if IVOPTs
> > could consider 'ind' as "induction variable" and be able to do addressing
> > mode selection on scalar code.  Bin, what would be required to do this?
> it depends on scev for this.  I don't think this can (or easily) be modeled in
> scev, ind could be arbitrarily changed by cond().  But...

Ah.  Note that SCEV can now handle SESE regions (which includes
a single basic-block).  It should instantiate all ind+CST uses
up to a common BB "invariant" ind SSA name.

> > We'd basically consider all SSA names used in addresses as IVs (or maybe
> > only multi-uses?).
> > 
> > So my suggestion would be to see if you can make SLSR generate
> > TARGET_MEM_REFs
> > based on some common infrastructure with IVOPTs.
> Yes, I also believe strength reduction infrastructure in IVOPTs could be
> (easily?) extended to cover this case.  Either export interface from IVOPTs 
> and
> do it in SLSR etc., or simply handle such loops specially in IVOPTs it self. 
> Then it would be something like vect vs slp.  I have time now and can play 
> with
> this idea.

Great!

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #23 from Aldy Hernandez  ---
(In reply to amker from comment #22)

> > So my suggestion would be to see if you can make SLSR generate
> > TARGET_MEM_REFs
> > based on some common infrastructure with IVOPTs.
> Yes, I also believe strength reduction infrastructure in IVOPTs could be
> (easily?) extended to cover this case.  Either export interface from IVOPTs
> and do it in SLSR etc., or simply handle such loops specially in IVOPTs it
> self.  Then it would be something like vect vs slp.  I have time now and can
> play with this idea.

Bin, do you want to take over the PR?

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #22 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #21)
> One major flaw here is that IVOPTs does nothing on this loop because it
> doesn't find any induction variables.  So basically this is highlighting the
> fact that
> we leave addressing mode computation to the RTL combiner.  SLSR was supposed
> to be driving the way to do addressing mode computation on non-loopy code
> and what it does is make using autoinc/dec easier (but as you saw later
> forwprop passes wreck somewhat with this).  It would be nice if IVOPTs
> could consider 'ind' as "induction variable" and be able to do addressing
> mode selection on scalar code.  Bin, what would be required to do this?
It depends on SCEV for this.  I don't think this can be modeled in SCEV (at
least not easily); ind could be arbitrarily changed by cond().  But...

> We'd basically consider all SSA names used in addresses as IVs (or maybe
> only multi-uses?).
> 
> So my suggestion would be to see if you can make SLSR generate
> TARGET_MEM_REFs
> based on some common infrastructure with IVOPTs.
Yes, I also believe the strength reduction infrastructure in IVOPTs could be
(easily?) extended to cover this case.  Either export an interface from IVOPTs
and do it in SLSR etc., or simply handle such loops specially in IVOPTs itself.
Then it would be something like vect vs. slp.  I have time now and can play
with this idea.

Thanks,


[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Richard Biener  changed:

   What             |Removed |Added
   CC               |        |amker at gcc dot gnu.org,
                    |        |rguenth at gcc dot gnu.org

--- Comment #21 from Richard Biener  ---
One major flaw here is that IVOPTs does nothing on this loop because it doesn't
find any induction variables.  So basically this is highlighting the fact that
we leave addressing mode computation to the RTL combiner.  SLSR was supposed
to be driving the way to do addressing mode computation on non-loopy code
and what it does is make using autoinc/dec easier (but as you saw later
forwprop passes wreck somewhat with this).  It would be nice if IVOPTs
could consider 'ind' as "induction variable" and be able to do addressing
mode selection on scalar code.  Bin, what would be required to do this?
We'd basically consider all SSA names used in addresses as IVs (or maybe
only multi-uses?).

So my suggestion would be to see if you can make SLSR generate TARGET_MEM_REFs
based on some common infrastructure with IVOPTs.

I also wonder why we have

leaq    0(,%rdx,8), %rax
movsd   (%rbx,%rdx,8), %xmm1
addsd   8(%rbx,%rax), %xmm1
...

and not

   leaq    0(,%rdx,8), %rax
   movsd   (%rbx,%rax), %xmm1
   addsd   8(%rbx,%rax), %xmm1

possibly RTL fwprop work.  And then the question is whether using
8(%rbx,%rdx,8) would really be better -- IIRC the most complex addressing
modes need more uops.  Of course CSEing 0(%rbx,%rdx,8) would have enabled
using (%rax), 8(%rax), etc., as Jeff says.

So for the reassoc pass, its major issue is that it works locally (per
single-use chain) rather than globally in conjunction with CSE.  The
only way it enables CSE is by sorting the chain against common criteria,
but those criteria are similarly "local".  But I'm not aware of any
global CSE + reassoc implementations / papers.

Yes, I still want to explore "lowering" GIMPLE to -fwrapv (as RTL is) late
in the pipeline (after loop and late VRP, some pass shuffling is necessary).

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-28 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Aldy Hernandez  changed:

   What             |Removed  |Added
   Target           |i?86-*-* |i?86-*-*, x86-64

--- Comment #20 from Aldy Hernandez  ---
For the record, an even smaller test that I believe shows the problem even on
x86-64:

int ind;
int cond(void);

double hand_benchmark_cache_ronly (double *x)
{
  double sum = 0.0;
  while (cond ())
    sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
  return sum;
}

with -O2 we get an extra lea in the loop:

movslq  ind(%rip), %rdx
leaq    0(,%rdx,8), %rax        <-- BOO!
movsd   8(%rbx,%rax), %xmm0
addsd   (%rbx,%rdx,8), %xmm0
addsd   16(%rbx,%rax), %xmm0
addsd   24(%rbx,%rax), %xmm0
addsd   8(%rsp), %xmm0
movsd   %xmm0, 8(%rsp)

whereas with -O2 -fno-tree-slsr we get:

movslq  ind(%rip), %rax
movsd   8(%rbx,%rax,8), %xmm0
addsd   (%rbx,%rax,8), %xmm0
addsd   16(%rbx,%rax,8), %xmm0
addsd   24(%rbx,%rax,8), %xmm0
addsd   8(%rsp), %xmm0
movsd   %xmm0, 8(%rsp)

The .optimized dump for -O2 shows ind*8 being CSE'd away, and the address being
calculated as "ind*8 + CST":

  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;  ;; common expression: ind*8
  _4 = x_26(D) + _3;
  _5 = *_4;
  _7 = _3 + 8;  ;; ind*8 + 8
  _8 = x_26(D) + _7;
  _9 = *_8;
...

Whereas with -O2 -fno-tree-slsr, the address is calculated as "(ind+CST) * 8 + x":

  ind.0_1 = ind;
  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;
  _4 = x_26(D) + _3;
  _5 = *_4;
  _6 = _2 + 1;
  _7 = _6 * 8;  ;; (ind+1) * 8
  _8 = x_26(D) + _7;;; (ind+1) * 8 + x
  _9 = *_8;

Ironically the -O2 gimple looks more efficient, but gets crappy addressing on
x86.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-27 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #19 from Marc Glisse  ---
(In reply to Aldy Hernandez from comment #16)
> It seems tree-ssa-reassoc.c avoids reassociating most non-bit expressions by
> design (because of signed overflows):
[...]
> So instead of whacking tree-ssa-reassoc.c to handle POINTER_PLUS_EXPR and
> PLUS_EXPR, etc,

There have been suggestions that we should do reassoc after some kind of
lowering, where all integer operations wrap, maybe pointer operations are
replaced with integer operations, etc.
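
A tiny C illustration (reassoc_ok/reassoc_questionable are made-up names) of
why the wrapping lowering matters: rewriting a + (b + 1) as (a + 1) + b can
introduce a signed overflow in the intermediate (a + 1) that the original
expression never performed, which is why reassoc has to punt on signed types
today, whereas wrapping arithmetic makes the two associations equivalent:

/* Unsigned arithmetic wraps, so both associations compute the same value
   for every input.  */
unsigned reassoc_ok (unsigned a, unsigned b)
{
  return (a + 1u) + b;          /* same as a + (b + 1u) */
}

/* Creating this from "a + (b + 1)" is not valid for signed int: with
   a == INT_MAX and b == -1 the original never overflows, but the
   intermediate (a + 1) here would be undefined behavior.  */
int reassoc_questionable (int a, int b)
{
  return (a + 1) + b;
}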

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-27 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #18 from Jeffrey A. Law  ---
A couple more notes.  It could also well be the case that reassociating in a
way that encourages lea could be good for x86, but bad for other targets.

I also suspect this is closely related to other BZs in the database where we've
regressed due to poor addressing mode selections.  If you're able to make
progress here I'll dig them out and we can cross-check your patch against those
other bugs.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-27 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #17 from Jeffrey A. Law  ---
It could well end up being a case where we need to look to see if the
expressions are likely to CSE to determine which is better.

I'm not sure if reassoc has that kind of capability.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-27 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Aldy Hernandez  changed:

   What             |Removed                        |Added
   Assignee         |unassigned at gcc dot gnu.org  |aldyh at gcc dot gnu.org

--- Comment #16 from Aldy Hernandez  ---
> Coming out of SSA for hand_benchmark_cache_ronly(), we seem to be
> calculating:
> 
> ((index + 1) * 8) + x
> ((index + 2) * 8) + x
> ((index + 3) * 8) + x
> etc
> 
> After slsr we have:
> 
> (index * 8) + x
> (((index * 8) + 8) + x)
> ((((index * 8) + 8) + 8) + x)
> 
> And finally after forwprop4:
> 
> (index * 8) + x
> (((index * 8) + 8) + x)
> (((index * 8) + 16) + x)
> 
> Are you suggesting we reassociate the above as:
> 
> ((index * 8) + CONSTANT) + x

Err, what I meant is that we should reassociate as (index * 8 + x) + CONSTANT.

It seems tree-ssa-reassoc.c avoids reassociating most non-bit expressions by
design (because of signed overflows):

  /* For non-bit or min/max operations we can't associate
 all types.  Verify that here.  */

(After the following:

PR tree-optimization/45232
* tree-ssa-reassoc.c (can_reassociate_p): Disable re-association
for types with undefined overflow.
(reassociate_bb): Allow re-associating of bit and min/max
operations for types with undefined overflow.
* tree-ssa-forwprop.c (associate_plusminus): New function.
)

The change that introduced the above moved some arithmetic reassociation to
associate_plusminus() in forwprop, which eventually landed in match.pd.  So
instead of whacking tree-ssa-reassoc.c to handle POINTER_PLUS_EXPR and
PLUS_EXPR, etc., perhaps we can reassociate from match.pd early in the
compilation process.

So, reassociate:

(ind + 3) * 8 + x

into:

(8*ind + x) + 24

And pray for the best.  I'll take a look.
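
To make that concrete, a minimal C-level illustration (load_before/load_after
are hypothetical helpers, not part of the testcase) of one such address
written both ways:

/* (ind + 3) * 8 + x: the constant is folded into the index before scaling,
   so each x[ind+K] access computes its own scaled address.  */
double load_before (double *x, long ind)
{
  return x[ind + 3];
}

/* (8*ind + x) + 24: the scaled base is exposed and can be shared (CSEd)
   across the neighbouring x[ind+K] accesses; only the constant differs.  */
double load_after (double *x, long ind)
{
  double *base = x + ind;
  return base[3];
}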

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-02-26 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #15 from Aldy Hernandez  ---
(In reply to Jeffrey A. Law from comment #14)
> I wonder if this could be addressed with a more reasonable address
> computation reassociation.

> ISTM that we associating as (x + (y + c)) which seems like a mistake in this
> case.
> 
> If we associated as (x + y) + c, then the (x + y) would become a common
> subexpression eliminating a bunch of unnecessary address computations.

Coming out of SSA for hand_benchmark_cache_ronly(), we seem to be calculating:

((index + 1) * 8) + x
((index + 2) * 8) + x
((index + 3) * 8) + x
etc

After slsr we have:

(index * 8) + x
(((index * 8) + 8) + x)
((((index * 8) + 8) + 8) + x)

And finally after forwprop4:

(index * 8) + x
(((index * 8) + 8) + x)
(((index * 8) + 16) + x)

Are you suggesting we reassociate the above as:

((index * 8) + CONSTANT) + x

Is there a preferred place to put this kind of transformation?  A separate pass
after forwprop4?  Much earlier?  Is there a pass that already does this kind of
juggling?

BTW, it seems that a pass like tree-ssa-reassoc tries to convert precisely

x + (y + constant)  into  (x + CONSTANT) + y

which is the opposite of what we want.

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2018-01-17 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Jeffrey A. Law  changed:

   What             |Removed |Added
   CC               |        |aldyh at gcc dot gnu.org,
                    |        |law at redhat dot com

--- Comment #14 from Jeffrey A. Law  ---
I wonder if this could be addressed with a more reasonable address computation
reassociation.

  _14 = index.6_13 * 8; <- here
  _16 = x_15(D) + _14;
  _17 = *_16;
  _20 = _14 + 8;
  _21 = x_15(D) + _20;
  _22 = *_21;
  _23 = _17 + _22;
  _26 = _14 + 16;
  _27 = x_15(D) + _26;

ISTM that we are associating as (x + (y + c)), which seems like a mistake in
this case.

If we associated as (x + y) + c, then (x + y) would become a common
subexpression, eliminating a bunch of unnecessary address computations.

Of course, I'm sure there are cases where (x + (y + c)) is better.  Though ISTM
that generally (x + y) + c ought to be better for the addressing modes
available on most targets.

Anyway, something to think about.