[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-07-16 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|4.9.1   |4.9.2

--- Comment #24 from Jakub Jelinek  ---
GCC 4.9.1 has been released.


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-06-18 Thread bpringlemeir at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #23 from Bill Pringlemeir  ---
(In reply to Bill Pringlemeir from comment #22)
> The good ARM assembler uses the 'mla' instruction which is a 'multiply and
> accumulate'.  Since this is not recognized, the multiply result needs a
> temporary register to do the add with and I think this cause the extra
> registers.  I believe you should look to see why the 'mla' is not matched. 
> I don't know if the x86 has an op-code like this.

Er, I see.  The 'mla' comes as a result of seeing that the array index
calculations can be reused.  Sorry for the noise.


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-06-18 Thread bpringlemeir at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Bill Pringlemeir  changed:

   What|Removed |Added

 CC||bpringlemeir at gmail dot com

--- Comment #22 from Bill Pringlemeir  ---
The good ARM assembler uses the 'mla' instruction which is a 'multiply and
accumulate'.  Since this is not recognized, the multiply result needs a
temporary register to do the add with and I think this cause the extra
registers.  I believe you should look to see why the 'mla' is not matched.  I
don't know if the x86 has an op-code like this.


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #21 from rguenther at suse dot de  ---
On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> --- Comment #20 from Thomas Preud'homme  ---
> (In reply to rguent...@suse.de from comment #19)
> > On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> > > 
> > > --- Comment #18 from Thomas Preud'homme  > > com> ---
> > > (In reply to Richard Biener from comment #17)
> > > > 
> > > > Citing myself:
> > > > 
> > > > On the GIMPLE level before expansion we have
> > > > 
> > > >  +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> > > > 
> > > >  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> > > > 
> > > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> > > > 
> > > > then _41 + pretmp_20 would be fully redundant with the expression needed
> > > > by _40.
> > > 
> > > Yes I saw that but I was wondering why would reassoc try this association
> > > rather than another since the header of the file doesn't mention any 
> > > special
> > > treatment of explicit integer constants.
> > > 
> > > Besides, wouldn't it still misses that fact that _51 = _40 + 1000?
> > 
> > Yes.  But reassoc doesn't associate across POINTER_PLUS_EXPRs.
> 
> Is there a reason for that?

Yes.  It's not easy and it involves undefined overflow (reassoc
doesn't associate signed arithmetic either)

> > 
> > RTL CSE could catch it, but for it the association would have to
> > be the same for both.  If we start from the proposed form
> > then at RTL expansion time we could associate
> > pointer + (X + CST) to (pointer + X) + CST.
> 
> Right.
> 
> > 
> > Feels all somewhat hacky, of course (and relies on TER).  There
> > may be cases where doing the opposite is better (for example
> > if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)).  Association
> > to make CSE possible is always hard if CSE itself cannot associate
> > to maximize the number of CSE opportunities.  So at the moment
> > any choice is just canonicalization.
> 
> Exactly my thought. I'm not sure if that's what you have in mind when you 
> write
> association for CSE but I was thinking about a scheme that ressemble what
> tree_to_aff_combination_expand does and organize all expanded expression to
> compare them easily (read efficiently). With such a capability it would then
> not be necessary to do the first replacement with forprop+reassoc+dom as
> everything could be done in CSE.

Yeah, but that's not how CSE on GIMPLE or RTL works right now ;)
(patches welcome?)  I suppose teaching reassoc to look for CSE
opportunities may be easier (needs separating analysis and transform
stages for the whole function).

Richard.


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-15 Thread thomas.preudhomme at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #20 from Thomas Preud'homme  ---
(In reply to rguent...@suse.de from comment #19)
> On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> > 
> > --- Comment #18 from Thomas Preud'homme  
> > ---
> > (In reply to Richard Biener from comment #17)
> > > 
> > > Citing myself:
> > > 
> > > On the GIMPLE level before expansion we have
> > > 
> > >  +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> > > 
> > >  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> > > 
> > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> > > 
> > > then _41 + pretmp_20 would be fully redundant with the expression needed
> > > by _40.
> > 
> > Yes I saw that but I was wondering why would reassoc try this association
> > rather than another since the header of the file doesn't mention any special
> > treatment of explicit integer constants.
> > 
> > Besides, wouldn't it still misses that fact that _51 = _40 + 1000?
> 
> Yes.  But reassoc doesn't associate across POINTER_PLUS_EXPRs.

Is there a reason for that?

> 
> RTL CSE could catch it, but for it the association would have to
> be the same for both.  If we start from the proposed form
> then at RTL expansion time we could associate
> pointer + (X + CST) to (pointer + X) + CST.

Right.

> 
> Feels all somewhat hacky, of course (and relies on TER).  There
> may be cases where doing the opposite is better (for example
> if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)).  Association
> to make CSE possible is always hard if CSE itself cannot associate
> to maximize the number of CSE opportunities.  So at the moment
> any choice is just canonicalization.

Exactly my thought. I'm not sure if that's what you have in mind when you write
association for CSE but I was thinking about a scheme that ressemble what
tree_to_aff_combination_expand does and organize all expanded expression to
compare them easily (read efficiently). With such a capability it would then
not be necessary to do the first replacement with forprop+reassoc+dom as
everything could be done in CSE.


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #19 from rguenther at suse dot de  ---
On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> --- Comment #18 from Thomas Preud'homme  ---
> (In reply to Richard Biener from comment #17)
> > 
> > Citing myself:
> > 
> > On the GIMPLE level before expansion we have
> > 
> >  +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> > 
> >  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> > 
> > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> > 
> > then _41 + pretmp_20 would be fully redundant with the expression needed
> > by _40.
> 
> Yes I saw that but I was wondering why would reassoc try this association
> rather than another since the header of the file doesn't mention any special
> treatment of explicit integer constants.
> 
> Besides, wouldn't it still misses that fact that _51 = _40 + 1000?

Yes.  But reassoc doesn't associate across POINTER_PLUS_EXPRs.

RTL CSE could catch it, but for it the association would have to
be the same for both.  If we start from the proposed form
then at RTL expansion time we could associate
pointer + (X + CST) to (pointer + X) + CST.

Feels all somewhat hacky, of course (and relies on TER).  There
may be cases where doing the opposite is better (for example
if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)).  Association
to make CSE possible is always hard if CSE itself cannot associate
to maximize the number of CSE opportunities.  So at the moment
any choice is just canonicalization.

> > Note that IIRC one issue with TER is that it is no longer happening as
> > there are dead stmts around that confuse its has_single_use logic.  Thus
> > placing a dce pass right before expand would fix that and might be a good
> > idea anyway (see comment #3).  Implementing a "proper" poor-mans SSA-based
> > DCE would be a good way out (out-of-SSA already has one to remove dead
> > PHIs).
> 
> Ok


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-15 Thread thomas.preudhomme at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #18 from Thomas Preud'homme  ---
(In reply to Richard Biener from comment #17)
> 
> Citing myself:
> 
> On the GIMPLE level before expansion we have
> 
>  +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> 
>  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> 
> so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> 
> then _41 + pretmp_20 would be fully redundant with the expression needed
> by _40.

Yes I saw that but I was wondering why would reassoc try this association
rather than another since the header of the file doesn't mention any special
treatment of explicit integer constants.

Besides, wouldn't it still misses that fact that _51 = _40 + 1000?

> 
> Note that IIRC one issue with TER is that it is no longer happening as
> there are dead stmts around that confuse its has_single_use logic.  Thus
> placing a dce pass right before expand would fix that and might be a good
> idea anyway (see comment #3).  Implementing a "proper" poor-mans SSA-based
> DCE would be a good way out (out-of-SSA already has one to remove dead
> PHIs).

Ok


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #17 from Richard Biener  ---
(In reply to Thomas Preud'homme from comment #16)
> Hi Richard,
> 
> could you expand on what you said in comment #13? I don't see how reassoc
> could help cse here. From what I understood, reassoc tries to group per
> rank. In our case, we have (view of the source with loop unrolling):
> 
> Arr_2_Par_Ref [Int_Loc] [Int_Loc] = Int_Loc;
> /* some stmts */
> Arr_2_Par_Ref [Int_Loc+10] [Int_Loc] = Arr_1_Par_Ref [Int_Loc];
> 
> If I'm not mistaken, in the first case you'd have:
> 
> Int_Loc * 4
> Int_Loc * 100
> Arr_2_Par_Ref
> 
> that would be added together with several statements. However in the second
> case you'd have:
> 
> Int_Loc * 4
> Int_Loc * 100
> 1000
> Arr_2_Par_Ref
> 
> that would be added together with several statements. I don't see how could
> 1000 being added first or last, it seems to me that it's always going to be
> in an intermediate statement and thus not all redanduncy would be eliminated
> by CSE.
> 
> Please let me know if my reasonning is flawed so that I can progress toward
> a solution.

Citing myself:

On the GIMPLE level before expansion we have

 +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);

 _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));

so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);

then _41 + pretmp_20 would be fully redundant with the expression needed
by _40.

Note that IIRC one issue with TER is that it is no longer happening as
there are dead stmts around that confuse its has_single_use logic.  Thus
placing a dce pass right before expand would fix that and might be a good
idea anyway (see comment #3).  Implementing a "proper" poor-mans SSA-based
DCE would be a good way out (out-of-SSA already has one to remove dead
PHIs).


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-14 Thread thomas.preudhomme at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #16 from Thomas Preud'homme  ---
Hi Richard,

could you expand on what you said in comment #13? I don't see how reassoc could
help cse here. From what I understood, reassoc tries to group per rank. In our
case, we have (view of the source with loop unrolling):

Arr_2_Par_Ref [Int_Loc] [Int_Loc] = Int_Loc;
/* some stmts */
Arr_2_Par_Ref [Int_Loc+10] [Int_Loc] = Arr_1_Par_Ref [Int_Loc];

If I'm not mistaken, in the first case you'd have:

Int_Loc * 4
Int_Loc * 100
Arr_2_Par_Ref

that would be added together with several statements. However in the second
case you'd have:

Int_Loc * 4
Int_Loc * 100
1000
Arr_2_Par_Ref

that would be added together with several statements. I don't see how could
1000 being added first or last, it seems to me that it's always going to be in
an intermediate statement and thus not all redanduncy would be eliminated by
CSE.

Please let me know if my reasonning is flawed so that I can progress toward a
solution.


[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-09 Thread rguenther at suse dot de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #15 from rguenther at suse dot de  ---
On Fri, 9 May 2014, thomas.preudhomme at arm dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> Thomas Preud'homme  changed:
> 
>What|Removed |Added
> 
>  CC||thomas.preudhomme at arm dot 
> com
> 
> --- Comment #14 from Thomas Preud'homme  ---
> (In reply to Steven Bosscher from comment #12)
> > Annotated "bad expansion":
> > ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> > 22: r138=r128+r121  
> > 23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121
> > 
> > ;; _32 = _20 + 1000;
> > 29: r124=r121+1000
> > 
> > ;; MEM[(int[25] *)_51 + 20B] = _34;
> > 32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
> > 33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)
> > 34: [r142+20]=r126
> 
> So in gimple the two offsets are added first and then added to the pointer
> while after expansion the first offset is added to the pointer and then the
> second offset. Is it normal that the order of operations seems to change?

Yes, that's TER at work

[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-05-09 Thread thomas.preudhomme at arm dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Thomas Preud'homme  changed:

   What|Removed |Added

 CC||thomas.preudhomme at arm dot 
com

--- Comment #14 from Thomas Preud'homme  ---
(In reply to Steven Bosscher from comment #12)
> Annotated "bad expansion":
> ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> 22: r138=r128+r121
> 23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121
> 
> ;; _32 = _20 + 1000;
> 29: r124=r121+1000
> 
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
> 33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)
> 34: [r142+20]=r126

So in gimple the two offsets are added first and then added to the pointer
while after expansion the first offset is added to the pointer and then the
second offset. Is it normal that the order of operations seems to change?

[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239

2014-04-14 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |4.9.1
Summary|[4.9 regression] ARM|[4.9/4.10 Regression] ARM
   |performance regression from |performance regression from
   |trunk@207239|trunk@207239