[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 Jakub Jelinek changed: What|Removed |Added Target Milestone|4.9.1 |4.9.2 --- Comment #24 from Jakub Jelinek --- GCC 4.9.1 has been released.
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #23 from Bill Pringlemeir --- (In reply to Bill Pringlemeir from comment #22) > The good ARM assembler uses the 'mla' instruction which is a 'multiply and > accumulate'. Since this is not recognized, the multiply result needs a > temporary register to do the add with and I think this cause the extra > registers. I believe you should look to see why the 'mla' is not matched. > I don't know if the x86 has an op-code like this. Er, I see. The 'mla' comes as a result of seeing that the array index calculations can be reused. Sorry for the noise.
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 Bill Pringlemeir changed: What|Removed |Added CC||bpringlemeir at gmail dot com --- Comment #22 from Bill Pringlemeir --- The good ARM assembler uses the 'mla' instruction which is a 'multiply and accumulate'. Since this is not recognized, the multiply result needs a temporary register to do the add with and I think this cause the extra registers. I believe you should look to see why the 'mla' is not matched. I don't know if the x86 has an op-code like this.
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #21 from rguenther at suse dot de --- On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 > > --- Comment #20 from Thomas Preud'homme --- > (In reply to rguent...@suse.de from comment #19) > > On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote: > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 > > > > > > --- Comment #18 from Thomas Preud'homme > > com> --- > > > (In reply to Richard Biener from comment #17) > > > > > > > > Citing myself: > > > > > > > > On the GIMPLE level before expansion we have > > > > > > > > +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20); > > > > > > > > _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000)); > > > > > > > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000); > > > > > > > > then _41 + pretmp_20 would be fully redundant with the expression needed > > > > by _40. > > > > > > Yes I saw that but I was wondering why would reassoc try this association > > > rather than another since the header of the file doesn't mention any > > > special > > > treatment of explicit integer constants. > > > > > > Besides, wouldn't it still misses that fact that _51 = _40 + 1000? > > > > Yes. But reassoc doesn't associate across POINTER_PLUS_EXPRs. > > Is there a reason for that? Yes. It's not easy and it involves undefined overflow (reassoc doesn't associate signed arithmetic either) > > > > RTL CSE could catch it, but for it the association would have to > > be the same for both. If we start from the proposed form > > then at RTL expansion time we could associate > > pointer + (X + CST) to (pointer + X) + CST. > > Right. > > > > > Feels all somewhat hacky, of course (and relies on TER). There > > may be cases where doing the opposite is better (for example > > if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)). Association > > to make CSE possible is always hard if CSE itself cannot associate > > to maximize the number of CSE opportunities. So at the moment > > any choice is just canonicalization. > > Exactly my thought. I'm not sure if that's what you have in mind when you > write > association for CSE but I was thinking about a scheme that ressemble what > tree_to_aff_combination_expand does and organize all expanded expression to > compare them easily (read efficiently). With such a capability it would then > not be necessary to do the first replacement with forprop+reassoc+dom as > everything could be done in CSE. Yeah, but that's not how CSE on GIMPLE or RTL works right now ;) (patches welcome?) I suppose teaching reassoc to look for CSE opportunities may be easier (needs separating analysis and transform stages for the whole function). Richard.
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #20 from Thomas Preud'homme --- (In reply to rguent...@suse.de from comment #19) > On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 > > > > --- Comment #18 from Thomas Preud'homme > > --- > > (In reply to Richard Biener from comment #17) > > > > > > Citing myself: > > > > > > On the GIMPLE level before expansion we have > > > > > > +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20); > > > > > > _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000)); > > > > > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000); > > > > > > then _41 + pretmp_20 would be fully redundant with the expression needed > > > by _40. > > > > Yes I saw that but I was wondering why would reassoc try this association > > rather than another since the header of the file doesn't mention any special > > treatment of explicit integer constants. > > > > Besides, wouldn't it still misses that fact that _51 = _40 + 1000? > > Yes. But reassoc doesn't associate across POINTER_PLUS_EXPRs. Is there a reason for that? > > RTL CSE could catch it, but for it the association would have to > be the same for both. If we start from the proposed form > then at RTL expansion time we could associate > pointer + (X + CST) to (pointer + X) + CST. Right. > > Feels all somewhat hacky, of course (and relies on TER). There > may be cases where doing the opposite is better (for example > if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)). Association > to make CSE possible is always hard if CSE itself cannot associate > to maximize the number of CSE opportunities. So at the moment > any choice is just canonicalization. Exactly my thought. I'm not sure if that's what you have in mind when you write association for CSE but I was thinking about a scheme that ressemble what tree_to_aff_combination_expand does and organize all expanded expression to compare them easily (read efficiently). With such a capability it would then not be necessary to do the first replacement with forprop+reassoc+dom as everything could be done in CSE.
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #19 from rguenther at suse dot de --- On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 > > --- Comment #18 from Thomas Preud'homme --- > (In reply to Richard Biener from comment #17) > > > > Citing myself: > > > > On the GIMPLE level before expansion we have > > > > +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20); > > > > _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000)); > > > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000); > > > > then _41 + pretmp_20 would be fully redundant with the expression needed > > by _40. > > Yes I saw that but I was wondering why would reassoc try this association > rather than another since the header of the file doesn't mention any special > treatment of explicit integer constants. > > Besides, wouldn't it still misses that fact that _51 = _40 + 1000? Yes. But reassoc doesn't associate across POINTER_PLUS_EXPRs. RTL CSE could catch it, but for it the association would have to be the same for both. If we start from the proposed form then at RTL expansion time we could associate pointer + (X + CST) to (pointer + X) + CST. Feels all somewhat hacky, of course (and relies on TER). There may be cases where doing the opposite is better (for example if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)). Association to make CSE possible is always hard if CSE itself cannot associate to maximize the number of CSE opportunities. So at the moment any choice is just canonicalization. > > Note that IIRC one issue with TER is that it is no longer happening as > > there are dead stmts around that confuse its has_single_use logic. Thus > > placing a dce pass right before expand would fix that and might be a good > > idea anyway (see comment #3). Implementing a "proper" poor-mans SSA-based > > DCE would be a good way out (out-of-SSA already has one to remove dead > > PHIs). > > Ok
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #18 from Thomas Preud'homme --- (In reply to Richard Biener from comment #17) > > Citing myself: > > On the GIMPLE level before expansion we have > > +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20); > > _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000)); > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000); > > then _41 + pretmp_20 would be fully redundant with the expression needed > by _40. Yes I saw that but I was wondering why would reassoc try this association rather than another since the header of the file doesn't mention any special treatment of explicit integer constants. Besides, wouldn't it still misses that fact that _51 = _40 + 1000? > > Note that IIRC one issue with TER is that it is no longer happening as > there are dead stmts around that confuse its has_single_use logic. Thus > placing a dce pass right before expand would fix that and might be a good > idea anyway (see comment #3). Implementing a "proper" poor-mans SSA-based > DCE would be a good way out (out-of-SSA already has one to remove dead > PHIs). Ok
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #17 from Richard Biener --- (In reply to Thomas Preud'homme from comment #16) > Hi Richard, > > could you expand on what you said in comment #13? I don't see how reassoc > could help cse here. From what I understood, reassoc tries to group per > rank. In our case, we have (view of the source with loop unrolling): > > Arr_2_Par_Ref [Int_Loc] [Int_Loc] = Int_Loc; > /* some stmts */ > Arr_2_Par_Ref [Int_Loc+10] [Int_Loc] = Arr_1_Par_Ref [Int_Loc]; > > If I'm not mistaken, in the first case you'd have: > > Int_Loc * 4 > Int_Loc * 100 > Arr_2_Par_Ref > > that would be added together with several statements. However in the second > case you'd have: > > Int_Loc * 4 > Int_Loc * 100 > 1000 > Arr_2_Par_Ref > > that would be added together with several statements. I don't see how could > 1000 being added first or last, it seems to me that it's always going to be > in an intermediate statement and thus not all redanduncy would be eliminated > by CSE. > > Please let me know if my reasonning is flawed so that I can progress toward > a solution. Citing myself: On the GIMPLE level before expansion we have +40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20); _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000)); so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000); then _41 + pretmp_20 would be fully redundant with the expression needed by _40. Note that IIRC one issue with TER is that it is no longer happening as there are dead stmts around that confuse its has_single_use logic. Thus placing a dce pass right before expand would fix that and might be a good idea anyway (see comment #3). Implementing a "proper" poor-mans SSA-based DCE would be a good way out (out-of-SSA already has one to remove dead PHIs).
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #16 from Thomas Preud'homme --- Hi Richard, could you expand on what you said in comment #13? I don't see how reassoc could help cse here. From what I understood, reassoc tries to group per rank. In our case, we have (view of the source with loop unrolling): Arr_2_Par_Ref [Int_Loc] [Int_Loc] = Int_Loc; /* some stmts */ Arr_2_Par_Ref [Int_Loc+10] [Int_Loc] = Arr_1_Par_Ref [Int_Loc]; If I'm not mistaken, in the first case you'd have: Int_Loc * 4 Int_Loc * 100 Arr_2_Par_Ref that would be added together with several statements. However in the second case you'd have: Int_Loc * 4 Int_Loc * 100 1000 Arr_2_Par_Ref that would be added together with several statements. I don't see how could 1000 being added first or last, it seems to me that it's always going to be in an intermediate statement and thus not all redanduncy would be eliminated by CSE. Please let me know if my reasonning is flawed so that I can progress toward a solution.
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 --- Comment #15 from rguenther at suse dot de --- On Fri, 9 May 2014, thomas.preudhomme at arm dot com wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 > > Thomas Preud'homme changed: > >What|Removed |Added > > CC||thomas.preudhomme at arm dot > com > > --- Comment #14 from Thomas Preud'homme --- > (In reply to Steven Bosscher from comment #12) > > Annotated "bad expansion": > > ;; _40 = Arr_2_Par_Ref_22(D) + _12; > > 22: r138=r128+r121 > > 23: r127=r132+r138 // r127=Arr_2_Par_Ref+r128+r121 > > > > ;; _32 = _20 + 1000; > > 29: r124=r121+1000 > > > > ;; MEM[(int[25] *)_51 + 20B] = _34; > > 32: r141=r132+r124 // r141=Arr_2_Par_Ref+r121+1000 > > 33: r142=r141+r128 // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000) > > 34: [r142+20]=r126 > > So in gimple the two offsets are added first and then added to the pointer > while after expansion the first offset is added to the pointer and then the > second offset. Is it normal that the order of operations seems to change? Yes, that's TER at work
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 Thomas Preud'homme changed: What|Removed |Added CC||thomas.preudhomme at arm dot com --- Comment #14 from Thomas Preud'homme --- (In reply to Steven Bosscher from comment #12) > Annotated "bad expansion": > ;; _40 = Arr_2_Par_Ref_22(D) + _12; > 22: r138=r128+r121 > 23: r127=r132+r138 // r127=Arr_2_Par_Ref+r128+r121 > > ;; _32 = _20 + 1000; > 29: r124=r121+1000 > > ;; MEM[(int[25] *)_51 + 20B] = _34; > 32: r141=r132+r124 // r141=Arr_2_Par_Ref+r121+1000 > 33: r142=r141+r128 // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000) > 34: [r142+20]=r126 So in gimple the two offsets are added first and then added to the pointer while after expansion the first offset is added to the pointer and then the second offset. Is it normal that the order of operations seems to change?
[Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172 Richard Biener changed: What|Removed |Added Target Milestone|--- |4.9.1 Summary|[4.9 regression] ARM|[4.9/4.10 Regression] ARM |performance regression from |performance regression from |trunk@207239|trunk@207239