[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #31 from rguenther at suse dot de ---
On Wed, 28 Feb 2018, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
>
> --- Comment #27 from Jeffrey A. Law ---
> WRT c#21. There is a good paper from Click on an integrated GVN/GCM/reassoc.
>
> "Combining Analyses, Combining Optimizations" It was published in PLDI. I've
> probably got a copy here if you want it.

I've googled the thesis but cannot find the paper - so yes, if you have a
copy please send it my way.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Jeffrey A. Law changed:

           What    |Removed |Added
----------------------------------------------------------------------------
   Target Milestone|6.5     |9.0

--- Comment #30 from Jeffrey A. Law ---
OK. Moving to gcc-9. Note I think there are multiple instances of this issue
that would likely be helped if we can get IVOPTS to DTRT here. I'll try to
link them via "See Also" the next time I come across them.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #29 from amker at gcc dot gnu.org ---
(In reply to Jeffrey A. Law from comment #28)
> BTW, ISTM that we need Bin to chime in on the complexity of improving this
> in IVOPTS -- ie, is it gcc-8 or gcc-9 material. If the latter, then we
> should adjust the target milestone.
Yes, it's more like GCC 9 material. For the record, I think this could be
generalized to cover the addressing mode issue revealed by PR84037. That is,
not only non-iv addresses, but also addresses after iv rewriting could be
refined for the best selection of addressing mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #28 from Jeffrey A. Law ---
BTW, ISTM that we need Bin to chime in on the complexity of improving this in
IVOPTS -- ie, is it gcc-8 or gcc-9 material. If the latter, then we should
adjust the target milestone.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #27 from Jeffrey A. Law ---
WRT c#21. There is a good paper from Click on an integrated GVN/GCM/reassoc:

  "Combining Analyses, Combining Optimizations"

It was published in PLDI. I've probably got a copy here if you want it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #26 from amker at gcc dot gnu.org ---
Note the testcase:

void timer_stop();
volatile long keepgoing = 0;

double hand_benchmark_cache_ronly(double *x, long limit,
                                  long *oloops, double *ous)
{
  long index = 0, loops = 0;
  double sum = (double)0;
  double sum2 = (double)0;

again:
  sum += x[index] + x[index+1] + x[index+2] + x[index+3];
  sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
  if ((index += 8) < limit)
    goto again;
  else if (keepgoing)
    {
      index = 0;
      goto again;
    }

  timer_stop();
  x[0] = (double)sum + (double)sum2;
  x[1] = (double)index;
}

is equivalent to:

{
  long index = 0, loops = 0;
  double sum = (double)0;
  double sum2 = (double)0;

  do
    {
      index = 0;
      do
        {
          sum += x[index] + x[index+1] + x[index+2] + x[index+3];
          sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
        }
      while ((index += 8) < limit);
    }
  while (keepgoing);

  timer_stop();
  x[0] = (double)sum + (double)sum2;
  x[1] = (double)index;
}

Thus the inner loop can be analyzed well. The interesting thing is that we
need threading (or something similar) to canonicalize the loop -- a
transformation that often corrupted loop structure in the past.
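[Editorial sketch: the equivalence claimed above can be spot-checked with a
small harness. The wrapper names (ronly_goto, ronly_nested), the return of
the sum instead of the stores, and the keepgoing == 0 single-pass setup are
mine, not part of the original testcase.]

```c
#include <assert.h>

volatile long keepgoing = 0;     /* 0: run the outer loop exactly once */
static void timer_stop(void) {}  /* stub for the external timer hook */

/* The goto form from the comment, returning the sum instead of storing it. */
static double ronly_goto(double *x, long limit)
{
  long index = 0;
  double sum = 0.0, sum2 = 0.0;
again:
  sum += x[index] + x[index+1] + x[index+2] + x[index+3];
  sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
  if ((index += 8) < limit)
    goto again;
  else if (keepgoing)
    {
      index = 0;
      goto again;
    }
  timer_stop();
  return sum + sum2;
}

/* The claimed-equivalent nested do-while form. */
static double ronly_nested(double *x, long limit)
{
  long index = 0;
  double sum = 0.0, sum2 = 0.0;
  do
    {
      index = 0;
      do
        {
          sum += x[index] + x[index+1] + x[index+2] + x[index+3];
          sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
        }
      while ((index += 8) < limit);
    }
  while (keepgoing);
  timer_stop();
  return sum + sum2;
}
```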
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #25 from amker at gcc dot gnu.org ---
(In reply to Aldy Hernandez from comment #23)
> (In reply to amker from comment #22)
> > > So my suggestion would be to see if you can make SLSR generate
> > > TARGET_MEM_REFs based on some common infrastructure with IVOPTs.
> > Yes, I also believe the strength reduction infrastructure in IVOPTs could
> > be (easily?) extended to cover this case. Either export an interface from
> > IVOPTs and do it in SLSR etc., or simply handle such loops specially in
> > IVOPTs itself. Then it would be something like vect vs. slp. I have time
> > now and can play with this idea.
>
> Bin, do you want to take over the PR?

Yes, I can study the IVOPTs part. Though I am not sure whether this PR also
reveals opportunities in other passes.

Thanks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #24 from rguenther at suse dot de ---
On Wed, 28 Feb 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
>
> --- Comment #22 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #21)
> > One major flaw here is that IVOPTs does nothing on this loop because it
> > doesn't find any induction variables. So basically this is highlighting
> > the fact that we leave addressing mode computation to the RTL combiner.
> > SLSR was supposed to be driving the way to do addressing mode computation
> > on non-loopy code and what it does is make using autoinc/dec easier (but
> > as you saw later forwprop passes wreck somewhat with this). It would be
> > nice if IVOPTs could consider 'ind' as "induction variable" and be able
> > to do addressing mode selection on scalar code. Bin, what would be
> > required to do this?
> It depends on scev for this. I don't think this can be (easily) modeled in
> scev; ind could be arbitrarily changed by cond(). But...

Ah. Note that SCEV can now handle SESE regions (which includes a single
basic block). It should instantiate all ind+CST uses up to a common BB
"invariant" ind SSA name.

> > We'd basically consider all SSA names used in addresses as IVs (or maybe
> > only multi-uses?).
> >
> > So my suggestion would be to see if you can make SLSR generate
> > TARGET_MEM_REFs based on some common infrastructure with IVOPTs.
> Yes, I also believe the strength reduction infrastructure in IVOPTs could
> be (easily?) extended to cover this case. Either export an interface from
> IVOPTs and do it in SLSR etc., or simply handle such loops specially in
> IVOPTs itself. Then it would be something like vect vs. slp. I have time
> now and can play with this idea.

Great!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #23 from Aldy Hernandez ---
(In reply to amker from comment #22)
> > So my suggestion would be to see if you can make SLSR generate
> > TARGET_MEM_REFs based on some common infrastructure with IVOPTs.
> Yes, I also believe the strength reduction infrastructure in IVOPTs could
> be (easily?) extended to cover this case. Either export an interface from
> IVOPTs and do it in SLSR etc., or simply handle such loops specially in
> IVOPTs itself. Then it would be something like vect vs. slp. I have time
> now and can play with this idea.

Bin, do you want to take over the PR?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #22 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #21)
> One major flaw here is that IVOPTs does nothing on this loop because it
> doesn't find any induction variables. So basically this is highlighting
> the fact that we leave addressing mode computation to the RTL combiner.
> SLSR was supposed to be driving the way to do addressing mode computation
> on non-loopy code and what it does is make using autoinc/dec easier (but
> as you saw later forwprop passes wreck somewhat with this). It would be
> nice if IVOPTs could consider 'ind' as "induction variable" and be able
> to do addressing mode selection on scalar code. Bin, what would be
> required to do this?
It depends on scev for this. I don't think this can be (easily) modeled in
scev; ind could be arbitrarily changed by cond(). But...

> We'd basically consider all SSA names used in addresses as IVs (or maybe
> only multi-uses?).
>
> So my suggestion would be to see if you can make SLSR generate
> TARGET_MEM_REFs based on some common infrastructure with IVOPTs.
Yes, I also believe the strength reduction infrastructure in IVOPTs could be
(easily?) extended to cover this case. Either export an interface from
IVOPTs and do it in SLSR etc., or simply handle such loops specially in
IVOPTs itself. Then it would be something like vect vs. slp. I have time now
and can play with this idea.

Thanks,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Richard Biener changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |amker at gcc dot gnu.org,
                   |        |rguenth at gcc dot gnu.org

--- Comment #21 from Richard Biener ---
One major flaw here is that IVOPTs does nothing on this loop because it
doesn't find any induction variables. So basically this is highlighting the
fact that we leave addressing mode computation to the RTL combiner. SLSR was
supposed to be driving the way to do addressing mode computation on non-loopy
code and what it does is make using autoinc/dec easier (but as you saw later
forwprop passes wreck somewhat with this). It would be nice if IVOPTs could
consider 'ind' as "induction variable" and be able to do addressing mode
selection on scalar code. Bin, what would be required to do this?

We'd basically consider all SSA names used in addresses as IVs (or maybe
only multi-uses?).

So my suggestion would be to see if you can make SLSR generate
TARGET_MEM_REFs based on some common infrastructure with IVOPTs.

I also wonder why we have

        leaq    0(,%rdx,8), %rax
        movsd   (%rbx,%rdx,8), %xmm1
        addsd   8(%rbx,%rax), %xmm1
        ...

and not

        leaq    0(,%rdx,8), %rax
        movsd   (%rbx,%rax), %xmm1
        addsd   8(%rbx,%rax), %xmm1

possibly RTL fwprop work. And then the question is whether using
8(%rbx,%rdx,8) would really be better -- IIRC the most complex addressing
modes need more uops. Of course CSEing 0(%rbx,%rdx,8) would have enabled
using (%rax), 8(%rax), etc. as Jeff says.

As for the reassoc pass, its major issue is that it works locally (per
single-use chain) rather than globally in conjunction with CSE. The only way
it enables CSE is by sorting the chain against a common criterion, but that
criterion is similarly "local". But I'm not aware of any global CSE + reassoc
implementations / papers.

Yes, I still want to explore "lowering" GIMPLE to -fwrapv (as RTL is) late in
the pipeline (after loop opts and late VRP; some pass shuffling is necessary).
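[Editorial sketch: the CSE mentioned above -- reusing the 0(%rbx,%rdx,8)
address so the remaining loads become (%rax), 8(%rax), ... -- corresponds at
source level to hoisting the common "x + ind" computation. The function names
here are invented for illustration; nothing guarantees any given compiler
emits exactly this code.]

```c
/* Hand-CSE'd form: compute the base address x + ind once (one lea-like
   computation), so every load can use a base + constant-displacement mode. */
static double sum4_hoisted(double *x, long ind)
{
  double *p = x + ind;               /* the shared base, computed once */
  return p[0] + p[1] + p[2] + p[3];  /* displacements 0, 8, 16, 24 bytes */
}

/* The original shape, re-deriving ind + k inside each address expression. */
static double sum4_plain(double *x, long ind)
{
  return x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
}
```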
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Aldy Hernandez changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Target|i?86-*-*|i?86-*-*, x86-64

--- Comment #20 from Aldy Hernandez ---
For the record, an even smaller test that I believe shows the problem even on
x86-64:

int ind;
int cond(void);

double hand_benchmark_cache_ronly(double *x)
{
  double sum = 0.0;
  while (cond())
    sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
  return sum;
}

With -O2 we get an extra lea in the loop:

        movslq  ind(%rip), %rdx
        leaq    0(,%rdx,8), %rax        <-- BOO!
        movsd   8(%rbx,%rax), %xmm0
        addsd   (%rbx,%rdx,8), %xmm0
        addsd   16(%rbx,%rax), %xmm0
        addsd   24(%rbx,%rax), %xmm0
        addsd   8(%rsp), %xmm0
        movsd   %xmm0, 8(%rsp)

whereas with -O2 -fno-tree-slsr we get:

        movslq  ind(%rip), %rax
        movsd   8(%rbx,%rax,8), %xmm0
        addsd   (%rbx,%rax,8), %xmm0
        addsd   16(%rbx,%rax,8), %xmm0
        addsd   24(%rbx,%rax,8), %xmm0
        addsd   8(%rsp), %xmm0
        movsd   %xmm0, 8(%rsp)

The .optimized dump for -O2 shows ind*8 being CSE'd away, and the address
being calculated as "ind*8 + CST":

  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;              ;; common expression: ind*8
  _4 = x_26(D) + _3;
  _5 = *_4;
  _7 = _3 + 8;              ;; ind*8 + 8
  _8 = x_26(D) + _7;
  _9 = *_8;
  ...

whereas with -O2 -fno-tree-slsr the address is calculated as
"(ind+CST) * 8 + x":

  ind.0_1 = ind;
  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;
  _4 = x_26(D) + _3;
  _5 = *_4;
  _6 = _2 + 1;
  _7 = _6 * 8;              ;; (ind+1) * 8
  _8 = x_26(D) + _7;        ;; (ind+1) * 8 + x
  _9 = *_8;

Ironically, the -O2 gimple looks more efficient, but gets crappy addressing
on x86.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #19 from Marc Glisse ---
(In reply to Aldy Hernandez from comment #16)
> It seems tree-ssa-reassoc.c avoids reassociating most non-bit expressions
> by design (because of signed overflows): [...]
> So instead of whacking tree-ssa-reassoc.c to handle POINTER_PLUS_EXPR and
> PLUS_EXPR, etc,

There have been suggestions that we should do reassoc after some kind of
lowering, where all integer operations wrap, maybe pointer operations are
replaced with integer operations, etc.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #18 from Jeffrey A. Law ---
A couple more notes. It could also well be the case that reassociating in a
way that encourages lea could be good for x86, but bad for other targets.

I also suspect this is closely related to other BZs in the database where
we've regressed due to poor addressing mode selection. If you're able to
make progress here I'll dig them out and we can cross-check your patch
against those other bugs.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #17 from Jeffrey A. Law ---
It could well end up being a case where we need to look to see if the
expressions are likely to CSE to determine which is better. I'm not sure if
reassoc has that kind of capability.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Aldy Hernandez changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |aldyh at gcc dot gnu.org

--- Comment #16 from Aldy Hernandez ---
> Coming out of SSA for hand_benchmark_cache_ronly(), we seem to be
> calculating:
>
>   ((index + 1) * 8) + x
>   ((index + 2) * 8) + x
>   ((index + 3) * 8) + x
>   etc
>
> After slsr we have:
>
>   (index * 8) + x
>   (((index * 8) + 8) + x)
>   ((((index * 8) + 8) + 8) + x)
>
> And finally after forwprop4:
>
>   (index * 8) + x
>   (((index * 8) + 8) + x)
>   (((index * 8) + 16) + x)
>
> Are you suggesting we reassociate the above as:
>
>   ((index * 8) + CONSTANT) + x

Err, what I meant is that we should reassociate as (index * 8 + x) + CONSTANT.

It seems tree-ssa-reassoc.c avoids reassociating most non-bit expressions by
design (because of signed overflows):

  /* For non-bit or min/max operations we can't associate
     all types.  Verify that here.  */

(after the following:

  PR tree-optimization/45232
  * tree-ssa-reassoc.c (can_reassociate_p): Disable re-association
  for types with undefined overflow.
  (reassociate_bb): Allow re-associating of bit and min/max
  operations for types with undefined overflow.
  * tree-ssa-forwprop.c (associate_plusminus): New function.

)

The code that introduced the above moved some arithmetic reassociation to
reassociate_plusminus() in forwprop, which eventually landed in match.pd.

So instead of whacking tree-ssa-reassoc.c to handle POINTER_PLUS_EXPR and
PLUS_EXPR, etc., perhaps we can reassociate from match.pd early in the
compilation process. So, reassociate:

  (ind + 3) * 8 + x

into:

  (8*ind + x) + 24

and pray for the best. I'll take a look.
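[Editorial sketch: the proposed rewrite of (ind + 3) * 8 + x into
(8*ind + x) + 24 can be sanity-checked byte-wise. The function names are
invented, and the literal 8 and 24 assume an 8-byte double, as on the x86
targets discussed here.]

```c
/* Address of x[ind + 3] computed the "source" way: offset first, scale
   and base later, i.e. (ind + 3) * 8 + x after expansion. */
static double *addr_source(double *x, long ind)
{
  return x + (ind + 3);
}

/* The reassociated way: (8*ind + x) + 24.  The 8*ind + x part is the
   CSE-able subexpression shared by all the x[ind + k] accesses; 24 is a
   constant displacement that folds into the addressing mode. */
static double *addr_reassoc(double *x, long ind)
{
  char *base = (char *)x + ind * 8;   /* 8*ind + x, computed once */
  return (double *)(base + 24);       /* + CONSTANT as a displacement */
}
```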
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #15 from Aldy Hernandez ---
(In reply to Jeffrey A. Law from comment #14)
> I wonder if this could be addressed with a more reasonable address
> computation reassociation.
> ISTM that we are associating as (x + (y + c)), which seems like a mistake
> in this case.
>
> If we associated as (x + y) + c, then the (x + y) would become a common
> subexpression, eliminating a bunch of unnecessary address computations.

Coming out of SSA for hand_benchmark_cache_ronly(), we seem to be
calculating:

  ((index + 1) * 8) + x
  ((index + 2) * 8) + x
  ((index + 3) * 8) + x
  etc

After slsr we have:

  (index * 8) + x
  (((index * 8) + 8) + x)
  ((((index * 8) + 8) + 8) + x)

And finally after forwprop4:

  (index * 8) + x
  (((index * 8) + 8) + x)
  (((index * 8) + 16) + x)

Are you suggesting we reassociate the above as:

  ((index * 8) + CONSTANT) + x

Is there a preferred place to put this kind of transformation? A separate
pass after forwprop4? Much earlier? Is there a pass that already does this
kind of juggling?

BTW, it seems like a pass like tree-ssa-reassoc tries to precisely convert:

  x + (y + constant)

into:

  (x + CONSTANT) + y

which is the opposite of what we want.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Jeffrey A. Law changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |aldyh at gcc dot gnu.org,
                   |        |law at redhat dot com

--- Comment #14 from Jeffrey A. Law ---
I wonder if this could be addressed with a more reasonable address
computation reassociation.

  _14 = index.6_13 * 8;       <- here
  _16 = x_15(D) + _14;
  _17 = *_16;
  _20 = _14 + 8;
  _21 = x_15(D) + _20;
  _22 = *_21;
  _23 = _17 + _22;
  _26 = _14 + 16;
  _27 = x_15(D) + _26;

ISTM that we are associating as (x + (y + c)), which seems like a mistake in
this case.

If we associated as (x + y) + c, then the (x + y) would become a common
subexpression, eliminating a bunch of unnecessary address computations.

Of course, I'm sure there are cases where (x + (y + c)) is better. Though
ISTM that generally (x + y) + c ought to be better for the addressing modes
available on most targets. Anyway, something to think about.
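[Editorial sketch: a source-level caricature of the two associations being
compared. The function names are invented; the point is only that under
(x + y) + c the x + y subexpression is shared across accesses while c becomes
a constant displacement, whereas x + (y + c) re-derives a fresh offset for
each access.]

```c
/* x + (y + c): each access starts from a freshly computed y + c. */
static double load_right_assoc(double *x, long y, long c)
{
  return *(x + (y + c));
}

/* (x + y) + c: x + y is the common subexpression; c is a constant
   displacement that can fold into the memory operand. */
static double load_left_assoc(double *x, long y, long c)
{
  double *base = x + y;   /* computed once, reusable by every access */
  return *(base + c);
}
```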