[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

Richard Biener changed:

           What    |Removed |Added
           ----------------------------------------
           Status  |NEW     |ASSIGNED

Ramana Radhakrishnan changed:

           What             |Removed     |Added
           -------------------------------------------------------------
           Status           |UNCONFIRMED |NEW
           Last reconfirmed |            |2016-03-16
           CC               |            |ramana at gcc dot gnu.org
           Ever confirmed   |0           |1

--- Comment #16 from Ramana Radhakrishnan ---
Confirmed then.

--- Comment #15 from Doug Gilmore ---
> I had a patch too, will send it for review in GCC7 if it's still
> needed.

Sorry, I got sidetracked last week and didn't make much progress.
Please go ahead and submit if you have something you feel comfortable
with; I'll assist in testing.

Thanks,

--- Comment #14 from amker at gcc dot gnu.org ---
(In reply to Doug Gilmore from comment #13)
> I think this should be fairly straightforward to fix in the
> autovectorization pass.  Hopefully I should be able to post a patch
> in the next few days.

I had a patch too; I'll send it for review for GCC 7 if it's still
needed.  Thanks.

--- Comment #13 from Doug Gilmore ---
I think this should be fairly straightforward to fix in the
autovectorization pass.  Hopefully I should be able to post a patch
in the next few days.

--- Comment #12 from Doug Gilmore ---
> Yes, I proposed some cleanup passes after vectorization, but richi
> thinks it's generally expensive.  So what's the implementation
> complexity of pass_dominator?

One thing we might consider is only enabling it when vectorization is
run on architectures where the cleanup is needed.  I plan to send an
RFC comment with my patch to see what objections there are to that
approach, though beforehand I'd like to investigate what could be done
in the vectorizer so that it doesn't generate code that contains false
dependencies.

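[Editor's note] To make the false-dependency point concrete, here is a hedged C sketch (variable names are invented for illustration, not taken from the GCC dumps) of the kind of pre-header redundancy such a cleanup pass would fold away:

```c
#include <stddef.h>

/* Sketch only: the same byte offset is materialized once per memory
   reference, so the dy load and dy store end up with textually
   different bases even though they address the same memory.  A
   CSE/dominator-style cleanup would fold off_store into off_load,
   making the equality syntactically obvious to later passes such as
   IVOPTS. */
int dy_bases_match(float *dy, long prologue_iters)
{
    long off_load  = prologue_iters * (long)sizeof(float); /* for the load  */
    long off_store = prologue_iters * (long)sizeof(float); /* for the store */
    char *vp_load  = (char *)dy + off_load;
    char *vp_store = (char *)dy + off_store;
    /* Trivially equal, but not to a pass that only compares SSA names. */
    return vp_load == vp_store;
}
```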
--- Comment #11 from amker at gcc dot gnu.org ---
(In reply to Doug Gilmore from comment #10)
> Created attachment 37681 [details]
> prototype fix
>
> > 1) we failed to recognize that use 0 and use 2 are identical to
> > each other.  This is because the vectorizer generates redundant
> > setup code in the loop pre-header.  There are two possible fixes
> > here.  One is to make expand_simple_operations (used by ivopts, in
> > tree-ssa-loop-niter.c) more aggressive in expanding.  But I don't
> > think this is a good idea in all cases, because the expanded,
> > complicated expressions make the ivopts transform and niter
> > analysis harder.
> Or something along the lines of the attached patch, tested only on
> the problem at hand.  As it stands it is probably too heavy-handed
> to consider as a possible review candidate.

Yes, I proposed some cleanup passes after vectorization, but richi
thinks that's generally expensive.  So what's the implementation
complexity of pass_dominator?

> > The other is to fix the vectorizer to generate clean code.
> > Richard's suggestion is to use gimple_build for that.
> ISTM to be the reasonable approach, but I haven't yet investigated
> what's involved.

> > Also, the problem exists only for arm because it doesn't support
> > the [base+index] addressing mode for vector loads/stores.  I guess
> > mips doesn't either.
> Right, MIPS MSA doesn't support [base+index] mode.
>
> BTW, the reason why IVOPTS works for DP but not SP on MIPS MSA is
> that the code in the pre-header is simpler for DP:
>
>   :
>   vect_cst__52 = {da_6(D), da_6(D)};
>
>   :
>   # vectp_dy.8_46 = PHI
>   # vectp_dx.11_49 = PHI
>   # vectp_dy.16_55 = PHI
>   # ivtmp_58 = PHI <0(6), ivtmp_59(12)>
>   ...
>
> which IVOPTS can handle.

Ah, so IMHO the code should be refined before IVOPTS; we shouldn't put
pressure that is not directly ivopts-related on the ivopts transform.

--- Comment #10 from Doug Gilmore ---
Created attachment 37681
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37681&action=edit
prototype fix

> 1) we failed to recognize that use 0 and use 2 are identical to each
> other.  This is because the vectorizer generates redundant setup
> code in the loop pre-header.  There are two possible fixes here.
> One is to make expand_simple_operations (used by ivopts, in
> tree-ssa-loop-niter.c) more aggressive in expanding.  But I don't
> think this is a good idea in all cases, because the expanded,
> complicated expressions make the ivopts transform and niter analysis
> harder.
Or something along the lines of the attached patch, tested only on the
problem at hand.  As it stands it is probably too heavy-handed to
consider as a possible review candidate.

> The other is to fix the vectorizer to generate clean code.
> Richard's suggestion is to use gimple_build for that.
ISTM to be the reasonable approach, but I haven't yet investigated
what's involved.

> Also, the problem exists only for arm because it doesn't support the
> [base+index] addressing mode for vector loads/stores.  I guess mips
> doesn't either.
Right, MIPS MSA doesn't support [base+index] mode.

BTW, the reason why IVOPTS works for DP but not SP on MIPS MSA is that
the code in the pre-header is simpler for DP:

  :
  vect_cst__52 = {da_6(D), da_6(D)};

  :
  # vectp_dy.8_46 = PHI
  # vectp_dx.11_49 = PHI
  # vectp_dy.16_55 = PHI
  # ivtmp_58 = PHI <0(6), ivtmp_59(12)>
  ...

which IVOPTS can handle.

--- Comment #9 from amker at gcc dot gnu.org ---
Also, the problem exists only for arm because it doesn't support the
[base+index] addressing mode for vector loads/stores.  I guess mips
doesn't either.

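[Editor's note] To illustrate the addressing-mode point (a sketch, not compiler output): with a [base+index] mode a single index IV can serve every array, while without one each array access needs its own incremented pointer, which is where the multiple vectp_* induction variables in the dumps come from. Both forms compute the same result:

```c
/* Index form: every access is base + i*4, so one shared IV suffices
   on targets with [base+index] vector addressing (e.g. aarch64). */
void saxpy_index(int n, float da, const float *dx, float *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}

/* Pointer-bump form: without [base+index], each array carries its own
   pointer that is incremented every iteration (one IV per array). */
void saxpy_bump(int n, float da, const float *dx, float *dy)
{
    const float *px = dx;
    float *py = dy;
    for (int i = 0; i < n; i++) {
        *py = *py + da * *px;
        px++;
        py++;
    }
}
```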
--- Comment #8 from amker at gcc dot gnu.org ---
Reproduced on arm with saxpy.c.  The dump for slp is as below:

  :
  _82 = prologue_after_cost_adjust.7_43 * 4;
  vectp_dy.13_81 = dy_9(D) + _82;
  _87 = prologue_after_cost_adjust.7_43 * 4;
  vectp_dx.16_86 = dx_13(D) + _87;
  vect_cst__91 = {da_6(D), da_6(D), da_6(D), da_6(D)};
  _95 = prologue_after_cost_adjust.7_43 * 4;
  vectp_dy.21_94 = dy_9(D) + _95;

  :
  # vectp_dy.12_83 = PHI
  # vectp_dx.15_88 = PHI
  # vectp_dy.20_96 = PHI
  # ivtmp_99 = PHI <0(13), ivtmp_100(21)>
  vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  vectp_dy.12_84 = vectp_dy.12_83 + 16;
  vectp_dx.15_89 = vectp_dx.15_88 + 16;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp_100 = ivtmp_99 + 1;
  if (ivtmp_100 < bnd.9_53)
    goto ;
  else
    goto ;

  :
  goto ;

IVOPTS recognized the uses below:

use 0
  address
  in statement vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  at position MEM[(float *)vectp_dy.12_83]
  type vector(4) float *
  base vectp_dy.13_81
  step 16
  base object (void *) vectp_dy.13_81
  related candidates

use 1
  generic
  in statement vectp_dx.15_88 = PHI
  at position
  type vector(4) float *
  base vectp_dx.16_86
  step 16
  base object (void *) vectp_dx.16_86
  is a biv
  related candidates

use 2
  address
  in statement MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  at position MEM[(float *)vectp_dy.20_96]
  type vector(4) float *
  base vectp_dy.21_94
  step 16
  base object (void *) vectp_dy.21_94
  related candidates

use 3
  compare
  in statement if (ivtmp_100 < bnd.9_53)
  at position
  type unsigned int
  base 1
  step 1
  is a biv
  related candidates

There are two problems:

1) We failed to recognize that use 0 and use 2 are identical to each
other.  This is because the vectorizer generates redundant setup code
in the loop pre-header.  There are two possible fixes here.  One is to
make expand_simple_operations (used by ivopts, in
tree-ssa-loop-niter.c) more aggressive in expanding.  But I don't
think this is a good idea in all cases, because the expanded,
complicated expressions make the ivopts transform and niter analysis
harder.  The other is to fix the vectorizer to generate clean code.
Richard's suggestion is to use gimple_build for that.

2) Use 1 is not recognized as an address iv because of the alignment
of that memory reference.

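[Editor's note] For reference, the saxpy kernel behind this dump has the standard BLAS unit-stride form (a sketch; the Linpack benchmark source may differ in details such as increment parameters and unrolling):

```c
/* dy[i] += da * dx[i] over n single-precision elements; this is the
   loop the vectorizer turns into the GIMPLE shown above. */
void saxpy(int n, float da, const float *dx, float *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```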
--- Comment #7 from amker at gcc dot gnu.org ---
Hmm, the first problem is that the two iv uses from the dy load/store
are not recognized as having the same base address/object.  This may
be caused by my patch disabling expansion of the iv base, or it may
have existed all along.  This isn't critical; we can transform it on
aarch64 anyway.  The other problem is more important: the dx mem ref
is not recognized as an address-type use.

amker at gcc dot gnu.org changed:

           What |Removed |Added
           ------------------------------------------------
           CC   |        |amker at gcc dot gnu.org

--- Comment #6 from amker at gcc dot gnu.org ---
I will have a look at this one.

--- Comment #5 from Doug Gilmore ---
Thanks for checking on AArch64, Andrew.

BTW, I made my (incorrect) hunch by running a test on gcc113, where
the installed 4.8 compiler showed problems for both DP and SP.  (I
assumed that the problem was addressed for DP, since we don't see it
for DP on MIPS ToT with the MSA patch applied.)

For Neon, after ivopts I see:

  :
  # vectp_dy.20_96 = PHI
  # ivtmp.22_78 = PHI <0(13), ivtmp.22_77(21)>
  # ivtmp.26_112 = PHI
  # ivtmp.31_153 = PHI
  vectp_dx.15_88 = (vector(4) float *) ivtmp.26_112;
  _156 = (void *) ivtmp.31_153;
  vect__12.14_85 = MEM[base: _156, offset: 0B];
  ivtmp.31_154 = ivtmp.31_153 + 16;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  MEM[base: vectp_dy.20_96, offset: 0B] = vect__17.19_93;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp.22_77 = ivtmp.22_78 + 1;
  ivtmp.26_111 = ivtmp.26_112 + 16;
  if (ivtmp.22_77 < bnd.9_53)
    goto ;
  else
    goto ;
  ...

  :
  goto ;

So the problem is indeed exposed on Neon.

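[Editor's note] In scalar C terms, the cleaned-up loop one would hope for here needs only a single byte-offset IV shared by the dy load, the dx load, and the dy store (a sketch of the intent, not actual GCC output):

```c
#include <stddef.h>

/* One shared byte offset addresses all three accesses; dy is loaded
   and stored through the same base, so the redundant per-access
   pointer IVs seen in the dump above are not needed. */
void saxpy_one_iv(long n, float da, const float *dx, float *dy)
{
    const char *bx = (const char *)dx;
    char *by = (char *)dy;
    for (long off = 0; off < n * (long)sizeof(float);
         off += (long)sizeof(float)) {
        float x = *(const float *)(bx + off);
        float y = *(float *)(by + off);
        *(float *)(by + off) = y + da * x;
    }
}
```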
Andrew Pinski changed:

           What      |Removed           |Added
           ---------------------------------------------------------
           Keywords  |                  |missed-optimization
           Component |tree-optimization |rtl-optimization

--- Comment #4 from Andrew Pinski ---
For me, the problem is only at the rtl level, as a scheduling issue.