[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |NEW
         Resolution|FIXED                       |---

--- Comment #10 from Richard Biener ---
(In reply to Michael Collison from comment #8)
> Hi Richard,
>
> I tried this with trunk and was unable to generate the vld3. What vectorizer
> options did you use?

Ah, I just assumed it was fixed because the patch for PR68707 was checked in.
But that patch makes the "fix" conditional on the SLP instance needing
permutations, which is not the case here.

Let's re-open then. As asked in that other PR, the question is whether
vld3/vst3 is really cheaper (it's definitely smaller code).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #11 from Michael Collison ---
Andrew,

It may be the case that this is not a win on all microarchitectures; however,
I think we should allow the vectorizer to (optionally) generate the vld3 and
deal with the differences via the cost models.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #8 from Michael Collison ---
Hi Richard,

I tried this with trunk and was unable to generate the vld3. What vectorizer
options did you use?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #9 from Andrew Pinski ---
Note that the question here is which is better: ldr/str followed by a few
mults, or ld3/st3 followed by a few shifts/adds. I think it really depends on
the micro-arch (at least for aarch32). In fact, I think ldr/str followed by a
few mults is much better for ThunderX and most likely also for Cortex-A57 (at
least that is how I read the optimization manual).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
Bug 67323 depends on bug 68707, which changed state.

Bug 68707 Summary: [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Richard Biener ---
Should be fixed now.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #6 from Richard Biener ---
Author: rguenth
Date: Mon Dec 14 15:33:20 2015
New Revision: 231620

URL: https://gcc.gnu.org/viewcvs?rev=231620&root=gcc&view=rev
Log:
2015-12-10  Richard Biener

        PR tree-optimization/68707
        PR tree-optimization/67323
        * tree-vect-slp.c (vect_analyze_slp_instance): Drop SLP instances
        if they can be vectorized using load/store-lane instructions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-slp.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
Bug 67323 depends on bug 66721, which changed state.

Bug 66721 Summary: [6 regression] gcc.target/i386/pr61403.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #5 from Richard Biener ---
I note that the efficiency you gain comes only from a reduced number of
load/store instructions: one vld3 instead of six vldr (huh, apparently vld3
can load 16-byte vectors but vldr only 8-byte ones?). I assume vld3 has no
penalty for the lane-split itself, so the code-size reduction is always wanted.

Thus we'd want to always use a lane load/store, even if the permutation is
pointless, as soon as we would otherwise issue more than one SLP load, say for

void
t5 (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i += 8)
    {
      p[i] = q[i] * 2;
      p[i+1] = q[i+1] * 2;
      p[i+2] = q[i+2] * 2;
      p[i+3] = q[i+3] * 2;
      p[i+4] = q[i+4] * 2;
      p[i+5] = q[i+5] * 2;
      p[i+6] = q[i+6] * 2;
      p[i+7] = q[i+7] * 2;
    }
}

instead of

.L4:
        vldr    d18, [r2, #-16]
        vldr    d19, [r2, #-8]
        vldr    d16, [r2, #-32]
        vldr    d17, [r2, #-24]
        vshl.i32        q9, q9, #1
        vshl.i32        q8, q8, #1
        add     r3, r3, #1
        cmp     r0, r3
        vstr    d18, [r1, #-16]
        vstr    d19, [r1, #-8]
        vstr    d16, [r1, #-32]
        vstr    d17, [r1, #-24]
        add     r2, r2, #32
        add     r1, r1, #32
        bhi     .L4

use vld2.32 / vst2.32?

Generally, for SLP the implicit permute performed by those instructions could
be modeled properly (and the SLP chain permuted accordingly).
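
To make the "implicit permute" point concrete, here is a minimal portable C sketch of the t5 loop from comment #5, with the lane load and lane store modeled as plain scalar loops. It is not what the vectorizer emits; `load_lanes2`, `store_lanes2`, `t5_scalar`, `t5_lanes` and `lanes_match_scalar` are hypothetical names introduced only for illustration. The assumption is that a two-way lane load splits eight consecutive ints into even-indexed and odd-indexed 4-lane vectors and that the matching lane store reverses this, so an element-wise operation in between gives the same result as the straight-line version.

```c
#include <stddef.h>

#define LANES 4 /* 32-bit lanes in one 128-bit vector register */

/* Model of a two-way lane load (assumed vld2.32-like behaviour):
   de-interleave 8 consecutive ints into two 4-lane vectors.  */
static void load_lanes2(const int *src, int even[LANES], int odd[LANES])
{
    for (int l = 0; l < LANES; l++) {
        even[l] = src[2 * l];
        odd[l] = src[2 * l + 1];
    }
}

/* Model of a two-way lane store (assumed vst2.32-like behaviour):
   re-interleave two 4-lane vectors into 8 consecutive ints.  */
static void store_lanes2(int *dst, const int even[LANES], const int odd[LANES])
{
    for (int l = 0; l < LANES; l++) {
        dst[2 * l] = even[l];
        dst[2 * l + 1] = odd[l];
    }
}

/* Scalar reference, equivalent to the unrolled body of t5.  */
static void t5_scalar(int len, int *restrict p, const int *restrict q)
{
    for (int i = 0; i < len; i += 8)
        for (int k = 0; k < 8; k++)
            p[i + k] = q[i + k] * 2;
}

/* Same loop expressed as lane load, element-wise multiply, lane store.
   The permute done by the load is undone by the store.  */
static void t5_lanes(int len, int *restrict p, const int *restrict q)
{
    for (int i = 0; i < len; i += 8) {
        int even[LANES], odd[LANES];
        load_lanes2(q + i, even, odd);
        for (int l = 0; l < LANES; l++) {
            even[l] *= 2;
            odd[l] *= 2;
        }
        store_lanes2(p + i, even, odd);
    }
}

/* Returns 1 if both variants produce identical output on a small input.  */
static int lanes_match_scalar(void)
{
    int q[16], p_ref[16], p_lanes[16];
    for (int i = 0; i < 16; i++)
        q[i] = i - 5;
    t5_scalar(16, p_ref, q);
    t5_lanes(16, p_lanes, q);
    for (int i = 0; i < 16; i++)
        if (p_ref[i] != p_lanes[i])
            return 0;
    return 1;
}
```

Calling `lanes_match_scalar()` from a test driver checks the equivalence. Because the operation here is purely element-wise, the lane order inside the registers does not matter, which is why the permutation is described above as pointless for this loop.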
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                       |ASSIGNED
   Last reconfirmed|                                  |2015-08-25
                 CC|richard.guenther at gmail dot com |rguenth at gcc dot gnu.org
         Depends on|                                  |66721
           Assignee|unassigned at gcc dot gnu.org     |rguenth at gcc dot gnu.org
     Ever confirmed|0                                 |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. We go down the SLP path here because the vectorizer thinks that
SLP is always cheaper than using interleaving (which would generally be true
if there were no targets that can do the load plus interleave with
load-lanes ...).

I think this may be a regression as well, because I enhanced SLP to apply to
many more cases.

Note that my plan is to make the vectorizer consider both SLP and non-SLP
(well, not really, but this bug shows I maybe should try) and decide which
route to take based on costs.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721
[Bug 66721] [6 regression] gcc.target/i386/pr61403.c FAILs
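
As a rough illustration of "evaluate based on costs which route to go", the following C sketch compares two candidate strategies for one iteration of the t5 loop in comment #5. It is a toy model, not GCC's vectorizer cost model: the structs, `strategy_cost` and `prefer_lanes_for_t5` are hypothetical, and the per-instruction costs are made-up parameters. The SLP counts (4 loads, 4 stores, 2 shifts) are read off the assembly in comment #5; the load/store-lanes counts (1 load, 1 store, 2 shifts) are an assumption about what a vld2.32/vst2.32 version would need.

```c
/* Hypothetical vector instruction counts for one loop iteration.  */
struct strategy_counts {
    unsigned loads;
    unsigned stores;
    unsigned alu;
};

/* Hypothetical per-instruction costs on some microarchitecture.  */
struct insn_costs {
    unsigned load;
    unsigned store;
    unsigned alu;
};

/* Weighted sum of instruction counts.  */
static unsigned strategy_cost(struct strategy_counts n, struct insn_costs c)
{
    return n.loads * c.load + n.stores * c.store + n.alu * c.alu;
}

/* Returns 1 if the load/store-lanes variant of t5 is strictly cheaper than
   the SLP variant, given the assumed cost of one lane load or lane store.
   Plain vector loads, stores and ALU operations are assumed to cost 1.  */
static int prefer_lanes_for_t5(unsigned lane_mem_cost)
{
    /* Counts taken from the assembly shown in comment #5.  */
    const struct strategy_counts slp = { 4, 4, 2 };
    /* Assumed counts for a vld2.32 / vst2.32 version.  */
    const struct strategy_counts lanes = { 1, 1, 2 };

    const struct insn_costs plain = { 1, 1, 1 };
    const struct insn_costs lane = { lane_mem_cost, lane_mem_cost, 1 };

    return strategy_cost(lanes, lane) < strategy_cost(slp, plain);
}
```

With a lane access costing the same as a plain access, the lanes variant wins on instruction count alone; once a lane access is assumed to cost several times more, the SLP variant wins. That dependence on per-target costs is what the later comments in this thread debate.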
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
Richard,

Should I create a test case that fails until you resolve this in GCC 6?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 25 Aug 2015, michael.collison at linaro dot org wrote:

> --- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
> Richard,
>
> Should I create a test case that fails until you resolve this in GCC 6?

If you can provide one that I can check in together with a fix, that would be
nice. Having it in the tree now and FAILing is not in line with our policies.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #4 from Michael Collison <michael.collison at linaro dot org> ---
Hi Richard,

No, I do not have a fix now. Thanks for the info on the policy.