[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2021-05-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2016-01-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |NEW
         Resolution|FIXED                       |---

--- Comment #10 from Richard Biener ---
(In reply to Michael Collison from comment #8)
> Hi Richard,
> 
> I tried this with trunk and was unable to generate the vld3. What vectorizer
> options did you use?

Ah, I just assumed it was fixed because the patch for PR68707 was checked in.
But that conditions the "fix" on the SLP needing permutations which doesn't
trigger here.

Let's re-open then.

As asked in that other PR the question is if vld3/vst3 is really cheaper
(it's definitely smaller code).

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2016-01-14 Thread michael.collison at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #11 from Michael Collison ---
Andrew,

It may be the case that this is not a win on all microarchitectures; however, I
think we should allow the vectorizer to (optionally) generate the vld3 and deal
with the differences via the cost models.

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2016-01-13 Thread michael.collison at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #8 from Michael Collison ---
Hi Richard,

I tried this with trunk and was unable to generate the vld3. What vectorizer
options did you use?

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2016-01-13 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #9 from Andrew Pinski ---
Note the question here is which is better: ldr/str followed by a few
mults, or ld3/st3 followed by a few shifts/adds.  I think it depends on the
micro-arch really (at least for aarch32).  In fact I think ldr/str followed by
a few mults is much better for ThunderX and most likely also Cortex-A57 (at
least that is how I read the optimization manual).
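
To make the trade-off concrete, here is a minimal C sketch of the two
formulations being compared.  The loop shape, the function names and the
per-channel constants 2, 3 and 4 are assumptions made up for illustration;
they are not the test case from this bug, and nothing here claims what GCC
actually emits for either form.  The first function keeps the data
contiguous and needs a multiply by a per-lane constant; the second works
channel by channel (the view a 3-element structure load would give), where
each constant is uniform and can be done with shifts and adds.

```c
#include <stddef.h>

/* Contiguous view: elements stay interleaved, so each lane needs its own
   constant and the natural operation is a multiply.  Constants are
   hypothetical.  */
void scale_contiguous(int n, unsigned *restrict dst,
                      const unsigned *restrict src)
{
    static const unsigned k[3] = { 2, 3, 4 };
    for (int i = 0; i < 3 * n; i++)
        dst[i] = src[i] * k[i % 3];
}

/* De-interleaved view: each channel is handled on its own, so the
   constant is uniform per channel and reduces to shifts and adds.  */
void scale_per_channel(int n, unsigned *restrict dst,
                       const unsigned *restrict src)
{
    for (int i = 0; i < n; i++) {
        unsigned a = src[3 * i];
        unsigned b = src[3 * i + 1];
        unsigned c = src[3 * i + 2];
        dst[3 * i]     = a << 1;        /* a * 2 */
        dst[3 * i + 1] = (b << 1) + b;  /* b * 3 */
        dst[3 * i + 2] = c << 2;        /* c * 4 */
    }
}
```

Both functions compute the same result for every input because unsigned
arithmetic wraps; which one is faster is exactly the micro-architecture
question raised above.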

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2016-01-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
Bug 67323 depends on bug 68707, which changed state.

Bug 68707 Summary: [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2016-01-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Richard Biener ---
Should be fixed now.

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-12-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #6 from Richard Biener ---
Author: rguenth
Date: Mon Dec 14 15:33:20 2015
New Revision: 231620

URL: https://gcc.gnu.org/viewcvs?rev=231620&root=gcc&view=rev
Log:
2015-12-10  Richard Biener  

PR tree-optimization/68707
PR tree-optimization/67323
* tree-vect-slp.c (vect_analyze_slp_instance): Drop SLP instances
if they can be vectorized using load/store-lane instructions.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-slp.c

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-11-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
Bug 67323 depends on bug 66721, which changed state.

Bug 66721 Summary: [6 regression] gcc.target/i386/pr61403.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-10-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #5 from Richard Biener ---
I note that the efficiency you gain comes only from a reduced number of
load/store instructions:  vld3 instead of six vldr (huh, apparently vld3 can
load 16 byte vectors but vldr only 8 byte ones?).  I assume vld3 has no penalty
for the lane-split itself so the code-size reduction is always wanted.
Thus we'd want to always use a lane load/store even if the permutation is
pointless as soon as we would otherwise issue more than one SLP load, say for

void
t5 (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i += 8) {
    p[i] = q[i] * 2;
    p[i+1] = q[i+1] * 2;
    p[i+2] = q[i+2] * 2;
    p[i+3] = q[i+3] * 2;
    p[i+4] = q[i+4] * 2;
    p[i+5] = q[i+5] * 2;
    p[i+6] = q[i+6] * 2;
    p[i+7] = q[i+7] * 2;
  }
}

instead of

.L4:
        vldr    d18, [r2, #-16]
        vldr    d19, [r2, #-8]
        vldr    d16, [r2, #-32]
        vldr    d17, [r2, #-24]
        vshl.i32        q9, q9, #1
        vshl.i32        q8, q8, #1
        add     r3, r3, #1
        cmp     r0, r3
        vstr    d18, [r1, #-16]
        vstr    d19, [r1, #-8]
        vstr    d16, [r1, #-32]
        vstr    d17, [r1, #-24]
        add     r2, r2, #32
        add     r1, r1, #32
        bhi     .L4

use vld2.32 / vst2.32?  Generally for SLP the implicit permute performed
by those instructions could be modeled properly (and the SLP chain
permuted accordingly).
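
As a rough illustration of that implicit permute, the sketch below models a
2-element structure load and store in scalar C and rewrites the t5 loop on
top of it.  The helper names load_lanes2 and store_lanes2 and the function
t5_lanes are hypothetical, and the code only models the even/odd element
split; it says nothing about register allocation or about what the
vectorizer would generate.

```c
#include <stddef.h>

/* Scalar model of a 2-element structure load of eight ints:
   even-indexed elements go to v0, odd-indexed elements to v1.  */
static void load_lanes2(const int *src, int v0[4], int v1[4])
{
    for (size_t k = 0; k < 4; k++) {
        v0[k] = src[2 * k];
        v1[k] = src[2 * k + 1];
    }
}

/* Matching store: re-interleaves v0 and v1 into eight ints.  */
static void store_lanes2(int *dst, const int v0[4], const int v1[4])
{
    for (size_t k = 0; k < 4; k++) {
        dst[2 * k] = v0[k];
        dst[2 * k + 1] = v1[k];
    }
}

/* t5 expressed with the lane model: one load and one store per
   iteration.  Like t5, it assumes len is a multiple of 8.  */
void t5_lanes(int len, int *restrict p, const int *restrict q)
{
    for (int i = 0; i < len; i += 8) {
        int v0[4], v1[4];
        load_lanes2(q + i, v0, v1);
        for (int k = 0; k < 4; k++) {
            v0[k] *= 2;
            v1[k] *= 2;
        }
        store_lanes2(p + i, v0, v1);
    }
}
```

Because the same operation is applied to every element, the split done by
the load is undone by the store and the result matches t5; the permutation
is "pointless" in exactly the sense described above.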


[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-08-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------------
             Status|UNCONFIRMED                       |ASSIGNED
   Last reconfirmed|                                  |2015-08-25
                 CC|richard.guenther at gmail dot com |rguenth at gcc dot gnu.org
         Depends on|                                  |66721
           Assignee|unassigned at gcc dot gnu.org     |rguenth at gcc dot gnu.org
     Ever confirmed|0                                 |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  We go down the SLP path here because the vectorizer thinks that
SLP is always cheaper than using interleaving (which generally would be true
if there were no targets that can do the load plus interleave with
load-lanes ...).

I think this may be a regression as well because I enhanced SLP to apply
to way more cases.

Note that my plan is to make the vectorizer consider both (well, not really,
but this bug shows I maybe should try), SLP and non-SLP, and evaluate based
on costs which route to go.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66721
[Bug 66721] [6 regression] gcc.target/i386/pr61403.c FAILs


[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-08-25 Thread michael.collison at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
Richard,

Should I create a test case that fails until you resolve this in GCC 6?

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-08-25 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 25 Aug 2015, michael.collison at linaro dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
> 
> --- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
> Richard,
> 
> Should I create a test case that fails until you resolve this in GCC 6?

If you can provide one that I can check in together with a fix that
would be nice.  Having it in the tree now and FAILing isn't according
to our policies.

[Bug tree-optimization/67323] Use non-unit stride loads by preference when applicable

2015-08-25 Thread michael.collison at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #4 from Michael Collison <michael.collison at linaro dot org> ---
Hi Richard,

No, I do not have a fix now. Thanks for the info on the policy.
