[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-30 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #20 from Christophe Lyon  ---
(In reply to Richard Biener from comment #19)

> whatever reason for.  So the testcase would need adjustment with hw_misalign.

check_effective_target_vect_hw_misalign needs to be updated to include arm
targets.

I guess I should add arm-*-*, such that vect-strided-a-u8-i2-gap.c (after
adding dg-require-effective-target vect_hw_misalign) still PASSes for arm-* and
is UNSUPPORTED for armeb-*.

I'm just wondering whether I should instead add arm*-*-*, if we want to
treat the armeb behavior as a bug.
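The choice between the two patterns comes down to glob matching on the target triplet. Effective-target checks in the testsuite typically test the triplet with DejaGnu's `istarget`, which uses Tcl's glob-style `string match`. The C sketch below uses POSIX `fnmatch()` as a stand-in for that matching (an approximation, not DejaGnu itself; `target_matches` is a hypothetical helper) to show which triplets each pattern selects:

```c
#include <fnmatch.h>

/* Returns 1 if TRIPLET matches the glob PATTERN, 0 otherwise.
   fnmatch() stands in for Tcl's "string match" here; for patterns
   built only from literal characters and '*' the two agree. */
static int target_matches(const char *pattern, const char *triplet)
{
    return fnmatch(pattern, triplet, 0) == 0;
}
```

With these semantics, `arm-*-*` matches `arm-none-linux-gnueabihf` but not `armeb-none-linux-gnueabihf` (the literal `-` after `arm` fails against `e`), while `arm*-*-*` matches both, because its first `*` can absorb `eb`.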


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-29 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #16 from Christophe Lyon  ---
Created attachment 36614
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36614&action=edit
Vectorizer dump for little endian


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-29 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #18 from Christophe Lyon  ---
I've attached vectorizer dumps for LE and BE from trunk@229448.


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-29 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #17 from Christophe Lyon  ---
Created attachment 36615
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36615&action=edit
Vectorizer dump for big endian


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #19 from Richard Biener  ---
Ok, so the difference is that BE doesn't support unaligned vector loads while
LE does. The targetm.vectorize.support_vector_misalignment hook has

static bool
arm_builtin_support_vector_misalignment (machine_mode mode,
                                         const_tree type, int misalignment,
                                         bool is_packed)
{
  if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
    {

whatever reason for.  So the testcase would need adjustment with hw_misalign.
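The quoted guard can be read as a predicate over three target properties. The C sketch below models only that first `if`, since the rest of the hook is not quoted here; the function name and its boolean parameters are hypothetical stand-ins for `TARGET_NEON`, `BYTES_BIG_ENDIAN` and `unaligned_access`, not the real hook:

```c
#include <stdbool.h>

/* Hypothetical model of the guard quoted from
   arm_builtin_support_vector_misalignment: the NEON path that can
   accept misaligned vector accesses is entered only when all three
   conditions hold.  What the hook does otherwise is not modeled. */
static bool neon_misalign_path_taken(bool target_neon,
                                     bool bytes_big_endian,
                                     bool unaligned_access)
{
    return target_neon && !bytes_big_endian && unaligned_access;
}
```

For the armeb configuration above (NEON enabled, big-endian), the predicate is false regardless of `unaligned_access`, which is consistent with BE not getting the unaligned vector loads that LE gets.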


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-29 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 28 Oct 2015, clyon at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962
> 
> Christophe Lyon  changed:
> 
>            What    |Removed                     |Added
> 
>                  CC|                            |clyon at gcc dot gnu.org
> 
> --- Comment #14 from Christophe Lyon  ---
> After r229172, I am also seeing:
> FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 2 loops" 1
> FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect "vectorized 2 loops" 1
> 
> on target armeb-none-linux-gnueabihf
> --with-cpu=cortex-a9
> --with-fpu=neon-fp16
> 
> 
> The test passes on arm-none-linux-gnueabihf

Can you please attach the vectorizer dump for the failing case?
Not sure what should be special about big-endian here...


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #13 from Richard Biener  ---
Author: rguenth
Date: Wed Oct 28 10:09:37 2015
New Revision: 229481

URL: https://gcc.gnu.org/viewcvs?rev=229481&root=gcc&view=rev
Log:
2015-10-28  Richard Biener  

	PR tree-optimization/65962
	* tree-ssa-pre.c (eliminate_dom_walker::before_dom_children):
	Avoid creating loop carried dependences also for outer loops
	of the loop a use to replace is in.

	* gcc.dg/vect/vect-62.c: Adjust.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/vect/vect-62.c
trunk/gcc/tree-ssa-pre.c


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-28 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

Christophe Lyon  changed:

           What    |Removed                     |Added

                 CC|                            |clyon at gcc dot gnu.org

--- Comment #14 from Christophe Lyon  ---
After r229172, I am also seeing:
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 2 loops" 1
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect "vectorized 2 loops" 1

on target armeb-none-linux-gnueabihf
--with-cpu=cortex-a9
--with-fpu=neon-fp16


The test passes on arm-none-linux-gnueabihf


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-27 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #12 from Bill Schmidt  ---
Ah, great!  More vectorization is good. :)  Thanks for looking into this so
quickly!

Bill


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #11 from Richard Biener  ---
The main difference is

+/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:39:3: note: LOOP VECTORIZED
+/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:39:3: note: OUTER LOOP VECTORIZED
...
-/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:9:5: note: vectorized 1 loops in function.
+/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:9:5: note: vectorized 2 loops in function.

so we now vectorize two loops.  The newly vectorized loop is

  /* Multidimensional array. Aligned. The "inner" dimensions
     are invariant in the inner loop. Vectorizable, but the
     vectorizer detects that everything is invariant and that
     the loop is better left untouched. (it should be optimized away). */
  for (i = 0; i < N; i++)
    {
      for (j = 0; j < N; j++)
        {
          ia[i][1][8] = ib[i];
        }
    }

on x86_64 the latch block is not empty - for some reason not so on ppc.
I suspect that if we had a cddce pass after loop invariant/store motion
(which should make the inner loop empty) we'd even remove the inner loop
and vectorize this regularly.
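The reasoning above (store motion would leave the inner loop empty, after which it could be deleted) can be checked on a small model. This is a sketch with hypothetical sizes: `N = 16` and an `ia` shape chosen only so that `ia[i][1][8]` is in bounds, since vect-62.c has its own definitions. `ib` is filled with `3*i`, assuming the `0, 3, 6, 9` pattern of the element-wise initialization quoted below continues:

```c
#include <string.h>

#define N 16                 /* hypothetical; vect-62.c defines its own N */

static int ia[N][2][16];     /* hypothetical shape: only ia[i][1][8] is used */
static int ib[N];

/* The quoted loop: every j iteration stores the same value to the
   same address, so the inner loop is invariant. */
static void original_loop(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            ia[i][1][8] = ib[i];
}

/* What remains once the invariant store is moved out of the inner
   loop and the then-empty inner loop is removed. */
static void after_store_motion(void)
{
    for (int i = 0; i < N; i++)
        ia[i][1][8] = ib[i];
}

/* Runs both versions from the same starting state and reports whether
   they leave ia in the same final state. */
static int versions_agree(void)
{
    static int snapshot[N][2][16];

    for (int i = 0; i < N; i++)
        ib[i] = 3 * i;

    memset(ia, 0, sizeof ia);
    original_loop();
    memcpy(snapshot, ia, sizeof ia);

    memset(ia, 0, sizeof ia);
    after_store_motion();
    return memcmp(snapshot, ia, sizeof ia) == 0;
}
```

The second form is an ordinary single loop with one store per `i`, which is the "vectorize this regularly" case.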

Ah, so on x86_64 we PREd ib[0] while on ppc the ib initializer is probably
in a constant pool entry.  Yes:

  :
  ib = *.LC0;

vs.

  :
  ib[0] = 0;
  ib[1] = 3;
  ib[2] = 6;
  ib[3] = 9;
...

The PRE heuristic to not confuse vectorization doesn't fire here.

I have a fix for that (and the testcase).


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #7 from Richard Biener  ---
(In reply to Bill Schmidt from comment #6)
> This commit (r229172) caused a vectorization failure for POWER:
> 
> FAIL: gcc.dg/vect/vect-62.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/vect-62.c scan-tree-dump-times vect "vectorized 1 loops" 1
> 
> Seems like an odd result, but that's what the bisection shows...

Huh.  Can you please attach vectorizer dumps before/after?


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-26 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #8 from Bill Schmidt  ---
Created attachment 36589
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36589&action=edit
Vectorization dump for r229171


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-26 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #9 from Bill Schmidt  ---
Created attachment 36590
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36590&action=edit
Vectorization dump for r229172


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-26 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #10 from Bill Schmidt  ---
Above are the before-and-after vectorization dumps.  I haven't looked at them
in any detail myself yet.


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-23 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

Bill Schmidt  changed:

           What    |Removed                     |Added

                 CC|                            |bergner at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org

--- Comment #6 from Bill Schmidt  ---
This commit (r229172) caused a vectorization failure for POWER:

FAIL: gcc.dg/vect/vect-62.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/vect-62.c scan-tree-dump-times vect "vectorized 1 loops" 1

Seems like an odd result, but that's what the bisection shows...


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

Richard Biener  changed:

           What    |Removed                       |Added

             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
While strided stores are now implemented the case is still not handled because
single-element interleaving takes precedence (and single-element interleaving
isn't supported for stores as that always produces gaps).
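To make the gap concrete, here is a minimal sketch assuming a loop of the shape `a[2*i] += 7` over `int`. That shape is inferred from the assembly in this comment, which touches every other 4-byte element, and the function names are hypothetical. The accesses form groups {a[2*i], a[2*i+1]} of which only the first member is ever stored, so a vector store covering the whole group would also overwrite the untouched odd elements:

```c
/* Assumed loop shape (hypothetical): updates every other int. */
static void scalar_loop(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[2 * i] += 7;
}

/* Counts the gap elements (odd indices) that differ between BEFORE and
   AFTER.  The group for iteration i is {a[2*i], a[2*i+1]}; only the
   first member is written, so for the scalar loop this is 0. */
static int gap_elements_changed(const int *before, const int *after, int n)
{
    int changed = 0;
    for (int i = 0; i < n; i++)
        changed += before[2 * i + 1] != after[2 * i + 1];
    return changed;
}
```

Any vectorized version has to preserve that property, which a strided store (one scalar store per lane) does by construction.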

I have a patch that produces

.L2:
movdqu  16(%rax), %xmm1
addq    $32, %rax
movdqu  -32(%rax), %xmm0
shufps  $136, %xmm1, %xmm0
paddd   %xmm2, %xmm0
pshufd  $85, %xmm0, %xmm1
movd    %xmm0, -32(%rax)
movd    %xmm1, -24(%rax)
movdqa  %xmm0, %xmm1
punpckhdq   %xmm0, %xmm1
pshufd  $255, %xmm0, %xmm0
movd    %xmm1, -16(%rax)
movd    %xmm0, -8(%rax)
cmpq    %rdx, %rax
jne     .L2

when you disable the cost model.  Otherwise it's deemed not profitable.  Using
scatters for AVX could in theory make it profitable (not sure).

t.c:5:3: note: Cost model analysis:
  Vector inside of loop cost: 13
  Vector prologue cost: 1
  Vector epilogue cost: 12
  Scalar iteration cost: 3
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 0
  epilogue iterations: 4
t.c:5:3: note: cost model: the vector iteration cost = 13 divided by the scalar iteration cost = 3 is greater or equal to the vectorization factor = 4.
t.c:5:3: note: not vectorized: vectorization not profitable.
t.c:5:3: note: not vectorized: vector version will never be profitable.
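The rejection in the dump comes down to one line of arithmetic: 13 / 3 is about 4.33, which is greater than or equal to the vectorization factor 4. Below is a simplified C model of that check as the dump words it; the function name is hypothetical and this is not GCC's implementation:

```c
#include <stdbool.h>

/* Simplified model of the reported check: the loop is rejected when
   vector_iter_cost / scalar_iter_cost >= vf.  Multiplying through by
   the (positive) scalar cost avoids integer-division truncation. */
static bool rejected_by_iteration_cost(int vector_iter_cost,
                                       int scalar_iter_cost, int vf)
{
    return vector_iter_cost >= scalar_iter_cost * vf;
}
```

For the numbers above, 13 >= 3 * 4 = 12, so the loop is rejected. As a hypothetical what-if, assuming the strided store's cost of 8 is included in the inside cost of 13 and that a combined extract-and-store were costed at 1 per lane, the inside cost would become 13 - 8 + 4 = 9, which passes this particular check (9 < 12). The dump does not show whether the rest of the cost model would then accept the loop.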

t.c:5:3: note: ==> examining statement: *_8 = _10;
t.c:5:3: note: vect_is_simple_use: operand _10
t.c:5:3: note: def_stmt: _10 = _9 + 7;
t.c:5:3: note: type of def: internal
t.c:5:3: note: vect_model_store_cost: inside_cost = 8, prologue_cost = 0 .

so the strided store has cost 8, that's 4 extracts plus 4 scalar stores.
With AVX we generate

vmovd   %xmm0, -32(%rax)
vpextrd $1, %xmm0, -24(%rax)
vpextrd $2, %xmm0, -16(%rax)
vpextrd $3, %xmm0, -8(%rax)

so it can combine extract and store, with SSE2 we get

pshufd  $85, %xmm0, %xmm1
movd    %xmm0, -32(%rax)
movd    %xmm1, -24(%rax)
movdqa  %xmm0, %xmm1
punpckhdq   %xmm0, %xmm1
pshufd  $255, %xmm0, %xmm0
movd    %xmm1, -16(%rax)
movd    %xmm0, -8(%rax)

which is even worse than expected ;)

As usual, the cost model isn't target-aware enough here (and it errs on the
conservative side).
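The patched SSE2 loop can be modeled in C with GCC's generic vector extension (also accepted by Clang). This is a sketch, assuming the source loop is `a[2*i] += 7` over `int` as the assembly suggests; all function names are hypothetical. The comments map each step to the instruction it mirrors, and the last four statements of the loop body are the extract-plus-scalar-store sequence that the store cost of 8 accounts for:

```c
#include <string.h>

/* GCC/Clang vector extension: four 32-bit ints in one 16-byte vector. */
typedef int v4si __attribute__((vector_size(16)));

/* Four results per iteration; assumes n % 4 == 0 and a has 2*n ints. */
static void strided_store_emulated(int *a, int n)
{
    for (int i = 0; i < n; i += 4) {
        int *p = a + 2 * i;
        v4si lo, hi;
        memcpy(&lo, p, sizeof lo);       /* movdqu: a[2i .. 2i+3]   */
        memcpy(&hi, p + 4, sizeof hi);   /* movdqu: a[2i+4 .. 2i+7] */

        /* shufps $136: keep the even-indexed elements of both loads. */
        v4si even = { lo[0], lo[2], hi[0], hi[2] };
        even += (v4si){ 7, 7, 7, 7 };    /* paddd */

        /* Strided store: one lane extract plus one scalar store each. */
        p[0] = even[0];
        p[2] = even[1];
        p[4] = even[2];
        p[6] = even[3];
    }
}

/* Scalar reference for the assumed loop. */
static void scalar_reference(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[2 * i] += 7;
}

/* Runs both on identical input and compares the results. */
static int matches_scalar_reference(void)
{
    int x[32], y[32];
    for (int k = 0; k < 32; k++)
        x[k] = y[k] = k;
    strided_store_emulated(x, 16);
    scalar_reference(y, 16);
    return memcmp(x, y, sizeof x) == 0;
}
```

Written this way, the cost asymmetry is visible: the loads and the add are whole-vector operations, while the store side needs four separate lane extracts and four scalar stores.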


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #5 from Richard Biener  ---
Author: rguenth
Date: Thu Oct 22 13:33:17 2015
New Revision: 229172

URL: https://gcc.gnu.org/viewcvs?rev=229172&root=gcc&view=rev
Log:
2015-10-22  Richard Biener  

	PR tree-optimization/19049
	PR tree-optimization/65962
	* tree-vect-data-refs.c (vect_analyze_group_access_1): Fall back
	to strided accesses if single-element interleaving doesn't work.

	* gcc.dg/vect/vect-strided-store-pr65962.c: New testcase.
	* gcc.dg/vect/vect-63.c: Adjust.
	* gcc.dg/vect/vect-70.c: Likewise.
	* gcc.dg/vect/vect-strided-u8-i2-gap.c: Likewise.
	* gcc.dg/vect/vect-strided-a-u8-i2-gap.c: Likewise.
	* gfortran.dg/vect/pr19049.f90: Likewise.
	* gfortran.dg/vect/vect-8.f90: Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/vect/vect-63.c
trunk/gcc/testsuite/gcc.dg/vect/vect-70.c
trunk/gcc/testsuite/gcc.dg/vect/vect-strided-a-u8-i2-gap.c
trunk/gcc/testsuite/gcc.dg/vect/vect-strided-u8-i2-gap.c
trunk/gcc/testsuite/gfortran.dg/vect/pr19049.f90
trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
trunk/gcc/tree-vect-data-refs.c


[Bug middle-end/65962] Missed vectorization of strided stores

2015-10-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

Richard Biener  changed:

           What    |Removed                     |Added

             Status|ASSIGNED                    |RESOLVED
      Known to work|                            |6.0
         Resolution|---                         |FIXED

--- Comment #4 from Richard Biener  ---
Fixed for GCC 6.


[Bug middle-end/65962] Missed vectorization of strided stores

2015-05-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                     |Added

             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-05-04
                 CC|                            |matz at gcc dot gnu.org
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener rguenth at gcc dot gnu.org ---
I believe Micha has patches for this?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


[Bug middle-end/65962] Missed vectorization of strided stores

2015-05-01 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

--- Comment #1 from alalaw01 at gcc dot gnu.org ---
I believe this is a known issue, but have not identified an existing PR.