[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #11 from rguenther at suse dot de  ---
On Fri, 23 Apr 2021, andysem at mail dot ru wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
> 
> --- Comment #10 from andysem at mail dot ru ---
> Thanks. Will this be backported to 10 and 11 branches?

I don't plan to, since as far as I know it isn't a regression.  The patch
doesn't apply to GCC 10, so definitely not there.  I'll consider it
for GCC 11.

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-23 Thread andysem at mail dot ru via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #10 from andysem at mail dot ru ---
Thanks. Will this be backported to 10 and 11 branches?

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-23 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
      Known to work|                            |12.0

--- Comment #9 from Richard Biener  ---
Fixed for GCC 12.

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #8 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:700e542971251b11623cce877075567815f72965

commit r12-79-g700e542971251b11623cce877075567815f72965
Author: Richard Biener 
Date:   Fri Apr 9 09:35:51 2021 +0200

tree-optimization/99971 - improve BB vect dependence analysis

We can use TBAA even when we have a DR, do so.  For the testcase
that means fully vectorizing it instead of only vectorizing
the first store group resulting in suboptimal code.

2021-04-09  Richard Biener  

PR tree-optimization/99971
* tree-vect-data-refs.c (vect_slp_analyze_node_dependences):
Always use TBAA for loads.

* g++.dg/vect/slp-pr99971.cc: New testcase.

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-15 Thread david.bolvansky at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Dávid Bolvanský  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |david.bolvansky at gmail dot com

--- Comment #7 from Dávid Bolvanský  ---
Still bad for -O3 -march=skylake-avx512

https://godbolt.org/z/azb8aTG43

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-15 Thread andysem at mail dot ru via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #6 from andysem at mail dot ru ---
Hmm, it looks like the original code has changed enough so that the problem no
longer reproduces, with or without __restrict__. I don't have the older version
of the code, so I can't tell what changed exactly. Data alignment most probably
did change, but data layout of A (its equivalent in the original code) as well
as the operation on it certainly didn't. Sorry for the confusion.

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #5 from Richard Biener  ---
(In reply to Richard Biener from comment #4)
> (In reply to andysem from comment #3)
> > I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> > original larger code base and it didn't help. The compiler (gcc 10.2) would
> > still generate the same half-vectorized code.
> 
> Hmm, that's odd.  I suppose the equivalent of test() was inlined in the
> larger code base?
> 
> I'd be interested in preprocessed source of a translation unit that exhibits
> this issue (and a pointer to the point in the source that is relevant).
> 
> Note for GCC 12 I have a patch to improve things w/o requiring the use
> of __restrict (and I'm curious on whether that helps for the larger code
> base).

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567805.html

is the patch which applies to current master.

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #4 from Richard Biener  ---
(In reply to andysem from comment #3)
> I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> original larger code base and it didn't help. The compiler (gcc 10.2) would
> still generate the same half-vectorized code.

Hmm, that's odd.  I suppose the equivalent of test() was inlined in the
larger code base?

I'd be interested in preprocessed source of a translation unit that exhibits
this issue (and a pointer to the point in the source that is relevant).

Note for GCC 12 I have a patch to improve things w/o requiring the use
of __restrict (and I'm curious on whether that helps for the larger code base).

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-15 Thread andysem at mail dot ru via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #3 from andysem at mail dot ru ---
I tried adding __restrict__ to the equivalents of x, y1 and y2 in the original
larger code base and it didn't help. The compiler (gcc 10.2) would still
generate the same half-vectorized code.

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener  changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|                              |missed-optimization
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
   Last reconfirmed|                              |2021-04-09
             Status|UNCONFIRMED                   |ASSIGNED

--- Comment #2 from Richard Biener  ---
Confirmed.  While we manage to analyze for the "perfect" solution, we fail
because dependence testing doesn't handle one piece, which throws away half
of the vectorization.  We do actually see that we'll retain the scalar
loads and computations, but doing two vector loads, a vector add and a
vector store still seems cheaper than doing four scalar stores:

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note:  Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note:  Basic block will be vectorized using SLP
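
The two totals in the cost model summary can be re-derived from the per-statement lines of the dump. The following is only a sketch of that arithmetic: the constant names are made up for illustration, and the numbers are copied from the dump rather than taken from GCC's cost tables.

```cpp
// Per-statement costs as printed in the dump above (names are illustrative).
constexpr int unaligned_load  = 12;
constexpr int unaligned_store = 12;
constexpr int vector_stmt     = 4;
constexpr int scalar_store    = 12;

// Vector side: two unaligned loads, one vector add, one unaligned store.
constexpr int vector_body_cost = 2 * unaligned_load + vector_stmt + unaligned_store;

// Scalar side: the four scalar stores that the vector store replaces.
constexpr int scalar_body_cost = 4 * scalar_store;

static_assert(vector_body_cost == 40, "matches 'Vector inside of basic block cost'");
static_assert(scalar_body_cost == 48, "matches 'Scalar cost of basic block'");
static_assert(vector_body_cost < scalar_body_cost, "so the vector form is chosen");
```

Since 40 < 48, the partially vectorized form is considered profitable even though the scalar loads and adds are kept.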

now, fortunately GCC 11 will improve on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
.cfi_startproc
movdqu  (%rsi), %xmm0
movdqu  (%rdi), %xmm1
paddd   %xmm1, %xmm0
movups  %xmm0, (%rdi)
movd    %xmm0, %eax
subl    (%rdx), %eax
movl    %eax, (%rdi)
pextrd  $1, %xmm0, %eax
subl    4(%rdx), %eax
movl    %eax, 4(%rdi)
pextrd  $2, %xmm0, %eax
subl    8(%rdx), %eax
movl    %eax, 8(%rdi)
pextrd  $3, %xmm0, %eax
subl    12(%rdx), %eax
movl    %eax, 12(%rdi)
ret

which is not re-doing the scalar loads/adds but instead uses the vector
result.  Still the same dependence issue is present:

t.C:16:11: missed:   can't determine dependence between y1_3(D)->b and x_2(D)->a
t.C:16:11: note:  removing SLP instance operations starting from: x_2(D)->a = _6;

the scalar code before vectorization looks like

   [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;  <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;  <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;


Using

void test(A& __restrict x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
.cfi_startproc
movdqu  (%rsi), %xmm0
movdqu  (%rdx), %xmm1
movdqu  (%rdi), %xmm2
psubd   %xmm1, %xmm0
paddd   %xmm2, %xmm0
movups  %xmm0, (%rdi)
ret

note that I think we should be able to handle the dependences even without
the __restrict annotation.
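
For readers without the original testcase, here is a minimal sketch of what the two variants might look like. It is a reconstruction under assumptions: the field names a..d and the signature come from the dumps above, while the definition of A, its operators, and the names test_plain, test_restrict, same and check_distinct are hypothetical (test_plain stands in for test() from the report).

```cpp
#include <cassert>

// Hypothetical reconstruction: four int fields, as suggested by the
// x->a .. x->d accesses and the paddd/psubd instructions in the dumps.
struct A
{
    int a, b, c, d;

    A& operator+=(A const& r) { a += r.a; b += r.b; c += r.c; d += r.d; return *this; }
    A& operator-=(A const& r) { a -= r.a; b -= r.b; c -= r.c; d -= r.d; return *this; }
};

// Plain version: y1 and y2 are allowed to refer to the same object as x.
void test_plain(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

// Annotated version: the caller promises that the object x refers to
// is not accessed through y1 or y2 inside this function.
void test_restrict(A& __restrict x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

// Field-wise comparison helper.
inline bool same(A const& l, A const& r)
{
    return l.a == r.a && l.b == r.b && l.c == r.c && l.d == r.d;
}

// For distinct objects both variants compute x + y1 - y2.
void check_distinct()
{
    A x1{1, 2, 3, 4}, x2{1, 2, 3, 4};
    A const y1{10, 20, 30, 40}, y2{5, 6, 7, 8};
    test_plain(x1, y1, y2);
    test_restrict(x2, y1, y2);
    assert(same(x1, x2));
}
```

The plain version stays well defined when the arguments alias: test_plain(z, z, z) first doubles z and then subtracts the doubled value from itself, leaving all fields zero. Calling test_restrict that way would break the promise made by the annotation, which is why the compiler is free to load all three operands up front there.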

[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

2021-04-08 Thread andysem at mail dot ru via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #1 from andysem at mail dot ru ---
For reference, an ideal version of this code should look something like this:

test(A&, A const&, A const&):
movdqu  (%rsi), %xmm0
movdqu  (%rdi), %xmm1
movdqu  (%rdx), %xmm2
paddd   %xmm1, %xmm0
psubd   %xmm2, %xmm0
movups  %xmm0, (%rdi)
ret
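
The same sequence can be written out by hand with the GCC/Clang vector extension. This is a sketch under assumptions, not the reporter's code: struct A is a hypothetical stand-in with four 32-bit int fields, and test_vec and check_vec are made-up names. Because all three operands are read before x is written, it only matches the two-statement source when x does not alias y2.

```cpp
#include <cassert>
#include <cstring>

// GCC/Clang vector extension: four 32-bit ints in one 16-byte vector.
typedef int v4si __attribute__((vector_size(16)));

// Hypothetical stand-in for the reporter's type.
struct A { int a, b, c, d; };
static_assert(sizeof(A) == sizeof(v4si), "A must fill exactly one vector");

// Load all three operands, add, subtract, store once.
void test_vec(A& x, A const& y1, A const& y2)
{
    v4si vx, v1, v2;
    std::memcpy(&vx, &x,  sizeof vx);   // unaligned load of x
    std::memcpy(&v1, &y1, sizeof v1);   // unaligned load of y1
    std::memcpy(&v2, &y2, sizeof v2);   // unaligned load of y2
    v4si r = vx + v1 - v2;              // element-wise add, then subtract
    std::memcpy(&x, &r, sizeof r);      // single unaligned store back to x
}

// Self-check with three distinct objects.
void check_vec()
{
    A x{1, 2, 3, 4};
    A const y1{10, 20, 30, 40}, y2{5, 6, 7, 8};
    test_vec(x, y1, y2);
    assert(x.a == 6 && x.b == 16 && x.c == 26 && x.d == 36);
}
```

The memcpy calls are used to express unaligned vector loads and stores without relying on target-specific intrinsics.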