https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87743

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-10-25
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  It's a cost-model issue.  With GCC 7 the vectorization with AVX256
was not profitable so AVX128 was chosen:

t.c:12:1: note: Final SLP tree for instance:
t.c:12:1: note: node
t.c:12:1: note:         stmt 0 dst[0] = _11;
t.c:12:1: note:         stmt 1 dst[1] = _17;
t.c:12:1: note:         stmt 2 dst[2] = _23;
t.c:12:1: note:         stmt 3 dst[3] = _29;
t.c:12:1: note: node (external)
t.c:12:1: note:         stmt 0 _11 = (long long int) _10;
t.c:12:1: note:         stmt 1 _17 = (long long int) _16;
t.c:12:1: note:         stmt 2 _23 = (long long int) _22;
t.c:12:1: note:         stmt 3 _29 = (long long int) _28;
t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 2
  Vector prologue cost: 3
  Vector epilogue cost: 0
  Scalar cost of basic block: 4
t.c:12:1: note: not vectorized: vectorization is not profitable.
t.c:12:1: note: ***** Re-trying analysis with vector size 16

but with GCC 8 we now say

t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 20
  Vector prologue cost: 28
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.c:12:1: note: Basic block will be vectorized using SLP
t.c:12:1: note: SLPing BB part

Costs on trunk are the same (the above is for generic; for haswell the
vector inside cost is even lower, 12 instead of 20).
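
To make the two decisions above concrete, here is a minimal sketch of the
comparison the dumps imply.  The struct and function names are hypothetical
and this is not the vectorizer's actual code; it assumes the total vector
cost is inside + prologue + epilogue and that a tie goes to the vector side,
which is consistent with 2 + 3 > 4 being rejected and 20 + 28 == 48 being
accepted.

```c
#include <stdbool.h>

/* Costs as printed on the "Cost model analysis" dump lines.
   Hypothetical container, for illustration only.  */
struct bb_cost
{
  int vec_inside;    /* Vector inside of basic block cost */
  int vec_prologue;  /* Vector prologue cost */
  int vec_epilogue;  /* Vector epilogue cost */
  int scalar;        /* Scalar cost of basic block */
};

/* Hypothetical helper: reject vectorization only when the summed vector
   cost exceeds the scalar cost.  Equal cost counts as profitable; that
   tie-breaking rule is an assumption inferred from the GCC 8 dump.  */
bool
bb_slp_profitable (struct bb_cost c)
{
  int vec_total = c.vec_inside + c.vec_prologue + c.vec_epilogue;
  return vec_total <= c.scalar;
}
```

With the GCC 7 numbers {2, 3, 0, 4} this returns false, and with the GCC 8
numbers {20, 28, 0, 48} it returns true, matching the two dumps.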

So we end up with

  <bb 2> [local count: 214748369]:
  _10 = src[0];
  _11 = (long long int) _10;
  _16 = src[1];
  _17 = (long long int) _16;
  _22 = src[2];
  _23 = (long long int) _22;
  _28 = src[3];
  _29 = (long long int) _28;
  _13 = {_11, _17, _23, _29};
  vect_cst__19 = _13;
  MEM[(long long int *)&dst] = vect_cst__19;

Note this just costs the vector construction + vector store against
the four scalar stores.

Note that my patches to consider both vector sizes wouldn't handle this
either, since I didn't update them to work for BB vectorization (and they
are not on trunk yet anyway).  It would be an apples-to-oranges comparison
in any case, since the scalar cost differs (the SLP tree is different for
AVX128).  For reference, costing for AVX128 is

t.c:12:1: note:  Cost model analysis:
  Vector inside of basic block cost: 44
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 96

(haswell).  So if you scale the vector cost of 44 by 0.5 because the scalar
cost is doubled (96 vs. 48), you end up at 22, which would compare favorably
to 12 + 28 == 40.
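
The scaling argument can be sketched as follows.  The helper name is
hypothetical, and using 48 as the haswell AVX256 scalar baseline is an
assumption taken from the "scalar cost is doubled" remark rather than from
a dump shown here.

```c
/* Hypothetical helper: express a vector cost that was measured against
   SCALAR_COST in terms of a different scalar baseline REF_SCALAR_COST,
   so that two SLP trees with different scalar costs can be compared.  */
double
rescale_vector_cost (int vec_cost, int scalar_cost, int ref_scalar_cost)
{
  return (double) vec_cost * ref_scalar_cost / scalar_cost;
}
```

For the numbers above, rescale_vector_cost (44, 96, 48) gives 22, to be
compared with the AVX256 total of 12 + 28 == 40.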


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
