[Bug target/79262] [6/7 Regression] load gap with store gap causing performance regression in 462.libquantum

2017-03-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

Richard Biener changed:

       What|Removed |Added

   Priority|P3      |P2

[Bug target/79262] [6/7 Regression] load gap with store gap causing performance regression in 462.libquantum

2017-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

Richard Biener changed:

       What|Removed             |Added

  Component|tree-optimization   |target

--- Comment #2 from Richard Biener ---
On x86_64 core-avx2 we get

t.c:18:3: note: Cost model analysis:
  Vector inside of loop cost: 9
  Vector prologue cost: 7
  Vector epilogue cost: 3
  Scalar iteration cost: 3
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 1
t.c:18:3: note: cost model: the vector iteration cost = 9 divided by the scalar
iteration cost = 3 is greater or equal to the vectorization factor = 2.
t.c:18:3: note: not vectorized: vectorization not profitable.
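
The rejection in the dump above comes down to one comparison. A minimal sketch of it, assuming the check is simply "one vector iteration must cost less than the VF scalar iterations it replaces"; the function name is hypothetical and this is not the vectorizer's actual code:

```c
#include <stdbool.h>

/* Hypothetical helper, not GCC's implementation.  Mirrors the dump
   message: vectorization is rejected when
   vec_inside_cost / scalar_iter_cost >= vf.  The test is written as a
   multiplication so that no integer division is needed.  */
bool
vector_iteration_profitable (int vec_inside_cost, int scalar_iter_cost,
                             int vf)
{
  /* One vector iteration replaces vf scalar iterations, so it has to be
     strictly cheaper than vf times the scalar iteration cost.  */
  return scalar_iter_cost * vf > vec_inside_cost;
}
```

With the numbers from the dump (vector inside cost 9, scalar iteration cost 3, VF 2) the scalar side is 3 * 2 = 6, which is not greater than 9, so the loop is reported as not profitable.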

Forcing avx128 and disabling the cost model, we'd get

.L4:
        vmovdqu (%rax), %xmm0
        vpunpcklqdq 16(%rax), %xmm0, %xmm0
        addl    $1, %ecx
        addq    $32, %rax
        vpxor   %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, -32(%rax)
        vpextrq $1, %xmm0, -16(%rax)
        cmpl    %r9d, %ecx
        jb      .L4

vs.

.L3:
        movslq  %edx, %rax
        addl    $1, %edx
        salq    $4, %rax
        xorq    %rdi, 8(%rsi,%rax)
        cmpl    %r8d, %edx
        jge     .L7

Note that one of the issues with the scalar store cost model is that it re-uses
vec_to_scalar, which was originally meant only for the cost of moving a vector
reduction result to a scalar reg (aka zero on x86_64).  We failed to add a
"simple" vec_extract_element cost.

The avx256 code looks like

.L4:
        vmovdqu (%rdx), %ymm0
        vpunpcklqdq 32(%rdx), %ymm0, %ymm0
        addl    $1, %esi
        addq    $64, %rdx
        vpermq  $216, %ymm0, %ymm0
        vpxor   %ymm2, %ymm0, %ymm0
        vmovq   %xmm0, -64(%rdx)
        vpextrq $1, %xmm0, -48(%rdx)
        vextracti128 $0x1, %ymm0, %xmm0
        vmovq   %xmm0, -32(%rdx)
        vpextrq $1, %xmm0, -16(%rdx)
        cmpl    %r9d, %esi
        jb      .L4

Given that x86_64 can successfully cost-model this (and reject the
vectorization), this is a target issue.
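
For reference, the scalar loop quoted in this comment (index scaled by 16, XOR with a loop-invariant value at offset 8) is consistent with a source loop of roughly the following shape. This is a reconstruction from the assembly, not the actual t.c; the struct layout and all names are assumptions. The second function emulates in portable C what the VF = 2 code does: gather one field from two elements, XOR, then store each lane back separately.

```c
#include <stdint.h>

/* Assumed layout: a 16-byte element whose updated field sits at
   offset 8.  Field names are hypothetical.  */
struct node
{
  float amplitude[2];   /* 8 bytes the loop never touches (the gap) */
  uint64_t state;       /* 8 bytes at offset 8, XORed every iteration */
};

/* Scalar form: one read-modify-write per element.  */
void
xor_state_scalar (struct node *nodes, int n, uint64_t mask)
{
  for (int i = 0; i < n; i++)
    nodes[i].state ^= mask;
}

/* Emulation of the VF = 2 form: two loads with a gap between them,
   one XOR on both lanes, then two separate stores with a gap, plus a
   scalar epilogue for a leftover element.  */
void
xor_state_pairs (struct node *nodes, int n, uint64_t mask)
{
  int i = 0;
  for (; i + 1 < n; i += 2)
    {
      uint64_t lane0 = nodes[i].state;      /* load, skipping amplitude */
      uint64_t lane1 = nodes[i + 1].state;
      lane0 ^= mask;
      lane1 ^= mask;
      nodes[i].state = lane0;               /* per-lane extract + store */
      nodes[i + 1].state = lane1;
    }
  for (; i < n; i++)                        /* epilogue iteration */
    nodes[i].state ^= mask;
}
```

Both functions compute the same result. The second only makes visible why the per-lane extract-and-store work has to be costed for the store side.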