https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262
Richard Biener changed:
What      |Removed           |Added
Component |tree-optimization |target
--- Comment #2 from Richard Biener ---
On x86_64 core-avx2 we get
t.c:18:3: note: Cost model analysis:
Vector inside of loop cost: 9
Vector prologue cost: 7
Vector epilogue cost: 3
Scalar iteration cost: 3
Scalar outside cost: 6
Vector outside cost: 10
prologue iterations: 0
epilogue iterations: 1
t.c:18:3: note: cost model: the vector iteration cost = 9 divided by the scalar iteration cost = 3 is greater or equal to the vectorization factor = 2.
t.c:18:3: note: not vectorized: vectorization not profitable.
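The rejection above comes down to a single comparison of the numbers in the dump. A minimal sketch of that check, assuming only what the note says (the helper name is hypothetical and this is not the vectorizer's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not GCC's real function: one vector iteration
   replaces vf scalar iterations, so it only pays off if it is cheaper
   than vf scalar iterations.  Equivalent to the dump's wording
   "vector iteration cost / scalar iteration cost < vf".  */
bool vector_iteration_is_profitable(int vec_inside_cost,
                                    int scalar_iter_cost,
                                    int vf)
{
    assert(scalar_iter_cost > 0 && vf > 0);
    return scalar_iter_cost * vf > vec_inside_cost;
}
```

With the values from the dump (vector inside cost 9, scalar iteration cost 3, vectorization factor 2) the check compares 3 * 2 = 6 against 9 and returns false, which matches "vectorization not profitable".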
Forcing avx128 and disabling the cost model, we'd get
.L4:
vmovdqu (%rax), %xmm0
vpunpcklqdq 16(%rax), %xmm0, %xmm0
addl $1, %ecx
addq $32, %rax
vpxor %xmm1, %xmm0, %xmm0
vmovq %xmm0, -32(%rax)
vpextrq $1, %xmm0, -16(%rax)
cmpl %r9d, %ecx
jb .L4
vs.
.L3:
movslq %edx, %rax
addl $1, %edx
salq $4, %rax
xorq %rdi, 8(%rsi,%rax)
cmpl %r8d, %edx
jge .L7
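The test case is not quoted in this comment, so the following is only a hypothetical reconstruction of the kind of loop the scalar code above suggests: a 64-bit XOR into the field at offset 8 of 16-byte elements. The struct, field, and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte element; only the second field is modified. */
struct node {
    uint64_t amplitude;  /* offset 0, never touched by the loop */
    uint64_t state;      /* offset 8, matches 8(%rsi,%rax) above */
};

static_assert(sizeof(struct node) == 16, "element stride is 16 bytes");
static_assert(offsetof(struct node, state) == 8, "state lives at offset 8");

/* Strided read-modify-write: each iteration loads and stores 8 bytes
   and skips the neighbouring 8 bytes, so a vectorized version has to
   gather the loads and scatter the stores element by element.  */
void toggle_state(struct node *nodes, int n, uint64_t mask)
{
    for (int i = 0; i < n; i++)
        nodes[i].state ^= mask;
}
```

The gap between consecutive `state` fields is why the vector versions need vpunpcklqdq on the load side and vmovq/vpextrq pairs on the store side.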
Note that one of the issues with the scalar store cost model is that it re-uses
vec_to_scalar, which was originally meant only for the cost of moving a vector
reduction result to a scalar reg (aka zero on x86_64). We failed to add a
"simple" vec_extract_element cost.
The avx256 code looks like
.L4:
vmovdqu (%rdx), %ymm0
vpunpcklqdq 32(%rdx), %ymm0, %ymm0
addl $1, %esi
addq $64, %rdx
vpermq $216, %ymm0, %ymm0
vpxor %ymm2, %ymm0, %ymm0
vmovq %xmm0, -64(%rdx)
vpextrq $1, %xmm0, -48(%rdx)
vextracti128 $0x1, %ymm0, %xmm0
vmovq %xmm0, -32(%rdx)
vpextrq $1, %xmm0, -16(%rdx)
cmpl %r9d, %esi
jb .L4
Given that x86_64 can successfully cost-model this (reject the vectorization),
this is a target issue.