https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123163
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
*_3.x 1 times scalar_load costs 12 in epilogue
_4 + 18446744073709551608 1 times scalar_stmt costs 4 in epilogue
_5 1 times scalar_store costs 12 in epilogue
<unknown> 1 times cond_branch_taken costs 16 in epilogue
t.c:7:23: note: Cost model analysis:
Vector inside of loop cost: 64
Vector prologue cost: 12
Vector epilogue cost: 44
Scalar iteration cost: 28
Scalar outside cost: 32
Vector outside cost: 56
prologue iterations: 0
epilogue iterations: 1
t.c:7:23: missed: cost model: the vector iteration cost = 64 divided by the
scalar iteration cost = 28 is greater or equal to the vectorization factor = 2.
t.c:7:23: missed: not vectorized: vectorization not profitable.
t.c:7:23: missed: not vectorized: vector version will never be profitable.
t.c:7:23: missed: Loop costings may not be worthwhile.
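To make the arithmetic in the dump explicit, here is a minimal sketch of the comparison the "missed" note describes. The helper names are hypothetical and this is not GCC's actual implementation; it only replays the numbers printed above, and it assumes the four "in epilogue" lines shown are the complete epilogue.

```c
#include <stdbool.h>

/* Costs copied from the dump above. */
enum {
    VEC_INSIDE_COST   = 64,  /* "Vector inside of loop cost" */
    VEC_PROLOGUE_COST = 12,  /* "Vector prologue cost" */
    VEC_EPILOGUE_COST = 44,  /* "Vector epilogue cost" */
    SCALAR_ITER_COST  = 28,  /* "Scalar iteration cost" */
    VF                = 2    /* vectorization factor */
};

/* Hypothetical helper: one vector iteration replaces vf scalar
   iterations, so if it costs at least as much as those vf scalar
   iterations, the vector body can never win, whatever the trip count.
   This is the same test as "vector cost / scalar cost >= vf". */
bool vector_body_never_pays_off(int vec_inside, int scalar_iter, int vf)
{
    return vec_inside >= scalar_iter * vf;
}

/* Sum of the four per-statement epilogue costs listed in the dump:
   scalar_load + scalar_stmt + scalar_store + cond_branch_taken. */
int epilogue_sum(void)
{
    return 12 + 4 + 12 + 16;
}
```

With the values above, 64 >= 28 * 2 = 56 holds, which is why the dump reports that the vector version will never be profitable. The per-statement epilogue costs add up to the reported 44, and prologue plus epilogue gives the reported vector outside cost of 56.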
The issue is that the p[i].x are not contiguous; there's 'next' in between.
With just x86-64-v2 (SSE up to SSE4.2), there's no benefit to performing
scalar loads of two pointers, composing a vector, subtracting 8, and
decomposing it again for the scalar stores. You'd get
.L4:
movdqu (%rax), %xmm0
pinsrq $1, 16(%rax), %xmm0
addq $32, %rax
paddq %xmm1, %xmm0
movq %xmm0, -32(%rax)
pextrq $1, %xmm0, -16(%rax)
cmpq %rax, %rdx
jne .L4
Even with x86-64-v3 (aka AVX2) you get
.L4:
vmovdqu (%rax), %ymm0
vpunpcklqdq 32(%rax), %ymm0, %ymm0
addq $64, %rax
vpermq $216, %ymm0, %ymm0
vpaddq %ymm2, %ymm0, %ymm0
vmovq %xmm0, -64(%rax)
vpextrq $1, %xmm0, -48(%rax)
vextracti128 $0x1, %ymm0, %xmm0
vmovq %xmm0, -32(%rax)
vpextrq $1, %xmm0, -16(%rax)
cmpq %rcx, %rax
jne .L4
and that's not deemed profitable either.
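For readers without the testcase at hand, the access pattern under discussion can be sketched roughly as follows. This is an assumption reconstructed from the comment and the assembly (a 16-byte stride with the updated member at offset 0), not the reporter's actual code; the names `struct node` and `adjust` and the `char *` member type are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the testcase's struct: the member being
   updated ('x') is followed by 'next', so consecutive p[i].x values
   sit two pointer sizes apart instead of being contiguous. */
struct node {
    char *x;
    struct node *next;
};

/* Each iteration is one scalar load, one add of -8 and one scalar
   store. Vectorizing it means gathering every other pointer into a
   vector and scattering the results back with scalar stores, which is
   the shuffle and extract overhead visible in the listings above. */
void adjust(struct node *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i].x -= 8;
}
```

If the 'next' members were not interleaved (for example with a plain array of pointers), the loads and stores would be contiguous and the vector code would need none of the insert, permute or extract instructions.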
For 'baz' the issue is indeed that with N == 16 all loops get unrolled
and the vec[] temporary array is elided, so it's the same issue as above.
So IMO it all works as intended?