https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118145
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What            |Removed    |Added
----------------------------------------------------------------------------
           Target Milestone|---        |14.3
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I warned that having reduc_* patterns for two-lane vectors could have this
side-effect. On x86 specifically, load and store costs are high relative to
the operation cost, so saving one scalar load buys a fairly large "buffer"
that can be spent on slow vector ops.
On trunk with SSE4 we are using ptest + sete instead of moving to GPR:
_Z13canEncodeZeroPKh:
.LFB0:
.cfi_startproc
movdqu (%rdi), %xmm0
ptest %xmm0, %xmm0
sete %al
that might be superior - but it also shows that costing is difficult. The
vectorizer itself does not take into account that the reduction result is
used only by a comparison.
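For reference, a hypothetical reconstruction of the testcase (the function
name is recovered from the mangled symbol _Z13canEncodeZeroPKh; the body is
an assumption based on the ptest codegen above and the cost dump below, and
may differ from the actual PR testcase):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical reconstruction: load the 16-byte buffer as two
// unsigned longs and test whether all bits are zero.  The OR-based
// check matches the ptest/sete codegen; the "plus" variant discussed
// below would use lo + hi instead of lo | hi.
bool canEncodeZero(const unsigned char *buffer)
{
    std::uint64_t lo, hi;
    std::memcpy(&lo, buffer, sizeof lo);
    std::memcpy(&hi, buffer + 8, sizeof hi);
    return (lo | hi) == 0;
}
```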
For the plus variant it shows that the saved scalar load makes up for the
cost of the extra stmts (we do not consider code size or dependence chain
length).
t.ii:6:12: note: Cost model analysis:
_4 + _5 1 times scalar_stmt costs 4 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D)] 1 times scalar_load costs 12 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D) + 8B] 1 times scalar_load costs 12 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D)] 1 times unaligned_load (misalign -1) costs 12 in body
_4 + _5 1 times vector_stmt costs 4 in body
_4 + _5 1 times vec_perm costs 4 in body
_4 + _5 1 times vec_to_scalar costs 4 in body
_4 + _5 0 times scalar_stmt costs 0 in body
t.ii:6:12: note: Cost model analysis for part in loop 0:
Vector cost: 24
Scalar cost: 28