https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|NEW                         |ASSIGNED
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
          Component|target                      |tree-optimization
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So compared to the already mitigated PR84101 this one returns in

(parallel:TI [
    (expr_list:REG_DEP_TRUE (reg:DF 20 xmm0) (const_int 0 [0]))
    (expr_list:REG_DEP_TRUE (reg:DF 21 xmm1) (const_int 8 [0x8]))
  ])

so I wonder how targets represent returning the _same_ value in two
different locations.  I also wonder whether the above is any standard form.

Covering the above in the same way as PR84101 doesn't prevent vectorization:

t.c:8:14: note: Cost model analysis:
  Vector inside of basic block cost: 48
  Vector prologue cost: 0
  Vector epilogue cost: 36
  Scalar cost of basic block: 96
t.c:8:14: note: Basic block will be vectorized using SLP

even though we've added one vector store and two scalar loads to pessimize
it.  Here the vector construction from the argument passing comes into play,
which also isn't "pessimized" accordingly because the vectorizer thinks it
can perform vector loads from memory here, replacing four scalar loads
(but in fact x.x1, y.x1, x.x2 and y.x2 are readily available in %xmm0 to
%xmm3).

Another interesting testcase is

typedef struct { float x1; float x2; float x3; float x4; } vfloat
  __attribute__((aligned(16)));

vfloat f (vfloat x, vfloat y)
{
  return (vfloat){ x.x1 + y.x1, x.x2 + y.x2, x.x3 + y.x3, x.x4 + y.x4 };
}

where the _scalar_ code is really awkward (even though for the vector case
the vectorizer still thinks it is all memory).
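The four-float testcase above can be exercised standalone; here is a minimal harness around it (the check values and the driver are mine, not part of the original report):

```c
#include <assert.h>

/* Testcase from the report: a 16-byte-aligned struct of four floats,
   returned and passed in SSE registers under the x86-64 psABI.  */
typedef struct { float x1; float x2; float x3; float x4; } vfloat
  __attribute__((aligned(16)));

vfloat f (vfloat x, vfloat y)
{
  return (vfloat){ x.x1 + y.x1, x.x2 + y.x2, x.x3 + y.x3, x.x4 + y.x4 };
}
```

Compiling this with -O2 reproduces the scalar code shown below; -O2 with the basic-block vectorizer enabled gives the vectorized variant.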
Passing two floats in a single 8-byte register is a really odd ABI choice:

f:
.LFB0:
        .cfi_startproc
        movq    %xmm0, %rdi
        movq    %xmm2, %r9
        movq    %xmm1, %rdx
        movq    %xmm3, %rcx
        shrq    $32, %rdi
        addss   %xmm2, %xmm0
        shrq    $32, %r9
        shrq    $32, %rdx
        movd    %edi, %xmm5
        shrq    $32, %rcx
        movd    %r9d, %xmm6
        movd    %edx, %xmm7
        addss   %xmm6, %xmm5
        movd    %ecx, %xmm4
        movaps  %xmm1, %xmm6
        addss   %xmm4, %xmm7
        addss   %xmm3, %xmm6
        movd    %xmm5, %eax
        movq    %rax, %rsi
        movd    %xmm7, %edx
        movd    %xmm0, %eax
        salq    $32, %rsi
        salq    $32, %rdx
        movd    %xmm6, %ecx
        orq     %rsi, %rax
        orq     %rcx, %rdx
        movq    %rdx, %xmm1
        movq    %rax, %xmm0
        ret

vs. vectorized:

f:
.LFB0:
        .cfi_startproc
        movq    %xmm1, -32(%rsp)
        movq    %xmm0, -40(%rsp)
        movaps  -40(%rsp), %xmm4
        movq    %xmm2, -24(%rsp)
        movq    %xmm3, -16(%rsp)
        addps   -24(%rsp), %xmm4
        movaps  %xmm4, -40(%rsp)
        movq    -32(%rsp), %rax
        movq    -40(%rsp), %xmm0
        movq    %rax, %xmm1
        movq    %rax, -24(%rsp)
        ret

Here the spilling is still bad (store-to-load forwarding penalties), but in
the end vectorizing this makes sense and we want to preserve that.

Lame attempt to handle PARALLELs:

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 270123)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -1076,23 +1076,24 @@ vect_model_store_cost (stmt_vec_info stm
       && !aggregate_value_p (base, cfun->decl))
     {
       rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
-      /* ??? Handle PARALLEL in some way.  */
+      int nregs = 1;
       if (REG_P (reg))
+       nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
+      /* ??? Handle PARALLEL better (see PR89582 for an example).  */
+      else if (GET_CODE (reg) == PARALLEL)
+       nregs = XVECLEN (reg, 0);
+      /* Assume that a single reg-reg move is possible and cheap,
+        do not account for vector to gp register move cost.  */
+      if (nregs > 1)
        {
-         int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
-         /* Assume that a single reg-reg move is possible and cheap,
-            do not account for vector to gp register move cost.  */
-         if (nregs > 1)
-           {
-             /* Spill.  */
-             prologue_cost += record_stmt_cost (cost_vec, ncopies,
-                                                vector_store,
-                                                stmt_info, 0, vect_epilogue);
-             /* Loads.  */
-             prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
-                                                scalar_load,
-                                                stmt_info, 0, vect_epilogue);
-           }
+         /* Spill.  */
+         prologue_cost += record_stmt_cost (cost_vec, ncopies,
+                                            vector_store,
+                                            stmt_info, 0, vect_epilogue);
+         /* Loads.  */
+         prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
+                                            scalar_load,
+                                            stmt_info, 0, vect_epilogue);
        }
     }
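Schematically, the accounting the patch introduces can be modeled as a standalone function.  The unit costs and names below are illustrative stand-ins for GCC's record_stmt_cost machinery, not real internals:

```c
/* Model of the patched cost accounting: when the returned aggregate
   needs more than one hard register (nregs comes from hard_regno_nregs
   for a plain REG, or XVECLEN for a PARALLEL), charge one vector store
   (the spill) plus nregs scalar loads per copy.  A single-register
   return is assumed to be a cheap reg-reg move and costs nothing extra.
   VECTOR_STORE_COST and SCALAR_LOAD_COST are made-up constants.  */
enum { VECTOR_STORE_COST = 12, SCALAR_LOAD_COST = 12 };

int
return_pessimization_cost (int nregs, int ncopies)
{
  if (nregs <= 1)
    return 0;                                    /* cheap reg-reg move */
  return ncopies * VECTOR_STORE_COST             /* spill */
         + ncopies * nregs * SCALAR_LOAD_COST;   /* reloads */
}
```

With two return registers and one copy, this model charges one spill plus two reloads, matching the "one vector store and two scalar loads" pessimization described in the comment above.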