https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|NEW                         |ASSIGNED
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
          Component|target                      |tree-optimization
          Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So compared to the already mitigated PR84101, in this one the return value
comes back as

(parallel:TI [
        (expr_list:REG_DEP_TRUE (reg:DF 20 xmm0)
            (const_int 0 [0]))
        (expr_list:REG_DEP_TRUE (reg:DF 21 xmm1)
            (const_int 8 [0x8]))
    ])

so I wonder how targets represent returning the _same_ value in
two different locations.  I also wonder whether the above is any
standard form.

Covering the above in the same way as PR84101 doesn't prevent vectorization:

t.c:8:14: note:  Cost model analysis:
  Vector inside of basic block cost: 48
  Vector prologue cost: 0
  Vector epilogue cost: 36
  Scalar cost of basic block: 96
t.c:8:14: note:  Basic block will be vectorized using SLP

even though we've added one vector store and two scalar loads to pessimize it.
Here the vector construction from the argument passing also comes into
play; it isn't "pessimized" accordingly because the vectorizer thinks it
can perform vector loads from memory here, replacing four scalar loads
(when in fact x.x1, y.x1, x.x2 and y.x2 are readily available in %xmm0 to
%xmm3).

Another interesting testcase is

typedef struct {
        float x1;
        float x2;
        float x3;
        float x4;
} vfloat __attribute__((aligned(16)));

vfloat f(vfloat x, vfloat y)
{
  return (vfloat){x.x1 + y.x1, x.x2 + y.x2, x.x3 + y.x3, x.x4 + y.x4};
}

where the _scalar_ code is really awkward (even though in the vector
case the vectorizer still thinks it is all memory).  Passing two
floats in a single 8-byte register is a really odd ABI choice:

f:
.LFB0:
        .cfi_startproc
        movq    %xmm0, %rdi
        movq    %xmm2, %r9
        movq    %xmm1, %rdx
        movq    %xmm3, %rcx
        shrq    $32, %rdi
        addss   %xmm2, %xmm0
        shrq    $32, %r9
        shrq    $32, %rdx
        movd    %edi, %xmm5
        shrq    $32, %rcx
        movd    %r9d, %xmm6
        movd    %edx, %xmm7
        addss   %xmm6, %xmm5
        movd    %ecx, %xmm4
        movaps  %xmm1, %xmm6
        addss   %xmm4, %xmm7
        addss   %xmm3, %xmm6
        movd    %xmm5, %eax
        movq    %rax, %rsi
        movd    %xmm7, %edx
        movd    %xmm0, %eax
        salq    $32, %rsi
        salq    $32, %rdx
        movd    %xmm6, %ecx
        orq     %rsi, %rax
        orq     %rcx, %rdx
        movq    %rdx, %xmm1
        movq    %rax, %xmm0
        ret

vs. vectorized:

f:
.LFB0:
        .cfi_startproc
        movq    %xmm1, -32(%rsp)
        movq    %xmm0, -40(%rsp)
        movaps  -40(%rsp), %xmm4
        movq    %xmm2, -24(%rsp)
        movq    %xmm3, -16(%rsp)
        addps   -24(%rsp), %xmm4
        movaps  %xmm4, -40(%rsp)
        movq    -32(%rsp), %rax
        movq    -40(%rsp), %xmm0
        movq    %rax, %xmm1
        movq    %rax, -24(%rsp)
        ret

here the spilling is still bad (store-to-load-forwarding penalties), but in
the end vectorizing this makes sense and we want to preserve that.

Lame attempt to handle PARALLELs:

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 270123)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -1076,23 +1076,24 @@ vect_model_store_cost (stmt_vec_info stm
       && !aggregate_value_p (base, cfun->decl))
     {
       rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
-      /* ???  Handle PARALLEL in some way.  */
+      int nregs = 1;
       if (REG_P (reg))
+       nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
+      /* ???  Handle PARALLEL better (see PR89582 for an example).  */
+      else if (GET_CODE (reg) == PARALLEL)
+       nregs = XVECLEN (reg, 0);
+      /* Assume that a single reg-reg move is possible and cheap,
+        do not account for vector to gp register move cost.  */
+      if (nregs > 1)
        {
-         int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
-         /* Assume that a single reg-reg move is possible and cheap,
-            do not account for vector to gp register move cost.  */
-         if (nregs > 1)
-           {
-             /* Spill.  */
-             prologue_cost += record_stmt_cost (cost_vec, ncopies,
-                                                vector_store,
-                                                stmt_info, 0, vect_epilogue);
-             /* Loads.  */
-             prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
-                                                scalar_load,
-                                                stmt_info, 0, vect_epilogue);
-           }
+         /* Spill.  */
+         prologue_cost += record_stmt_cost (cost_vec, ncopies,
+                                            vector_store,
+                                            stmt_info, 0, vect_epilogue);
+         /* Loads.  */
+         prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
+                                            scalar_load,
+                                            stmt_info, 0, vect_epilogue);
        }
     }
