The following tries to improve PR92645 in a minimally invasive way.
Currently, as a heuristic, the BB vectorizer throws away a vector stmt
when all of its operands need to be built via a vector CTOR.
That makes sense unless the stmt needs only a single such vector CTOR,
in which case keeping it would still mean eliding N scalar stmts.
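
As a rough illustration of the condition change, here is a simplified
model of the check.  The struct and function names are hypothetical and
are not GCC's real slp_tree API; the other parts of the real condition
(BB vectorization only, the pattern-stmt check) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for an SLP node; not GCC's real slp_tree.  */
struct node_model
{
  size_t n_children;                      /* number of operand nodes */
  bool child_from_scalars[MAX_CHILDREN];  /* operand i needs a vector CTOR */
};

/* True if every operand of NODE has to be built from scalars.  */
static bool
all_children_from_scalars (const struct node_model *node)
{
  assert (node->n_children <= MAX_CHILDREN);
  for (size_t i = 0; i < node->n_children; ++i)
    if (!node->child_from_scalars[i])
      return false;
  return true;
}

/* Old check: any node with at least one operand can be thrown away.  */
bool
discard_old (const struct node_model *node)
{
  return node->n_children != 0 && all_children_from_scalars (node);
}

/* New check: a node with a single operand is kept, since it costs only
   one vector CTOR.  */
bool
discard_new (const struct node_model *node)
{
  return node->n_children > 1 && all_children_from_scalars (node);
}
```

In this model a unary node whose only operand comes from scalars is
discarded by the old check but kept by the new one; a binary node with
both operands from scalars is discarded by both.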

With this, foo() in the testcase generates optimal code using
unpacks (with some help from forwprop after BB SLP).
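
For reference, the shape being aimed at can be written by hand with
GCC's vector extensions.  This is only a sketch: the v4hi type, the
function names and the use of __builtin_convertvector (available in
GCC 9 and later) are assumptions made for illustration, and it is not
the GIMPLE the vectorizer emits.

```c
#include <assert.h>
#include <string.h>

typedef unsigned short v8hi __attribute__((vector_size(16)));
typedef unsigned short v4hi __attribute__((vector_size(8)));
typedef unsigned int v4si __attribute__((vector_size(16)));

/* Scalar form, equivalent to foo() in the testcase; memcpy is used
   here instead of the pointer casts.  */
void
foo_scalar (v4si *dst, v8hi src)
{
  unsigned int tem[8];
  for (int i = 0; i < 8; ++i)
    tem[i] = src[i];
  memcpy (dst, tem, sizeof tem);
}

/* Hand-written form: split SRC into its low and high halves and widen
   each half to four unsigned ints.  */
void
foo_unpacked (v4si *dst, v8hi src)
{
  v4hi lo = { src[0], src[1], src[2], src[3] };
  v4hi hi = { src[4], src[5], src[6], src[7] };
  dst[0] = __builtin_convertvector (lo, v4si);
  dst[1] = __builtin_convertvector (hi, v4si);
}

/* Asserts that both forms produce the same two output vectors.  */
void
check_forms_agree (void)
{
  v8hi src = { 1, 2, 3, 40000, 5, 6, 7, 65535 };
  v4si a[2], b[2];
  foo_scalar (a, src);
  foo_unpacked (b, src);
  assert (memcmp (a, b, sizeof a) == 0);
}
```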

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

Richard.

2019-11-25  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/92645
        * tree-vect-slp.c (vect_build_slp_tree_2): For unary ops
        do not build the operation from scalars if the operand is.

        * gcc.target/i386/pr92645.c: New testcase.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c (revision 278681)
+++ gcc/tree-vect-slp.c (working copy)
@@ -1410,10 +1411,11 @@ vect_build_slp_tree_2 (vec_info *vinfo,
                                        matches, npermutes,
                                        &this_tree_size, bst_map)) != NULL)
        {
-         /* If we have all children of child built up from scalars then just
-            throw that away and build it up this node from scalars.  */
+         /* If we have all children of a non-unary child built up from
+            scalars then just throw that away and build it up this node
+            from scalars.  */
          if (is_a <bb_vec_info> (vinfo)
-             && !SLP_TREE_CHILDREN (child).is_empty ()
+             && SLP_TREE_CHILDREN (child).length () > 1
              /* ???  Rejecting patterns this way doesn't work.  We'd have to
                 do extra work to cancel the pattern so the uses see the
                 scalar version.  */
@@ -1549,10 +1551,11 @@ vect_build_slp_tree_2 (vec_info *vinfo,
                                            tem, npermutes,
                                            &this_tree_size, bst_map)) != NULL)
            {
-             /* If we have all children of child built up from scalars then
-                just throw that away and build it up this node from scalars.  */
+             /* If we have all children of a non-unary child built up from
+                scalars then just throw that away and build it up this node
+                from scalars.  */
              if (is_a <bb_vec_info> (vinfo)
-                 && !SLP_TREE_CHILDREN (child).is_empty ()
+                 && SLP_TREE_CHILDREN (child).length () > 1
                  /* ???  Rejecting patterns this way doesn't work.  We'd have
                     to do extra work to cancel the pattern so the uses see the
                     scalar version.  */
Index: gcc/testsuite/gcc.target/i386/pr92645.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr92645.c     (nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr92645.c     (working copy)
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized -msse2 -Wno-psabi" } */
+
+typedef unsigned short v8hi __attribute__((vector_size(16)));
+typedef unsigned int v4si __attribute__((vector_size(16)));
+
+void bar (v4si *dst, v8hi * __restrict src)
+{
+  unsigned int tem[8];
+  tem[0] = (*src)[0];
+  tem[1] = (*src)[1];
+  tem[2] = (*src)[2];
+  tem[3] = (*src)[3];
+  tem[4] = (*src)[4];
+  tem[5] = (*src)[5];
+  tem[6] = (*src)[6];
+  tem[7] = (*src)[7];
+  dst[0] = *(v4si *)tem;
+  dst[1] = *(v4si *)&tem[4];
+}
+void foo (v4si *dst, v8hi src)
+{
+  unsigned int tem[8];
+  tem[0] = src[0];
+  tem[1] = src[1];
+  tem[2] = src[2];
+  tem[3] = src[3];
+  tem[4] = src[4];
+  tem[5] = src[5];
+  tem[6] = src[6];
+  tem[7] = src[7];
+  dst[0] = *(v4si *)tem;
+  dst[1] = *(v4si *)&tem[4];
+}
+
+/* { dg-final { scan-tree-dump-times "vec_unpack_" 4 "optimized" } } */
