The following tries to improve PR92645 in a minimally invasive way.
Currently, as a heuristic, the BB vectorizer throws away a vector stmt
when all of its operands need to be built via a vector CTOR. That makes
sense unless the stmt needs only a single such vector CTOR, in which
case vectorizing it would still mean eliding N scalar stmts.
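
To make the change in the heuristic concrete, here is a minimal toy model of the child-count part of the condition only. The names toy_node, discard_old and discard_new are hypothetical and do not exist in GCC; the real check in vect_build_slp_tree_2 has further conditions that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an SLP node; this is not GCC's slp_tree.  */
struct toy_node
{
  size_t n_children;         /* number of operands of the stmt */
  const bool *from_scalars;  /* from_scalars[i]: operand i needs a vector CTOR */
};

/* Return true if every operand would have to be built from scalars.  */
static bool
all_children_from_scalars (const struct toy_node *node)
{
  for (size_t i = 0; i < node->n_children; i++)
    if (!node->from_scalars[i])
      return false;
  return true;
}

/* Old rule (assumed reading of the removed lines): discard the vector
   stmt as soon as it has any operand and all operands come from scalars.  */
bool
discard_old (const struct toy_node *node)
{
  assert (node != NULL);
  return node->n_children != 0 && all_children_from_scalars (node);
}

/* New rule (assumed reading of the added lines): only discard when there
   is more than one operand, so a unary stmt fed by a single vector CTOR
   is kept.  */
bool
discard_new (const struct toy_node *node)
{
  assert (node != NULL);
  return node->n_children > 1 && all_children_from_scalars (node);
}
```

In this model a unary stmt whose single operand comes from scalars is discarded by discard_old but kept by discard_new, while a binary stmt with both operands built from scalars is discarded by both.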

With this, foo() in the testcase generates optimal code using unpacks
(with some help from forwprop after BB SLP).

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

Richard.

2019-11-25  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/92645
        * tree-vect-slp.c (vect_build_slp_tree_2): For unary ops
        do not build the operation from scalars if the operand is.

        * gcc.target/i386/pr92645.c: New testcase.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c (revision 278681)
+++ gcc/tree-vect-slp.c (working copy)
@@ -1410,10 +1411,11 @@ vect_build_slp_tree_2 (vec_info *vinfo,
                                        matches, npermutes,
                                        &this_tree_size, bst_map)) != NULL)
        {
-         /* If we have all children of child built up from scalars then just
-            throw that away and build it up this node from scalars. */
+         /* If we have all children of a non-unary child built up from
+            scalars then just throw that away and build it up this node
+            from scalars. */
          if (is_a <bb_vec_info> (vinfo)
-             && !SLP_TREE_CHILDREN (child).is_empty ()
+             && SLP_TREE_CHILDREN (child).length () > 1
              /* ??? Rejecting patterns this way doesn't work. We'd have to
                 do extra work to cancel the pattern so the uses see the
                 scalar version. */
@@ -1549,10 +1551,11 @@ vect_build_slp_tree_2 (vec_info *vinfo,
                                            tem, npermutes,
                                            &this_tree_size, bst_map)) != NULL)
            {
-             /* If we have all children of child built up from scalars then
-                just throw that away and build it up this node from scalars. */
+             /* If we have all children of a non-unary child built up from
+                scalars then just throw that away and build it up this node
+                from scalars. */
              if (is_a <bb_vec_info> (vinfo)
-                 && !SLP_TREE_CHILDREN (child).is_empty ()
+                 && SLP_TREE_CHILDREN (child).length () > 1
                  /* ??? Rejecting patterns this way doesn't work. We'd have to
                     do extra work to cancel the pattern so the uses see the
                     scalar version. */
Index: gcc/testsuite/gcc.target/i386/pr92645.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr92645.c (nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr92645.c (working copy)
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized -msse2 -Wno-psabi" } */
+
+typedef unsigned short v8hi __attribute__((vector_size(16)));
+typedef unsigned int v4si __attribute__((vector_size(16)));
+
+void bar (v4si *dst, v8hi * __restrict src)
+{
+  unsigned int tem[8];
+  tem[0] = (*src)[0];
+  tem[1] = (*src)[1];
+  tem[2] = (*src)[2];
+  tem[3] = (*src)[3];
+  tem[4] = (*src)[4];
+  tem[5] = (*src)[5];
+  tem[6] = (*src)[6];
+  tem[7] = (*src)[7];
+  dst[0] = *(v4si *)tem;
+  dst[1] = *(v4si *)&tem[4];
+}
+void foo (v4si *dst, v8hi src)
+{
+  unsigned int tem[8];
+  tem[0] = src[0];
+  tem[1] = src[1];
+  tem[2] = src[2];
+  tem[3] = src[3];
+  tem[4] = src[4];
+  tem[5] = src[5];
+  tem[6] = src[6];
+  tem[7] = src[7];
+  dst[0] = *(v4si *)tem;
+  dst[1] = *(v4si *)&tem[4];
+}
+
+/* { dg-final { scan-tree-dump-times "vec_unpack_" 4 "optimized" } } */
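
For reference, the value that foo() in the testcase computes can also be written by hand as one widening per half of the source vector, using the same GNU vector extension typedefs. This is only a sketch of the semantics that the widening unpacks matched by "vec_unpack_" are expected to implement; foo_by_halves is a hypothetical name, not part of the patch, and this is not the vectorizer's actual output.

```c
typedef unsigned short v8hi __attribute__((vector_size(16)));
typedef unsigned int v4si __attribute__((vector_size(16)));

/* Hypothetical hand-written equivalent of foo(): zero-extend the first
   four and the last four elements of SRC into two v4si results.  */
void
foo_by_halves (v4si *dst, v8hi src)
{
  /* Each unsigned short element is implicitly converted (zero-extended)
     to unsigned int when it initializes a v4si element.  */
  v4si first = { src[0], src[1], src[2], src[3] };
  v4si second = { src[4], src[5], src[6], src[7] };
  dst[0] = first;
  dst[1] = second;
}
```

Both halves are derived from the single input vector, which is the unary case the patch stops discarding: one vector operand, N scalar conversions elided.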