https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37150

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
We do find SLP opportunities but in the end fail to vectorize with AVX2
because of

t.f90:158:0: note: BB vectorization with gaps at the end of a load is not supported
t.f90:158:0: note: not vectorized: relevant stmt not supported: _1477 = *pol_y_1422(D)[_675];
t.f90:158:0: note: removing SLP instance operations starting from: coef_x[0] = _1604;

      /* ???  The following is overly pessimistic (as well as the loop
         case above) in the case we can statically determine the excess
         elements loaded are within the bounds of a decl that is accessed.
         Likewise for BB vectorizations using masked loads is a possibility.  */
      if (bb_vinfo && slp_perm && group_size % nunits != 0)
        {
          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                           "BB vectorization with gaps at the end of a load "
                           "is not supported\n");
          return false;
        }

This is possibly because we initially detect quite large groups, which are
later split into smaller units for proper SLP detection (but that splitting
only splits the store groups, not the load groups that end up being used).
This means an SLP permutation gets triggered (it even looks required for the
case I'm looking at, which has a {4, 4, 5, 5} permutation but which obviously
only needs a single element and thus would have no issue with "gaps").
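To make the mismatch concrete, here is a minimal sketch in plain C (not GCC
code). rejected_for_gaps and distinct_elements are hypothetical helper names;
the group size of 6 and the 4 lanes per vector used in the note below are
assumed values for illustration. Only the {4, 4, 5, 5} permutation and the
shape of the condition come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the quoted condition: a permuted BB SLP load is rejected
   whenever the group size is not a multiple of the vector lane count.
   Note that it never looks at which group elements are actually used.  */
static bool
rejected_for_gaps (bool bb_vinfo, bool slp_perm,
                   unsigned group_size, unsigned nunits)
{
  assert (nunits != 0);
  return bb_vinfo && slp_perm && group_size % nunits != 0;
}

/* Count the distinct group elements referenced by a load permutation.
   Illustration only: assumes every index is below 64.  */
static unsigned
distinct_elements (const unsigned *perm, unsigned n)
{
  unsigned long long seen = 0;
  unsigned count = 0;
  for (unsigned i = 0; i < n; i++)
    {
      assert (perm[i] < 64);
      if (!(seen & (1ULL << perm[i])))
        {
          seen |= 1ULL << perm[i];
          count++;
        }
    }
  return count;
}
```

With the assumed group size of 6 and 4 lanes, rejected_for_gaps (true, true,
6, 4) is true, while distinct_elements on {4, 4, 5, 5} reports that only two
elements of the group are referenced. The check depends on the size of the
(unsplit) load group alone.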

Basically this means the way we perform load permutation in SLP should be
rewritten (and/or we should also try to split the load groups if all uses can
agree on a set -- remember we key groups on stmts and thus can't have multiple
groups for a stmt...).

We _do_ vectorize this with SSE2 vectors if you disable the cost model,
so the rejection there is only because of:

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 9408
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
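
As a rough sketch of why these numbers lead to rejection: the comparison
below assumes the basic-block cost model simply requires the summed vector
costs to be lower than the scalar cost. bb_vectorization_profitable and
cost_ratio_percent are hypothetical names, not GCC functions; the four
numbers in the note are the ones from the dump above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: vectorization counts as profitable only if the
   total vector cost is strictly below the scalar cost.  */
static bool
bb_vectorization_profitable (unsigned vec_inside, unsigned vec_prologue,
                             unsigned vec_epilogue, unsigned scalar)
{
  return vec_inside + vec_prologue + vec_epilogue < scalar;
}

/* Vector cost as a percentage of scalar cost (integer division).  */
static unsigned
cost_ratio_percent (unsigned vec_total, unsigned scalar)
{
  assert (scalar != 0);
  return vec_total * 100 / scalar;
}
```

bb_vectorization_profitable (9408, 0, 0, 616) is false: the estimated vector
cost is more than 15 times the scalar cost, which is a large enough gap to
suggest the estimate itself may be off.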

Note that we have a lot of SLP instances in this basic block and thus some
of the cost analysis might be totally off (I suspect we are again confused
by the large load groups and our SLP permutation handling there).

              /* And adjust the number of loads performed.  */
              unsigned nunits
                = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
              ncopies_for_cost
                = (GROUP_SIZE (stmt_info) - GROUP_GAP (stmt_info)
                   + nunits - 1) / nunits;
              ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);

First, it doesn't consider CSE between the SLP instances; second,
it also counts loads that will be dead after the permutation has been
applied.  So it's a very conservative estimate.  Let me see if I can
improve things here.  The vectorization _does_ seem to look profitable
(maybe you can benchmark with -fno-vect-cost-model?)
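
For reference, the load-count formula quoted above can be evaluated in
isolation. This is a minimal sketch, not GCC code: the function takes plain
integers in place of the GROUP_SIZE, GROUP_GAP, TYPE_VECTOR_SUBPARTS and
SLP_INSTANCE_UNROLLING_FACTOR accessors, and the input values in the note
below are made up for illustration.

```c
#include <assert.h>

/* Number of vector loads charged for one permuted load group:
   ceil ((group_size - group_gap) / nunits) * unrolling_factor.  */
static unsigned
ncopies_for_cost (unsigned group_size, unsigned group_gap,
                  unsigned nunits, unsigned unrolling_factor)
{
  assert (nunits != 0 && group_gap <= group_size);
  unsigned ncopies = (group_size - group_gap + nunits - 1) / nunits;
  return ncopies * unrolling_factor;
}
```

For an assumed group of 16 elements with no gap and 2 lanes per vector,
ncopies_for_cost (16, 0, 2, 1) charges 8 vector loads to the instance,
regardless of how few of those vectors the permutation actually uses.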
