https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37150
--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
We do find SLP opportunities but in the end fail to vectorize with AVX2
because of

t.f90:158:0: note: BB vectorization with gaps at the end of a load is not supported
t.f90:158:0: note: not vectorized: relevant stmt not supported: _1477 = *pol_y_1422(D)[_675];
t.f90:158:0: note: removing SLP instance operations starting from: coef_x[0] = _1604;

          /* ??? The following is overly pessimistic (as well as the loop
             case above) in the case we can statically determine the excess
             elements loaded are within the bounds of a decl that is
             accessed.  Likewise for BB vectorizations using masked loads
             is a possibility.  */
          if (bb_vinfo && slp_perm && group_size % nunits != 0)
            {
              dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                               "BB vectorization with gaps at the end of a load "
                               "is not supported\n");
              return false;
            }

This is possibly because we initially detect quite large groups which are
later split into smaller units for proper SLP detection (but that splitting
only splits the store groups, not the load groups that end up being used).
This means SLP permutation gets triggered (it even looks required for the
case I'm looking at, which has a {4, 4, 5, 5} permutation but obviously
only needs a single element and thus would have no issue with "gaps").

Basically this means the way we perform load permutation in SLP should be
rewritten (and/or we should also try to split the load groups if all uses
can agree on a set -- remember we key groups on stmts and thus can't have
multiple groups for a stmt...).
We _do_ vectorize this with SSE2 vectors if you disable the cost model,
thus the rejection is only because of

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 9408
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616

Note we have a lot of SLP instances in this basic block and thus some of
the cost analysis might be totally off (I suspect we are again confused
by the large load groups and our SLP permutation handling there).

          /* And adjust the number of loads performed.  */
          unsigned nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
          ncopies_for_cost = (GROUP_SIZE (stmt_info) - GROUP_GAP (stmt_info)
                              + nunits - 1) / nunits;
          ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);

First of all this doesn't consider CSE between the SLP instances; second,
it also counts loads that will be dead after the permutation has been
applied.  So it's a very conservative estimate.  Let me see if I can
improve things here.

The vectorization _does_ seem to be profitable (maybe you can benchmark
with -fno-vect-cost-model?)