https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102421
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86 get_group_load_store_type ends up with VMAT_ELEMENTWISE because

  /* As a last resort, trying using a gather load or scatter store.

     ??? Although the code can handle all group sizes correctly,
     it probably isn't a win to use separate strided accesses based
     on nearby locations.  Or, even if it's a win over scalar code,
     it might not be a win over vectorizing at a lower VF, if that
     allows us to use contiguous accesses.  */
  if (*memory_access_type == VMAT_ELEMENTWISE
      && single_element_p
      && loop_vinfo
      && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
                                             masked_p, gs_info))
    *memory_access_type = VMAT_GATHER_SCATTER;

doesn't trigger, since vect_use_strided_gather_scatters_p only supports
gather IFNs, not builtins.  And then we end up in

  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }

Now, vect_dissolve_slp_only_groups seems to compute a wrong DR_GROUP_SIZE
for the !STMT_VINFO_STRIDED_P case.  It does

          DR_GROUP_SIZE (vinfo) = 1;
          if (STMT_VINFO_STRIDED_P (first_element))
            DR_GROUP_GAP (vinfo) = 0;
          else
            DR_GROUP_GAP (vinfo) = group_size - 1;

and for strided accesses we shouldn't really keep a grouped load.  I really
wonder if we have enough test coverage here to assess correctness;
DR_GROUP_SIZE == 1 in particular looks awfully wrong.  IMHO it should be

          if (STMT_VINFO_STRIDED_P (first_element))
            /* No longer a grouped access.  */
            DR_GROUP_FIRST_ELEMENT (vinfo) = NULL;
          else
            {
              DR_GROUP_SIZE (vinfo) = group_size;
              DR_GROUP_GAP (vinfo) = group_size - 1;
            }

see vect_analyze_group_access_1.  For strided accesses the current settings
might work out fine.