https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109747
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
   Target Milestone|---                           |12.3
             Target|                              |x86_64-*-* i?86-*-*
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                   |ASSIGNED
                 CC|                              |rsandifo at gcc dot gnu.org
     Ever confirmed|0                             |1
           Keywords|                              |missed-optimization
   Last reconfirmed|                              |2023-05-05

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
A fix, and maybe even a step in the right direction, would be to construct
individual new SLP nodes for each call to record_stmt_cost from
vect_prologue_cost_for_slp:

      /* ??? We're just tracking whether vectors in a single node are the same.
         Ideally we'd do something more global.  */
      for (unsigned int start : starts)
        {
          vect_cost_for_stmt kind;
          if (SLP_TREE_DEF_TYPE (node) == vect_constant_def)
            kind = vector_load;
          else if (vect_scalar_ops_slice { ops, start, nelt_limit }.all_same_p ())
            kind = scalar_to_vec;
          else
            kind = vec_construct;
          record_stmt_cost (cost_vec, 1, kind, node, vectype, 0, vect_prologue);
        }

Alternatively, we could pass down 'start' as well.  The x86 backend code
could also detect the mismatch between TYPE_VECTOR_SUBPARTS * count and the
number of SLP lanes (though I'm not sure what it should do in that case).

Note that we currently can't meaningfully put such a split set of SLP nodes
into the SLP graph, but in the end we might want to move toward splitting it
into individual vector ops, especially for load/store vectorization and
interleaving.

Short-term, passing down 'start' (and only interpreting it when count is
one?) might be easiest.  Any opinions?
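To make the "pass down 'start'" option and the lane-mismatch check above concrete, here is a small self-contained C++ model. It is not GCC code. CostKind, CostEntry, prologue_costs, entry_cost and lanes_mismatch are hypothetical names, and the one-insert-per-lane cost is a made-up model. It classifies each vector-sized slice of a node's scalar operands the way the quoted loop does, records 'start' with each entry, and shows how a target hook that only interprets 'start' when count is one can cost a single vector instead of the whole node. It also does not reproduce how GCC actually computes 'starts'; it simply visits every slice in order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the vector_load / scalar_to_vec / vec_construct
// cost kinds; these are not GCC's real types.
enum class CostKind { VectorLoad, ScalarToVec, VecConstruct };

// One record_stmt_cost-style entry.  'start' is the extra information the
// comment proposes passing down to the target cost hook.
struct CostEntry {
  unsigned count;
  CostKind kind;
  std::size_t start;
};

// Classify each nelt_limit-sized slice of the node's scalar operands like the
// quoted loop, recording one entry (count == 1) per slice.
// Simplification: every slice is visited in order.
std::vector<CostEntry> prologue_costs(const std::vector<int>& ops,
                                      std::size_t nelt_limit,
                                      bool constant_def) {
  assert(nelt_limit > 0);
  std::vector<CostEntry> entries;
  for (std::size_t start = 0; start < ops.size(); start += nelt_limit) {
    CostKind kind;
    if (constant_def) {
      kind = CostKind::VectorLoad;
    } else {
      std::size_t end = std::min(start + nelt_limit, ops.size());
      bool all_same = true;
      for (std::size_t i = start + 1; i < end; ++i)
        all_same = all_same && ops[i] == ops[start];
      kind = all_same ? CostKind::ScalarToVec : CostKind::VecConstruct;
    }
    entries.push_back({1, kind, start});
  }
  return entries;
}

// The mismatch a backend could detect: the costed vectors (nunits * count
// lanes) do not cover exactly the node's SLP lanes.
bool lanes_mismatch(std::size_t nunits, unsigned count, std::size_t lanes) {
  return nunits * count != lanes;
}

// Hypothetical target cost under a made-up "one insert per lane" model.
// With use_start and count == 1 only the lanes of that one vector are costed;
// otherwise the hook can only look at every lane of the node, so each
// per-vector entry is charged for the whole node.
std::size_t entry_cost(const std::vector<int>& ops, const CostEntry& e,
                       std::size_t nunits, bool use_start) {
  if (e.kind != CostKind::VecConstruct)
    return e.count;
  if (use_start && e.count == 1)
    return std::min(e.start + nunits, ops.size()) - e.start;
  return ops.size();
}
```

For example, with ops = {7, 7, 7, 7, 1, 2, 3, 4} and nelt_limit = 4, prologue_costs yields a scalar_to_vec-style entry at start 0 and a vec_construct-style entry at start 4. entry_cost charges the second entry 4 when it can use 'start' and 8 when it has to look at the whole node, and lanes_mismatch(4, 1, 8) is true.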