https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106081

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So what's interesting is that we now get as of r14-2117-gdd86a5a69cbda4 the
following.  The odd thing is that we fail to eliminate the load permutation
{ 3 2 1 0 } even though this is a reduction group.

I _suppose_ the reason is the { 0 0 0 0 } load permutation (the "splat")
which we don't "support".  In vect_optimize_slp_pass::start_choosing_layouts
there's

      if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
        {
          /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
             unpermuted, record a layout that reverses this permutation.

             We would need more work to cope with loads that are internally
             permuted and also have inputs (such as masks for
             IFN_MASK_LOADs).  */
          gcc_assert (partition.layout == 0 && !m_slpg->vertices[node_i].succ);
          if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
            continue;

which means we'll keep the permute there (well, that's OK - any permute
of the permute will retain it ...).

I suspect this prevents the optimization here.  Massaging
start_choosing_layouts to allow a splat on element zero for a non-grouped
access breaks things as we try to move that permute.  So I guess this needs
a new kind of layout constraint?  The permute can absorb any permute but
we cannot "move" it.

Richard?

t.c:14:18: note:   === scheduling SLP instances ===
t.c:14:18: note:   Vectorizing SLP tree:
t.c:14:18: note:   node 0x4304170 (max_nunits=16, refcnt=2) vector(4) double
t.c:14:18: note:   op template: _21 = _20 + results$d_60;
t.c:14:18: note:        stmt 0 _21 = _20 + results$d_60;
t.c:14:18: note:        stmt 1 _17 = _16 + results$c_58;
t.c:14:18: note:        stmt 2 _13 = _12 + results$b_56;
t.c:14:18: note:        stmt 3 _9 = _8 + results$a_54;
t.c:14:18: note:        children 0x43041f8 0x4304418
t.c:14:18: note:   node 0x43041f8 (max_nunits=16, refcnt=1) vector(4) double
t.c:14:18: note:   op template: _20 = _1 * _19;
t.c:14:18: note:        stmt 0 _20 = _1 * _19;
t.c:14:18: note:        stmt 1 _16 = _1 * _15;
t.c:14:18: note:        stmt 2 _12 = _1 * _11;
t.c:14:18: note:        stmt 3 _8 = _1 * _7;
t.c:14:18: note:        children 0x4304280 0x4304308
t.c:14:18: note:   node 0x4304280 (max_nunits=4, refcnt=1) vector(4) double
t.c:14:18: note:   op template: _1 = *k_50;
t.c:14:18: note:        stmt 0 _1 = *k_50;
t.c:14:18: note:        stmt 1 _1 = *k_50;
t.c:14:18: note:        stmt 2 _1 = *k_50;
t.c:14:18: note:        stmt 3 _1 = *k_50;
t.c:14:18: note:        load permutation { 0 0 0 0 }
t.c:14:18: note:   node 0x4304308 (max_nunits=16, refcnt=1) vector(4) double
t.c:14:18: note:   op template: _19 = (double) _18;
t.c:14:18: note:        stmt 0 _19 = (double) _18;
t.c:14:18: note:        stmt 1 _15 = (double) _14;
t.c:14:18: note:        stmt 2 _11 = (double) _10;
t.c:14:18: note:        stmt 3 _7 = (double) _6;
t.c:14:18: note:        children 0x4304390
t.c:14:18: note:   node 0x4304390 (max_nunits=16, refcnt=1) vector(16) short int
t.c:14:18: note:   op template: _18 = _5->d;
t.c:14:18: note:        stmt 0 _18 = _5->d;
t.c:14:18: note:        stmt 1 _14 = _5->c;
t.c:14:18: note:        stmt 2 _10 = _5->b;
t.c:14:18: note:        stmt 3 _6 = _5->a;
t.c:14:18: note:        load permutation { 3 2 1 0 }
t.c:14:18: note:   node 0x4304418 (max_nunits=4, refcnt=1) vector(4) double
t.c:14:18: note:   op template: results$d_60 = PHI <_21(5), 0.0(6)>
t.c:14:18: note:        stmt 0 results$d_60 = PHI <_21(5), 0.0(6)>
t.c:14:18: note:        stmt 1 results$c_58 = PHI <_17(5), 0.0(6)>
t.c:14:18: note:        stmt 2 results$b_56 = PHI <_13(5), 0.0(6)>
t.c:14:18: note:        stmt 3 results$a_54 = PHI <_9(5), 0.0(6)>
t.c:14:18: note:        children 0x4304170 (nil)
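
To make the "absorb but not move" remark concrete, here is a toy model of the two load permutations in the dump. It is only a sketch and not how vect_optimize_slp_pass represents layouts: Perm, apply_perm, compose, is_bijection and mul_lanes are hypothetical names made up for this illustration. The assumption is that a permutation is a list of source lane indices, written as in the dump, and that "moving" a permute requires a layout that reverses it, as the quoted source comment says.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only (not GCC code): a permutation lists, for each result
// lane, the source lane it reads, e.g. { 3 2 1 0 } or { 0 0 0 0 }.
using Perm = std::vector<unsigned>;

// Result lane i takes lanes[perm[i]].
template <typename T>
std::vector<T> apply_perm(const std::vector<T> &lanes, const Perm &perm)
{
  std::vector<T> out;
  out.reserve(perm.size());
  for (unsigned src : perm)
    {
      assert(src < lanes.size());
      out.push_back(lanes[src]);
    }
  return out;
}

// Single permutation equivalent to applying FIRST and then SECOND:
// lane i of the result reads first[second[i]].
Perm compose(const Perm &first, const Perm &second)
{
  return apply_perm(first, second);
}

// A permutation can only be undone by a reversing layout if every
// source lane is used exactly once.
bool is_bijection(const Perm &perm)
{
  std::vector<bool> seen(perm.size(), false);
  for (unsigned src : perm)
    {
      if (src >= perm.size() || seen[src])
        return false;
      seen[src] = true;
    }
  return true;
}

// Lane-wise multiply, standing in for the "_20 = _1 * _19" node.
std::vector<double> mul_lanes(const std::vector<double> &a,
                              const std::vector<double> &b)
{
  assert(a.size() == b.size());
  std::vector<double> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = a[i] * b[i];
  return out;
}
```

In this model, compose ({ 0 0 0 0 }, { 3 2 1 0 }) is again { 0 0 0 0 }, so the splat absorbs whatever lane order its consumer picks. On the other hand, is_bijection ({ 0 0 0 0 }) is false, so there is no layout that reverses it, which is one way to read "we cannot move it". The reversal { 3 2 1 0 } is a bijection and is its own inverse, and since each lane of the reduction group is an independent accumulator, multiplying the splat by the reversed load gives the same values as multiplying by the unpermuted load and reversing the result.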