https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105053
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
             Status|NEW                           |ASSIGNED

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
One notable difference is that the first loop is detected to require peeling
for gaps while the second one is not (probably an artifact of the low trip
count). The second is that the first loop is detected as reduction path
while the second one as reduction chain.

OK, so I think I see what goes wrong. We elided the load permutation but
the load is still biased wrongly.

  vectp.67_112 = _93 + 8;

  <bb 12> [local count: 405853744]:
  # i_98 = PHI <i_44(21), 0(11)>
  # prephitmp_7 = PHI <prephitmp_97(21), 0(11)>
  # ivtmp_31 = PHI <ivtmp_37(21), 4(11)>
  # vectp.66_105 = PHI <vectp.66_68(21), vectp.67_112(11)>
  # vect_prephitmp_7.71_61 = PHI <vect__26.72_62(21), { 0, 0, 0, 0 }(11)>
  # ivtmp_58 = PHI <ivtmp_81(21), 0(11)>
  _3 = (long unsigned int) i_98;
  _59 = _3 * 16;
  _60 = _93 + _59;
  _106 = MEM <vector(2) int> [(const int &)vectp.66_105];
  vect__54.68_113 = {_106, { 0, 0 }};
  vectp.66_95 = vectp.66_105 + 16;
  _89 = MEM <vector(2) int> [(const int &)vectp.66_95];
  vect__54.69_90 = {_89, { 0, 0 }};
  vect__51.70_78 = VEC_PERM_EXPR <vect__54.68_113, vect__54.69_90, { 0, 1, 4, 5 }>;

possibly because the SLP representative is unchanged when we transform

t.C:17:16: note:   node 0x3382280 (max_nunits=4, refcnt=2) const vector(4) int
t.C:17:16: note:   op template: _54 = MEM[(const int &)_60 + 12];
t.C:17:16: note:        stmt 0 _54 = MEM[(const int &)_60 + 12];
t.C:17:16: note:        stmt 1 _51 = MEM[(const int &)_60 + 8];
t.C:17:16: note:        load permutation { 1 0 }

into

t.C:17:16: note:   node 0x3382280 (max_nunits=4, refcnt=1) const vector(4) int
t.C:17:16: note:   op template: _54 = MEM[(const int &)_60 + 12];
t.C:17:16: note:        stmt 0 _51 = MEM[(const int &)_60 + 8];
t.C:17:16: note:        stmt 1 _54 = MEM[(const int &)_60 + 12];
t.C:17:16: note:        load permutation { 0 1 }

during SLP optimize.
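
For readers less familiar with the GIMPLE in the dump above, the following is a minimal C++ sketch of what the two zero-padded loads and the final VEC_PERM_EXPR with selector { 0, 1, 4, 5 } compute. It is only an illustrative model, not GCC code: the names V4, vec_perm and widen are hypothetical helpers introduced here, and the sketch says nothing about which start address would be the correct one for the load.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for GIMPLE's "vector(4) int".
using V4 = std::array<int, 4>;

// Simplified model of VEC_PERM_EXPR <v0, v1, sel>: lane i of the result is
// element sel[i] of the 8-element concatenation of v0 and v1.
// Assumes every selector value is in the range [0, 7].
V4 vec_perm(const V4& v0, const V4& v1, const std::array<std::size_t, 4>& sel) {
    V4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = sel[i] < 4 ? v0[sel[i]] : v1[sel[i] - 4];
    }
    return out;
}

// Models the constructor {_106, { 0, 0 }}: a 2-lane load widened to
// 4 lanes by padding the upper half with zeros.
V4 widen(int lo, int hi) {
    return V4{lo, hi, 0, 0};
}
```

With widen(a, b) and widen(c, d) as the two inputs, the selector { 0, 1, 4, 5 } yields { a, b, c, d }: the low halves of both widened vectors are concatenated and the zero padding is dropped. Which memory elements end up in a, b, c and d depends entirely on the initial pointer (vectp.67_112 = _93 + 8 in the dump), which is the part the comment suspects of being biased wrongly.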