https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #12)
> On Fri, 11 Dec 2015, alalaw01 at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
> >
> > --- Comment #10 from alalaw01 at gcc dot gnu.org ---
> > slp-perm-11.c: a 16-iteration loop gets unrolled *2, and now uses an st2, but
> > no load_lanes, just a bunch of ldr's: 10 rather than the original 3(*2). 3 strs
> > become 4 stp's (+st2). Doesn't look like an improvement!
>
> I'll look at this case as I intended to make sure using load-lanes is
> possible.  Hopefully reproduces with a cross ;)

Ok, so the reason is the load is strided.  Interleaving will end up
loading individual elements while SLP will possibly load sub-vectors
at a time:

  vector(2) int _22;
  vector(2) int _24;
  vector(4) int vect_cst__26;
...
  <bb 3>:
  # ivtmp.19_16 = PHI <ivtmp.19_17(3), ivtmp.19_4(2)>
  # ivtmp.22_6 = PHI <ivtmp.22_7(3), ivtmp.22_8(2)>
  _10 = (void *) ivtmp.19_16;
  _22 = MEM[base: _10, offset: 0B];
  _24 = MEM[base: _10, index: _2, offset: 0B];
  vect_cst__26 = {_22, _24};
  vect__8.14_27 = VEC_PERM_EXPR <vect_cst__26, vect_cst__26, { 1, 0, 3, 2 }>;
  _11 = (void *) ivtmp.22_6;
  MEM[base: _11, offset: 0B] = vect__8.14_27;
  ivtmp.19_17 = ivtmp.19_16 + _19;
  ivtmp.22_7 = ivtmp.22_6 + 16;
  if (ivtmp.22_7 != _15)
    goto <bb 3>;

loads two two-element vectors and concats them (and then permutes them
accordingly of course).

I'd like to make the strided case explicit, either by not dumping the
SLP or by dumping the SLP anyway (ignoring the strided case which will
remain strided but possibly less efficient).
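
For readers who do not read GIMPLE daily, here is a rough scalar model in C of what the loop body in the dump does. It is only a sketch, not the testcase and not vectorizer output: the function name and all parameter names are made up for illustration, and the strides are expressed in int elements, whereas _2 and _19 in the dump are byte offsets whose values are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of <bb 3> above.
   src_stride  : distance, in ints, between the two 2-element loads
                 (plays the role of _2; an assumption, given in elements).
   iter_stride : per-iteration advance of the source pointer, in ints
                 (plays the role of _19; also an assumption).
   Each iteration writes 4 ints, matching ivtmp.22 advancing by 16 bytes. */
void strided_slp_model(int *dst, const int *src,
                       size_t src_stride, size_t iter_stride, size_t iters)
{
    /* Selector of the VEC_PERM_EXPR: { 1, 0, 3, 2 }. */
    static const int perm[4] = { 1, 0, 3, 2 };

    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < iters; i++) {
        int v[4];

        /* _22: first two-element sub-vector load. */
        memcpy(&v[0], src, 2 * sizeof(int));
        /* _24: second two-element sub-vector load, at a strided offset. */
        memcpy(&v[2], src + src_stride, 2 * sizeof(int));

        /* vect_cst__26 = {_22, _24}, permuted and stored as one vector. */
        for (int k = 0; k < 4; k++)
            dst[k] = v[perm[k]];

        src += iter_stride;
        dst += 4;
    }
}
```

The point of the model is that each iteration issues two sub-vector loads instead of four element loads, while the access pattern across iterations is still strided.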