O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

rguenth at gcc dot gnu.org Mon, 14 Dec 2015 07:49:04 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707


--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #12)
> On Fri, 11 Dec 2015, alalaw01 at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
> > 
> > --- Comment #10 from alalaw01 at gcc dot gnu.org ---
> > slp-perm-11.c: a 16-iteration loop gets unrolled *2, and now uses an st2, but
> > no load_lanes, just a bunch of ldr's: 10 rather than the original 3(*2). 3 strs
> > become 4 stp's (+st2). Doesn't look like an improvement!
> 
> I'll look at this case as I intended to make sure using load-lanes is
> possible.  Hopefully reproduces with a cross ;)

Ok, so the reason is the load is strided.  Interleaving will end up
loading individual elements while SLP will possibly load sub-vectors
at a time:

  vector(2) int _22;
  vector(2) int _24;
  vector(4) int vect_cst__26;
...
  <bb 3>:
  # ivtmp.19_16 = PHI <ivtmp.19_17(3), ivtmp.19_4(2)>
  # ivtmp.22_6 = PHI <ivtmp.22_7(3), ivtmp.22_8(2)>
  _10 = (void *) ivtmp.19_16;
  _22 = MEM[base: _10, offset: 0B];
  _24 = MEM[base: _10, index: _2, offset: 0B];
  vect_cst__26 = {_22, _24};
  vect__8.14_27 = VEC_PERM_EXPR <vect_cst__26, vect_cst__26, { 1, 0, 3, 2 }>;
  _11 = (void *) ivtmp.22_6;
  MEM[base: _11, offset: 0B] = vect__8.14_27;
  ivtmp.19_17 = ivtmp.19_16 + _19;
  ivtmp.22_7 = ivtmp.22_6 + 16;
  if (ivtmp.22_7 != _15)
    goto <bb 3>;

This loads two two-element vectors and concatenates them (and then permutes
them accordingly, of course).
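
For readers who do not read GIMPLE dumps every day, here is a rough scalar
model in C of what one iteration of the loop body above does.  It is only a
sketch: the function name strided_slp_step and the row_stride parameter
(standing in for the index _2 in the dump) are made up for illustration, and
this is not the source of slp-perm-11.c or O3-pr36098.c.

```c
#include <stddef.h>
#include <string.h>

/* Scalar model of one iteration of the vectorized loop body in the dump.
   'src' models _10, 'row_stride' (in bytes) models the index _2, and
   'dst' models _11.  All names here are hypothetical. */
static void strided_slp_step(const char *src, ptrdiff_t row_stride, int *dst)
{
    /* Selector from the dump: VEC_PERM_EXPR <v, v, { 1, 0, 3, 2 }>. */
    static const int sel[4] = { 1, 0, 3, 2 };
    int lo[2], hi[2], cat[4];
    int i;

    /* _22 = MEM[base: _10, offset: 0B];  a two-element sub-vector load. */
    memcpy(lo, src, sizeof lo);
    /* _24 = MEM[base: _10, index: _2, offset: 0B];  second sub-vector,
       one stride further on. */
    memcpy(hi, src + row_stride, sizeof hi);

    /* vect_cst__26 = {_22, _24};  concatenate the two halves. */
    cat[0] = lo[0];
    cat[1] = lo[1];
    cat[2] = hi[0];
    cat[3] = hi[1];

    /* Both permute inputs are the same vector, so every selector value
       indexes into 'cat' directly. */
    for (i = 0; i < 4; i++)
        dst[i] = cat[sel[i]];
}
```

The point of the model is that each strided access fetches a whole
two-element sub-vector, where an interleaving scheme would end up loading the
elements individually.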

I'd like to make the strided case explicit, either by not dumping the SLP
or by dumping the SLP anyway (ignoring the strided case which will remain
strided but possibly less efficient).

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
