https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82151

            Bug ID: 82151
           Summary: Autovectorization for insertion is slower than done
                    manually
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
void f(float *restrict a, float * restrict b, float * restrict c, int s)
{
  for(int i = 0; i< s;i++)
    {
      c[i*4] = a[i*2];
      c[i*4+1] = a[i*2+1];
      c[i*4+2] = b[i*2];
      c[i*4+3] = b[i*2+1];
    }
}

--- CUT ---
This currently vectorizes using 2xld2 followed by st4.  On some (most?)
micro-architectures, it is better either not to vectorize at all or to
vectorize using 64-bit vectors (2xS):
ldr d0, [a, index]
ldr d1, [b, index]
stp d0, d1, [c, index]

Or even:
ldr d0, [a, index]
ldr d1, [b, index]
ins v0.d[1], v1.d[0]
str q0, [c, index]
is better than using ld2/st4.

That is, just do plain SLP vectorization here, not loop-aware SLP.
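
For illustration, the 64-bit pairwise copy described above can be sketched in
portable C.  This is only a sketch under assumptions: f_pairs is a
hypothetical name that is not part of the testcase, and whether a given GCC
version lowers each fixed 8-byte memcpy to a single ldr d/str d (or merges the
two stores into an stp) is not guaranteed.

```c
#include <string.h>

/* Hypothetical variant of f (not from the original testcase): each
   iteration moves two floats from a and two floats from b as 8-byte
   chunks, mirroring the ldr d0 / ldr d1 / stp d0, d1 sequence.  */
void f_pairs(float *restrict a, float *restrict b, float *restrict c, int s)
{
  for (int i = 0; i < s; i++)
    {
      /* c[4i], c[4i+1] <- a[2i], a[2i+1]  */
      memcpy(&c[i * 4], &a[i * 2], 2 * sizeof(float));
      /* c[4i+2], c[4i+3] <- b[2i], b[2i+1]  */
      memcpy(&c[i * 4 + 2], &b[i * 2], 2 * sizeof(float));
    }
}
```

This produces the same contents in c as f does; it only makes the two-element
grouping explicit in the source.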
