https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82151
            Bug ID: 82151
           Summary: Autovectorization for insertion is slower than done manually
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
void f(float *restrict a, float * restrict b, float * restrict c, int s)
{
  for(int i = 0; i< s;i++)
    {
      c[i*4] = a[i*2];
      c[i*4+1] = a[i*2+1];
      c[i*4+2] = b[i*2];
      c[i*4+3] = b[i*2+1];
    }
}

--- CUT ---
This currently vectorizes using 2xld2 followed by st4.
On some (most?) micro-architectures, either not vectorizing or vectorizing with 64-bit vectors (2xS):
  ldr d0, [a, index]
  ldr d1, [b, index]
  stp d0, d1, [c, index]
is better.

Or even:
  ldr d0, [a, index]
  ldr d1, [b, index]
  ins v0.2d[1], d1
  str q0, [c, index]
is better than using ld2/st4.

That is, just do SLP vectorization and not loop-aware SLP here.
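
For illustration only, here is a minimal portable C sketch of the "done manually" idea: each iteration moves one pair of floats from a and one pair from b as two 8-byte units, which is the data movement the ldr d0 / ldr d1 / stp sequence above performs. The names f_ref, f_pairs and check_equivalent are hypothetical and not part of the report. Whether a given GCC version actually emits ldr/stp for f_pairs is an assumption that has not been verified here; the sketch only shows that the pairwise copy produces the same result as the original loop.

```c
#include <assert.h>
#include <string.h>

/* The loop from the report, unchanged apart from the name. */
void f_ref(float *restrict a, float *restrict b, float *restrict c, int s)
{
  for (int i = 0; i < s; i++)
    {
      c[i*4]   = a[i*2];
      c[i*4+1] = a[i*2+1];
      c[i*4+2] = b[i*2];
      c[i*4+3] = b[i*2+1];
    }
}

/* Hypothetical hand-written variant: copy each pair of floats as a
   single 8-byte unit (two 64-bit loads, two 64-bit stores per
   iteration).  memcpy with a constant size is used so the code stays
   valid C without type punning.  */
void f_pairs(const float *restrict a, const float *restrict b,
             float *restrict c, int s)
{
  for (int i = 0; i < s; i++)
    {
      memcpy(&c[i*4],     &a[i*2], 2 * sizeof(float));
      memcpy(&c[i*4 + 2], &b[i*2], 2 * sizeof(float));
    }
}

/* Small self-check: both versions must fill c identically.  */
void check_equivalent(void)
{
  float a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  float b[8] = { 11, 12, 13, 14, 15, 16, 17, 18 };
  float c1[16] = { 0 };
  float c2[16] = { 0 };

  f_ref(a, b, c1, 4);
  f_pairs(a, b, c2, 4);
  assert(memcmp(c1, c2, sizeof c1) == 0);
}
```

Comparing the assembly generated for f_ref and f_pairs at -O3 on aarch64 (for example with -S) would be one way to see whether the vectorizer picks ld2/st4 for the first and plain 64-bit loads and stores for the second.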