https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031
Bug ID: 117031
Summary: increasing VF during SLP vectorization permutes unnecessarily
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
The following testcase:
---
void
test1 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    {
      unsigned short a = x[i * 4 + 0];
      unsigned short b = x[i * 4 + 1];
      unsigned short c = x[i * 4 + 2];
      unsigned short d = x[i * 4 + 3];
      y[i] = (double)a + (double)b + (double)c + (double)d;
    }
}
---
at -O3 vectorizes using LOAD_LANES on aarch64:
vect_array.11 = .LOAD_LANES (MEM <short unsigned int[32]> [(short unsigned int *)vectp_x.9_123]);
vect_a_29.12_125 = vect_array.11[0];
vect__14.17_129 = [vec_unpack_lo_expr] vect_a_29.12_125;
vect__14.17_130 = [vec_unpack_hi_expr] vect_a_29.12_125;
vect__14.16_131 = [vec_unpack_lo_expr] vect__14.17_129;
vect__14.16_132 = [vec_unpack_hi_expr] vect__14.17_129;
vect__14.16_133 = [vec_unpack_lo_expr] vect__14.17_130;
vect__14.16_134 = [vec_unpack_hi_expr] vect__14.17_130;
vect__14.18_135 = (vector(2) double) vect__14.16_131;
vect__14.18_136 = (vector(2) double) vect__14.16_132;
vect__14.18_137 = (vector(2) double) vect__14.16_133;
vect__14.18_138 = (vector(2) double) vect__14.16_134;
...
This happens because the input is a group of 4 shorts, so V4HI is the natural
vector size. V4HI fails to vectorize because we don't support direct
conversion from V4HI to V4SI.
We then pick a higher VF (V8HI) and the loads are detected as interleaving.
LLVM, however, avoids the permute here by detecting that the unrolling doesn't
result in a permuted access, as it's equivalent to:
---
void
test3 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i+=2)
    {
      unsigned short a1 = x[i * 4 + 0];
      unsigned short b1 = x[i * 4 + 1];
      unsigned short c1 = x[i * 4 + 2];
      unsigned short d1 = x[i * 4 + 3];
      y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
      unsigned short a2 = x[i * 4 + 4];
      unsigned short b2 = x[i * 4 + 5];
      unsigned short c2 = x[i * 4 + 6];
      unsigned short d2 = x[i * 4 + 7];
      y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
    }
}
---
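The equivalence of the two loops can be checked with a small harness. This is
only a sketch: same_results and the fill pattern are made up for illustration
and are not part of the report, and it assumes n is even, since test3 advances
two iterations at a time.

```c
#include <assert.h>
#include <stdlib.h>

/* test1 from the report: one group of four shorts per iteration.  */
static void
test1 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    {
      unsigned short a = x[i * 4 + 0];
      unsigned short b = x[i * 4 + 1];
      unsigned short c = x[i * 4 + 2];
      unsigned short d = x[i * 4 + 3];
      y[i] = (double)a + (double)b + (double)c + (double)d;
    }
}

/* test3 from the report: the same loop unrolled by two.  */
static void
test3 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      unsigned short a1 = x[i * 4 + 0];
      unsigned short b1 = x[i * 4 + 1];
      unsigned short c1 = x[i * 4 + 2];
      unsigned short d1 = x[i * 4 + 3];
      y[i + 0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
      unsigned short a2 = x[i * 4 + 4];
      unsigned short b2 = x[i * 4 + 5];
      unsigned short c2 = x[i * 4 + 6];
      unsigned short d2 = x[i * 4 + 7];
      y[i + 1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
    }
}

/* Hypothetical helper (not part of the report): returns 1 when test1 and
   test3 write identical results for N iterations, 0 otherwise or on
   allocation failure.  N must be positive and even.  */
int
same_results (int n)
{
  assert (n > 0 && n % 2 == 0);

  unsigned short *x = malloc ((size_t) n * 4 * sizeof *x);
  double *y1 = malloc ((size_t) n * sizeof *y1);
  double *y3 = malloc ((size_t) n * sizeof *y3);
  int ok = x && y1 && y3;

  if (ok)
    {
      /* Arbitrary input pattern, truncated to 16 bits.  */
      for (int i = 0; i < n * 4; i++)
        x[i] = (unsigned short) (i * 7919u + 13u);

      test1 (x, y1, n);
      test3 (x, y3, n);

      /* Both loops add the same four values in the same order, so the
         doubles should compare exactly equal.  */
      for (int i = 0; i < n; i++)
        if (y1[i] != y3[i])
          ok = 0;
    }

  free (x);
  free (y1);
  free (y3);
  return ok;
}
```

Calling same_results (64) from a main function should return 1 if the two
loops agree.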
GCC seems to miss that there is no gap between the group accesses and that
stride == 1.
test3 is vectorized linearly by GCC, so it seems this is a missed optimization
in data ref analysis?
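To illustrate the "no gap" observation, here is a minimal model of the access
pattern. It is not GCC's data-ref analysis. The function name and its
parameters (group_size, step, vf) are hypothetical and only mirror the
x[i * step + lane] indexing of the testcase.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a loop reads x[i * step + lane] for
   lane = 0 .. group_size - 1.  Returns true when VF consecutive
   iterations touch indices 0, 1, 2, ... in order with no gap, i.e. when
   unrolling by VF yields one linear access rather than a permuted one.  */
bool
unrolled_group_is_contiguous (int group_size, int step, int vf)
{
  assert (group_size > 0 && step > 0 && vf > 0);

  int expected = 0;
  for (int iter = 0; iter < vf; iter++)
    for (int lane = 0; lane < group_size; lane++)
      {
        int index = iter * step + lane;
        if (index != expected)
          return false;   /* gap or overlap between groups */
        expected++;
      }
  return true;
}
```

For the testcase (group_size = 4, step = 4, vf = 2) this model returns true,
while a loop that read only three of every four elements (group_size = 3,
step = 4) would return false.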