https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118028
Bug ID: 118028
Summary: A better vectorized reduction across multi-level loop-nest
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
Consider the following case:
int foo(const int *array)
{
  int sum = 0;

#pragma GCC unroll 0
  for (int i = 0; i < 32; ++i) {
#pragma GCC unroll 0
    for (int j = 0; j < 32; ++j) {
      sum += array[i * 64 + j];
    }
  }
  return sum;
}
The accumulation into "sum" is vectorized in the inner loop. Because the
outer loop remains scalar, the current approach reduces the vectorized "sum"
back to a scalar via .REDUC_PLUS each time execution returns to the outer
loop, as in:
  int sum = 0;

  for (int i = 0; i < 32; ++i) {
    vector(4) int v_sum = { sum, 0, 0, 0 };

    for (int j = 0; j < 32; j += 4) {
      v_sum += *(vector(4) int *)(&array[i * 64 + j]);
    }
    sum = .REDUC_PLUS(v_sum);
  }
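The per-iteration reduction can be written out in plain C with GCC's generic
vector extensions. This is only a sketch for illustration, not what the
vectorizer literally emits: the names v4si, reduc_plus, foo_reduce_per_row,
foo_scalar and check_reduce_per_row are hypothetical, reduc_plus is a
hand-written stand-in for .REDUC_PLUS, and the loads use memcpy because the
alignment of "array" is not known.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang generic vector type: four 32-bit ints (assumed name). */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical stand-in for .REDUC_PLUS: sum of all four lanes. */
static int reduc_plus(v4si v)
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Scalar reference: same loop nest as foo() in the report. */
int foo_scalar(const int *array)
{
    int sum = 0;
    for (int i = 0; i < 32; ++i)
        for (int j = 0; j < 32; ++j)
            sum += array[i * 64 + j];
    return sum;
}

/* Inner loop vectorized; one horizontal reduction per outer iteration. */
int foo_reduce_per_row(const int *array)
{
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        /* The running scalar total is carried into lane 0. */
        v4si v_sum = { sum, 0, 0, 0 };
        for (int j = 0; j < 32; j += 4) {
            v4si chunk;
            memcpy(&chunk, &array[i * 64 + j], sizeof chunk);
            v_sum += chunk;
        }
        /* Vector-to-scalar reduction, executed 32 times in total. */
        sum = reduc_plus(v_sum);
    }
    return sum;
}

/* Self-check on synthetic data small enough to avoid overflow. */
void check_reduce_per_row(void)
{
    static int data[2016]; /* highest index read is 31 * 64 + 31 = 2015 */
    for (int k = 0; k < 2016; ++k)
        data[k] = (k % 13) - 6;
    assert(foo_reduce_per_row(data) == foo_scalar(data));
}
```

Calling check_reduce_per_row() from a test driver confirms that this form
matches the scalar loop on the sample data.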
Because "sum" has no other use than carrying its value into the next round of
accumulation in the inner loop, the vector-to-scalar conversion of "sum" in
the outer loop is not really needed. A more efficient approach is to move the
reduction to a point after the exit of the whole loop nest, as in:
  vector(4) int v_sum = { 0, 0, 0, 0 };

  for (int i = 0; i < 32; ++i) {
    for (int j = 0; j < 32; j += 4) {
      v_sum += *(vector(4) int *)(&array[i * 64 + j]);
    }
  }
  sum = .REDUC_PLUS(v_sum);
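As a rough illustration of the proposed form, the sketch below keeps the
accumulator in a vector across the whole nest and reduces once at the end,
using GCC's generic vector extensions. It assumes that reassociating the
integer additions is acceptable, which is what vectorizing the inner loop
already relies on. The names v4si, reduc_plus, foo_reduce_once, foo_scalar
and check_reduce_once are hypothetical, and reduc_plus is a hand-written
stand-in for .REDUC_PLUS.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang generic vector type: four 32-bit ints (assumed name). */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical stand-in for .REDUC_PLUS: sum of all four lanes. */
static int reduc_plus(v4si v)
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Scalar reference: same loop nest as foo() in the report. */
int foo_scalar(const int *array)
{
    int sum = 0;
    for (int i = 0; i < 32; ++i)
        for (int j = 0; j < 32; ++j)
            sum += array[i * 64 + j];
    return sum;
}

/* Vector accumulator lives across both loops; reduce once after the nest. */
int foo_reduce_once(const int *array)
{
    v4si v_sum = { 0, 0, 0, 0 };
    for (int i = 0; i < 32; ++i) {
        for (int j = 0; j < 32; j += 4) {
            v4si chunk;
            /* memcpy load: alignment of array is not assumed. */
            memcpy(&chunk, &array[i * 64 + j], sizeof chunk);
            v_sum += chunk;
        }
    }
    /* Single vector-to-scalar reduction instead of one per outer iteration. */
    return reduc_plus(v_sum);
}

/* Self-check on synthetic data small enough to avoid overflow. */
void check_reduce_once(void)
{
    static int data[2016]; /* highest index read is 31 * 64 + 31 = 2015 */
    for (int k = 0; k < 2016; ++k)
        data[k] = (k % 13) - 6;
    assert(foo_reduce_once(data) == foo_scalar(data));
}
```

For this loop nest the hoisted form performs one horizontal reduction where
the per-iteration form performs 32, and it removes the scalar-to-vector
re-initialization of the accumulator from the outer loop body.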