https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117738
Bug ID: 117738
Summary: Failure to recognize dot-product pattern in inner loop
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
Take a two-level loop-nest:
void foo(int8_t *__restrict__ A, int8_t *__restrict__ B,
         int32_t *__restrict__ sum, int n, int m)
{
  for (int i = 0; i < n; ++i) {
    int8_t a = A[i];
    for (int j = 0; j < m; j++) {
      int8_t b = B[T_FN(j) + i];
      sum[j] += a * b;
    }
  }
}
Suppose T_FN() is a pure mathematical function of j. For simple forms of
T_FN(), gcc can already vectorize the inner loop independently of the outer
one, but the result is far from optimal. If we instead consider the loop nest
as a whole and unroll the outer loop by an appropriate VF (for example, VF=8
for a 128-bit vectorization width), the accumulate statement in the inner loop
fits the more compact dot-product pattern, as follows (the leftover epilogue
loop is omitted):
for (int i = 0; i < n; i += 8) {
  <vector(8) int8_t> v_a = LOAD<vector(8) int8_t>(&A[i]);
  for (int j = 0; j < m; j++) {
    <vector(8) int8_t> v_b = LOAD<vector(8) int8_t>(&B[T_FN(j) + i]);
    sum[j] += DOT_PROD(v_a * v_b);
  }
}
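
For reference, a scalar C sketch of the transformation above, including the
epilogue loop that the pseudocode omits. Everything here beyond foo() is an
assumption made for illustration: T_FN() is modelled as a plain row stride
(STRIDE is a made-up constant), dot_prod8() is a scalar stand-in for DOT_PROD
on 8 int8_t lanes, and foo_ref(), foo_unrolled() and results_match() are
hypothetical names. It only shows that the two forms compute the same sums; it
is not what the vectorizer would emit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for T_FN(): a plain row stride. The report only
   requires T_FN() to be a pure function of j. */
enum { STRIDE = 32 };
static int T_FN(int j) { return j * STRIDE; }

/* The original loop nest from the report. */
static void foo_ref(const int8_t *restrict A, const int8_t *restrict B,
                    int32_t *restrict sum, int n, int m)
{
  for (int i = 0; i < n; ++i) {
    int8_t a = A[i];
    for (int j = 0; j < m; j++)
      sum[j] += a * B[T_FN(j) + i];
  }
}

/* Scalar model of DOT_PROD on 8 int8_t lanes: widen, multiply, reduce. */
static int32_t dot_prod8(const int8_t *a, const int8_t *b)
{
  int32_t acc = 0;
  for (int k = 0; k < 8; k++)
    acc += (int32_t)a[k] * (int32_t)b[k];
  return acc;
}

/* Outer loop unrolled by VF=8, inner statement expressed as a dot product. */
static void foo_unrolled(const int8_t *restrict A, const int8_t *restrict B,
                         int32_t *restrict sum, int n, int m)
{
  assert(n >= 0 && m >= 0);
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    const int8_t *v_a = &A[i];              /* LOAD<vector(8) int8_t> */
    for (int j = 0; j < m; j++)
      sum[j] += dot_prod8(v_a, &B[T_FN(j) + i]);
  }
  /* Epilogue: the remaining n % 8 outer iterations, in the original form. */
  for (; i < n; ++i) {
    int8_t a = A[i];
    for (int j = 0; j < m; j++)
      sum[j] += a * B[T_FN(j) + i];
  }
}

/* Returns 1 if both versions produce identical sums on a small input. */
int results_match(void)
{
  enum { N = 19, M = 5 };                   /* N is not a multiple of 8 */
  int8_t A[N], B[M * STRIDE];
  int32_t s1[M] = {0}, s2[M] = {0};
  for (int i = 0; i < N; i++)
    A[i] = (int8_t)(i * 7 - 60);
  for (int k = 0; k < M * STRIDE; k++)
    B[k] = (int8_t)(k * 13 % 251 - 125);
  foo_ref(A, B, s1, N, M);
  foo_unrolled(A, B, s2, N, M);
  return memcmp(s1, s2, sizeof s1) == 0;
}
```

The equivalence holds because each sum[j] accumulates the same int32_t
products in both versions, only grouped eight at a time in the unrolled form.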