https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440

            Bug ID: 114440
           Summary: Fail to recognize a chain of lane-reduced operations
                    for loop reduction vect
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

When a loop reduction path contains a lane-reduced operation
(DOT_PROD/SAD/WIDEN_SUM), the current vectorizer cannot handle the reduction if
the path also contains other operations, whether normal or lane-reduced
ones. A pseudo-code example:

   char *d0, *d1;
   char *s0, *s1;
   char *w;
   int *n;

   ...
   int sum = 0;

   for (i) {
     ...
     sum += d0[i] * d1[i];       /* DOT_PROD */
     ...
     sum += abs(s0[i] - s1[i]);  /* SAD */
     ...
     sum += w[i];                /* WIDEN_SUM */
     ...
     sum += n[i];                /* Normal */
     ...
   }

   ... = sum;
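
For reference, the pseudo-code above could be written as a compilable function
along the following lines. This is only a sketch: the function name
mixed_reduction, the use of signed char for the narrow element type, and the
explicit length parameter are assumptions, not part of the original report.

```c
#include <stdlib.h>

/* Hypothetical compilable form of the pseudo example.  The element type
   "signed char" and the "len" parameter are assumptions made for this
   sketch.  */
int mixed_reduction(const signed char *d0, const signed char *d1,
                    const signed char *s0, const signed char *s1,
                    const signed char *w, const int *n, int len)
{
    int sum = 0;

    for (int i = 0; i < len; i++) {
        sum += d0[i] * d1[i];       /* DOT_PROD candidate */
        sum += abs(s0[i] - s1[i]);  /* SAD candidate */
        sum += w[i];                /* WIDEN_SUM candidate */
        sum += n[i];                /* normal int addition */
    }

    return sum;
}
```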

For such a reduction, the reduction vectype varies from one operation to the
next, which causes a mismatch between the number of vectorized defs and uses.
One possible fix is to generate extra trivial pass-through copies. Given a
concrete example:

   sum = 0; 
   for (i) {
     sum += d0[i] * d1[i];       /* 16*char -> 4*int */
     sum += n[i];                /*   4*int -> 4*int */
   }

The final vectorized statements could be:

   sum_v0 = { 0, 0, 0, 0 };
   sum_v1 = { 0, 0, 0, 0 };
   sum_v2 = { 0, 0, 0, 0 };
   sum_v3 = { 0, 0, 0, 0 };

   for (i / 16) {
     sum_v0 += DOT_PROD (v_d0[i: 0 .. 15], v_d1[i: 0 .. 15]);
     sum_v1 += 0;  // copy
     sum_v2 += 0;  // copy
     sum_v3 += 0;  // copy

     sum_v0 += v_n[i:  0 .. 3];
     sum_v1 += v_n[i:  4 .. 7];
     sum_v2 += v_n[i:  8 .. 11];
     sum_v3 += v_n[i: 12 .. 15]; 
   }

   sum = REDUC_PLUS(sum_v0 + sum_v1 + sum_v2 + sum_v3);
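
The scheme above can be checked at the source level with a scalar emulation.
The following is a minimal sketch, not vectorizer output: the helper names
(scalar_sum, dot_prod16, emulated_vector_sum) are hypothetical, the element
type is assumed to be signed char, the length is assumed to be a multiple of
16 (no epilogue loop), and the way DOT_PROD assigns products to lanes is an
assumption that does not affect the final sum.

```c
#include <stddef.h>

enum { LANES = 4, CHUNK = 16, NVEC = 4 };

/* Scalar reference for the concrete two-statement example.  */
int scalar_sum(const signed char *d0, const signed char *d1,
               const int *n, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += d0[i] * d1[i];
        sum += n[i];
    }
    return sum;
}

/* Emulates DOT_PROD: 16 char products are folded into 4 int lanes.
   Grouping four adjacent products per lane is an assumption.  */
static void dot_prod16(int acc[LANES], const signed char *a,
                       const signed char *b)
{
    for (int j = 0; j < CHUNK; j++)
        acc[j / 4] += a[j] * b[j];
}

/* Emulates the proposed vectorized form with four int accumulators.
   Assumes len is a multiple of CHUNK.  */
int emulated_vector_sum(const signed char *d0, const signed char *d1,
                        const int *n, size_t len)
{
    int sum_v[NVEC][LANES] = {{0}};

    for (size_t i = 0; i + CHUNK <= len; i += CHUNK) {
        /* Only sum_v[0] receives the dot-product; sum_v[1..3] are the
           trivial pass-through copies and stay unchanged here.  */
        dot_prod16(sum_v[0], d0 + i, d1 + i);

        /* The normal int addition uses all four accumulators.  */
        for (int k = 0; k < NVEC; k++)
            for (int l = 0; l < LANES; l++)
                sum_v[k][l] += n[i + k * LANES + l];
    }

    /* Final reduction across all accumulators and lanes.  */
    int sum = 0;
    for (int k = 0; k < NVEC; k++)
        for (int l = 0; l < LANES; l++)
            sum += sum_v[k][l];
    return sum;
}
```

Both functions should return the same value for any input whose length is a
multiple of 16, provided the int accumulation does not overflow.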

In the above sequence, each summation statement forms exactly one pattern.
However, it is easy to compose a slightly more complicated variant that runs
into the same situation: a chain of lane-reduced operations that comes from
the non-reduction addend of a single summation statement, like:

   sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

This probably requires an extension to the vector pattern formation stage so
that it can split such patterns.
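
At the source level, the intended split amounts to the rewrite sketched below.
This only illustrates the equivalence of the two forms. It does not show how
pattern formation would implement the split, and the function names
(sum_combined, sum_split), the signed char element type, and the length
parameter are assumptions.

```c
#include <stdlib.h>

/* Single summation statement whose addend chains several operations.  */
int sum_combined(const signed char *d0, const signed char *d1,
                 const signed char *s0, const signed char *s1,
                 const int *n, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];
    return sum;
}

/* The same reduction after splitting: one statement per operation, so
   each statement can match one pattern (assuming no int overflow).  */
int sum_split(const signed char *d0, const signed char *d1,
              const signed char *s0, const signed char *s1,
              const int *n, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += d0[i] * d1[i];       /* DOT_PROD candidate */
        sum += abs(s0[i] - s1[i]);  /* SAD candidate */
        sum += n[i];                /* normal int addition */
    }
    return sum;
}
```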