[Bug target/88497] Improve Accumulation in Auto-Vectorized Code

2019-03-04 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497

Kewen Lin  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|SUSPENDED                   |ASSIGNED

--- Comment #9 from Kewen Lin  ---
As Kelvin mentioned in the last comment, there is something we can teach
reassoc to improve the code below, although it is low priority.

double foo (double accumulator, vector double arg2[], vector double arg3[])
{
  vector double temp;

  temp = arg2[0] * arg3[0];
  accumulator += temp[0] + temp[1];
  temp = arg2[1] * arg3[1];
  accumulator += temp[0] + temp[1];
  temp = arg2[2] * arg3[2];
  accumulator += temp[0] + temp[1];
  temp = arg2[3] * arg3[3];
  accumulator += temp[0] + temp[1];
  return accumulator;
}

Confirmed.
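
For reference, a hedged sketch of the shape being asked for.  It uses GCC's
generic vector extension instead of the AltiVec "vector double" type so it is
not tied to powerpc; the names v2df and foo_reassoc are made up for
illustration.  The products are accumulated lane-wise and the two lanes are
added together only once at the end.  This changes the order of the
floating-point additions, so a compiler may only do it under -ffast-math (or
-fassociative-math).

```c
/* Two doubles per vector, using GCC's generic vector extension.  */
typedef double v2df __attribute__ ((vector_size (16)));

double foo_reassoc (double accumulator, v2df arg2[], v2df arg3[])
{
  v2df sum;

  /* Lane-wise multiply-accumulate; each statement after the first is a
     candidate for a fused vector multiply-add.  */
  sum = arg2[0] * arg3[0];
  sum += arg2[1] * arg3[1];
  sum += arg2[2] * arg3[2];
  sum += arg2[3] * arg3[3];

  /* A single horizontal add instead of four.  */
  return accumulator + (sum[0] + sum[1]);
}
```

Whether a given target turns the "sum +=" statements into xvmaddadp or a
separate multiply and add depends on the target flags and -ffp-contract.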

[Bug target/88497] Improve Accumulation in Auto-Vectorized Code

2019-01-23 Thread kelvin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497

kelvin at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |SUSPENDED
   Last reconfirmed|                            |2019-01-23
     Ever confirmed|0                           |1

--- Comment #8 from kelvin at gcc dot gnu.org ---
In revisiting this problem report, I have confirmed that if I specify
-ffast-math when I compile the original loop that motivated this problem
report, I get the desired optimization.  I believe my original discovery of
this optimization opportunity had omitted the -ffast-math option.
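
As background on why the flag matters: floating-point addition is not
associative, so without -ffast-math (which implies -fassociative-math) GCC
must keep the additions in source order.  A minimal sketch, with function
names invented for this example:

```c
/* Sum three doubles left to right: (a + b) + c.  */
double sum_left (double a, double b, double c)
{
  return (a + b) + c;
}

/* Sum the same three doubles right to left: a + (b + c).  */
double sum_right (double a, double b, double c)
{
  return a + (b + c);
}
```

With IEEE doubles and no fast-math flags, sum_left (1e30, -1e30, 1.0) gives
1.0, while sum_right (1e30, -1e30, 1.0) gives 0.0, because 1.0 is lost when
it is added to -1e30 first.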

I have also confirmed that reassociation does not produce the desired
translation of the loop body in isolation, even if -ffast-math is specified on
the command line, and even if I experiment with very large reassociation_width
values for the rs6000 target.

Apparently, the reassociation pass is not clever enough to recognize
opportunities to transform multiply-add accumulations into expressions that
favor use of the xvmaddadp instruction.  Reassociation may change the order in
which the sums computed for different vector element products are combined with
the accumulator.  But in my experience, reassociation does not discover the
opportunity to accumulate the products from different vectors using vector sum
instructions or even better, the vector multiply-add instruction.

Since the code produced for -ffast-math auto-vectorization of multiply-add
accumulation loops is "optimal", I am recommending future effort on this issue
be treated as low priority.

[Bug target/88497] Improve Accumulation in Auto-Vectorized Code

2019-01-22 Thread kelvin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497

--- Comment #7 from kelvin at gcc dot gnu.org ---
Here is the original program that motivated my simplified reproducer:

extern void first_dummy ();
extern void dummy (double sacc, int n);
extern void other_dummy ();

extern float opt_value;
extern char *opt_desc;

#define M 128
#define N 512

double x [N];
double y [N];

int main (int argc, char *argv []) {
  double sacc;

  first_dummy ();
  for (int j = 0; j < M; j++) {

    sacc = 0.00;
    for (unsigned long long int i = 0; i < N; i++) {
      sacc += x[i] * y[i];
    }
    dummy (sacc, N);
  }
  opt_value = ((float) N) * 2 * ((float) M);
  opt_desc = "flops";
  other_dummy ();
}
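
A hand-written sketch of roughly what vectorizing the inner loop amounts to,
assuming two doubles per vector and an even trip count.  The names v2df and
dot_vec2 are hypothetical, and this is not the code GCC actually emits; it
only shows the pattern of a lane-wise multiply-add inside the loop and a
single horizontal add after it.

```c
#include <stddef.h>

/* Two doubles per vector, using GCC's generic vector extension.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Dot product of x[0..n-1] and y[0..n-1]; n is assumed to be even.  */
double dot_vec2 (const double *x, const double *y, size_t n)
{
  v2df acc = { 0.0, 0.0 };

  for (size_t i = 0; i < n; i += 2)
    {
      v2df vx = { x[i], x[i + 1] };
      v2df vy = { y[i], y[i + 1] };
      acc += vx * vy;   /* lane-wise multiply-add */
    }

  /* One horizontal add after the loop.  */
  return acc[0] + acc[1];
}
```

For the program above, the inner loop would correspond to
sacc = dot_vec2 (x, y, N).  The lanes are summed in a different order than
the scalar loop, which is why the compiler needs -ffast-math to do this on
its own.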

[Bug target/88497] Improve Accumulation in Auto-Vectorized Code

2018-12-17 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497

--- Comment #6 from Bill Schmidt  ---
Reassociation width should be 4 for this case per the target hook.  Kelvin, you
can experiment with rs6000_reassociation_width to see if larger values give you
what you expect.
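
To make the term concrete: a reassociation width of N roughly means the pass
may split one serial chain of operations into up to N independent chains that
can issue in parallel.  The sketch below, with hypothetical function names,
contrasts a serial sum of eight values with a width-4 form written by hand.
It illustrates the idea only and is not what reassoc literally emits.

```c
/* Serial chain: every add depends on the previous one.  */
double sum8_serial (const double a[8])
{
  return ((((((a[0] + a[1]) + a[2]) + a[3]) + a[4]) + a[5]) + a[6]) + a[7];
}

/* Width 4: four independent adds first, then a short reduction.  */
double sum8_width4 (const double a[8])
{
  double t0 = a[0] + a[1];
  double t1 = a[2] + a[3];
  double t2 = a[4] + a[5];
  double t3 = a[6] + a[7];
  return (t0 + t1) + (t2 + t3);
}
```

Both forms compute the same mathematical sum, but the rounding can differ for
general inputs, so for floating point the rewrite is only allowed under
-ffast-math.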

[Bug target/88497] Improve Accumulation in Auto-Vectorized Code

2018-12-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |powerpc*
          Component|middle-end                  |target

--- Comment #5 from Richard Biener  ---
I think it is reassoc doing it "wrong" based on the target's reassoc-width?
Because the vectorizer generates exactly the code you are proposing.

Though you didn't even provide a fully compilable testcase.  I guessed N to
be 16 here and your ideal examples use 'accumulator' which I assume to be 0.0.

Your x86 code-gen examples are also from GCC 8, I assume (plus some
-march/-mtune flag you didn't expose, given it uses haddpd).