https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279
--- Comment #1 from Di Zhao <dizhao at os dot amperecomputing.com> ---
Here's a small example for the issue exposed in 508.namd_r:
#define LOOP_COUNT 800000000
typedef double data_e;

#include <stdio.h>

__attribute_noinline__ data_e
foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
{
  data_e tmp1, tmp2;
  data_e result = 0;

  for (int ic = 0; ic < LOOP_COUNT; ic++)
    {
      /* LHS is an operand of another FMA, re-writing to parallel is
         worse.  */
      tmp1 = a + c * c - d * d + x * y;
      tmp2 = x * tmp1;
      result += (a + c - d + tmp2);

      a -= 0.1;
      b += 0.9;
      c *= 1.02;
      x *= 0.1;
      y *= y;
      d *= 0.61;
    }

  return result;
}

int
main (int argc, char **argv)
{
  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
}

Tested on the following platforms, rewriting both op lists is worse than not
rewriting at all or rewriting only "result" (the compile options I used are
"-Ofast --param tree-reassoc-width=4 -march=native"):

run time      no rewrite   rewrite "result"   rewrite "tmp1"   rewrite both
---------------------------------------------------------------------------
Ampere1       1.80         1.93               2.04             2.10
Neoverse-n1   1.36         1.45               1.49             1.52
Intel Xeon    1.57         1.55               1.66             1.62
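
For reference, here is a rough sketch of the two evaluation shapes for the
"tmp1" expression. The function names tmp1_chain and tmp1_parallel are made up
for illustration, and the grouping shown for the parallel form is only one
possible split; what reassoc actually emits may differ.

```c
/* Hypothetical illustration: a single dependent chain.  Each step is
   "accumulator +/- product", so with FP contraction enabled the compiler
   may turn every step into one fused multiply-add.  */
static double
tmp1_chain (double a, double c, double d, double x, double y)
{
  double t = a + c * c;   /* candidate FMA */
  t = t - d * d;          /* candidate FMA (negated product) */
  return t + x * y;       /* candidate FMA */
}

/* Hypothetical illustration: the same sum split into two independent
   halves.  The halves can issue in parallel, but d * d now needs its own
   multiply and the final join is a plain add.  */
static double
tmp1_parallel (double a, double c, double d, double x, double y)
{
  double lo = a + c * c;      /* candidate FMA */
  double hi = x * y - d * d;  /* one multiply plus one candidate FMA */
  return lo + hi;             /* plain add, nothing left to fuse */
}
```

Both functions compute the same value when no rounding differences arise; the
parallel form trades at least one fused operation for a separate multiply and
add, which would be consistent with the slowdown measured above.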