GCC's reassociation pass has no knowledge of FMA, so when it rewrites
an expression list into parallel form it reduces the opportunities to
generate FMAs. There is currently a workaround on AArch64
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114), which
disables the parallelization for floating-point additions.
However, this approach may cause regressions. For example, in the
code below the calculation of "result += array[j]" involves only
floating-point additions, so rewriting it to parallel form is better:

// Compile with -Ofast on aarch64
float foo (int n, float in)
{
  float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
  float result = 0.0;
  for (int i = 0; i < n; i++)
    {
      if (i % 10)
        for (unsigned j = 0; j < 8; j++)
          array[j] *= in;

      for (unsigned j = 0; j < 8; j++)
        result += array[j];
    }
  return result;
}

To improve this, one option is to count the number of MUL_EXPRs in an
operator list before rewriting to parallel, and to allow the rewriting
only when there is none (or only one). This is simple and unlikely to
introduce regressions. However, it lacks flexibility and cannot handle
more general cases.
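
The counting idea can be sketched in plain C on a toy model. This is
only an illustration, not GCC code: toy_operand, count_mults and
allow_parallel_rewrite are hypothetical names, and the real pass works
on GIMPLE operand entries rather than on an array of flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one entry of a reassociation operand list.
   Hypothetical type, not a GCC data structure.  */
struct toy_operand
{
  bool defined_by_mult;   /* true if the operand is a product */
};

/* Count the operands that come from a multiplication, i.e. the
   operands that could later be fused with an addition into an FMA.  */
size_t
count_mults (const struct toy_operand *ops, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (ops[i].defined_by_mult)
      count++;
  return count;
}

/* Allow rewriting to parallel only when at most MAX_MULTS operands
   are products (0 or 1 in the option described above).  */
bool
allow_parallel_rewrite (const struct toy_operand *ops, size_t n,
                        size_t max_mults)
{
  return count_mults (ops, n) <= max_mults;
}
```

With a threshold of 1, a list made only of additions is rewritten,
while a list containing two or more products is left alone.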

Here's an attempt to address the issue more generally.

1. Added an additional widening_mul pass before the original reassoc2
pass. The new pass is limited to inserting FMAs only, and leaves other
operations such as convert_mult_to_widen to the old late widening_mul
pass, so that optimizations between the two passes are not hindered.

2. On some platforms, for a very long FMA chain, rewriting to parallel
can be faster. Extended the original "deferring" logic so that all
conversions to FMA can be deferred. Introduced a new parameter
op-count-prefer-reassoc to control this behavior.

3. Additionally, the new widening_mul pass calls execute_reassoc first,
to avoid losing opportunities such as folding constants and
undistributing.
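
To make the trade-off behind these steps concrete, here is a small C
sketch of a sum of four products written in the two shapes involved.
The function names are hypothetical, and the comments describe what a
compiler may do when FMA contraction is enabled (for example under
-Ofast); the actual instruction selection depends on target and flags.

```c
/* Serial shape: every "x * y + acc" is an FMA candidate, so this can
   become one multiply followed by three dependent FMAs.  */
float
sum_products_serial (const float *a, const float *b)
{
  float acc = a[0] * b[0];
  acc = a[1] * b[1] + acc;
  acc = a[2] * b[2] + acc;
  acc = a[3] * b[3] + acc;
  return acc;
}

/* Parallel shape: two independent halves that are added at the end.
   Each half can use one multiply and one FMA, but the final addition
   has no multiply left to fuse with.  */
float
sum_products_parallel (const float *a, const float *b)
{
  float lo = a[0] * b[0] + a[1] * b[1];
  float hi = a[2] * b[2] + a[3] * b[3];
  return lo + hi;
}
```

When fully contracted, the serial form needs four operations with a
dependency depth of four, while the parallel form needs five operations
with a depth of three. Which one is faster depends on the target.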

However, changing the order of FMA generation and reassociation may
expose more FMA chains that are slow (see commit 4a0d0ed2).
To reduce possible regressions, the handling of slow FMA chains is
improved by:

1. Modified result_of_phi to support checking an additional FADD/FMUL.

2. On some CPUs, rather than removing the whole FMA chain, skipping only
a few candidates may generate faster code. Added a new parameter
fskip-fma-heuristic to control this behavior.
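
As background on why an FMA chain can be slow, the C sketch below shows
a loop-carried accumulation in two shapes. It assumes that, on some
CPUs, an FMA has a longer latency than a plain addition; the function
names are hypothetical, and this is not the heuristic implemented by
the patch.

```c
#include <stddef.h>

/* Fused shape: "x[i] * y[i] + acc" can be contracted into a single
   FMA, which places the whole FMA latency on the loop-carried
   dependency through acc.  */
float
dot_fused_shape (const float *x, const float *y, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    acc = x[i] * y[i] + acc;
  return acc;
}

/* Split shape: the multiply does not depend on acc, so only the
   addition sits on the loop-carried dependency.  */
float
dot_split_shape (const float *x, const float *y, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    {
      float prod = x[i] * y[i];
      acc += prod;
    }
  return acc;
}
```

In C source the two functions are equivalent and the compiler decides
which form to emit; they are written separately here only to show the
two dependency shapes that the FMA pass chooses between.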

This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.

Thanks,
Di Zhao

Attachment: 0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch
