https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279
Bug ID: 110279
Summary: Regressions on aarch64 caused by handling FMA in reassoc (510.parest_r, 508.namd_r)
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: dizhao at os dot amperecomputing.com
Target Milestone: ---

Created attachment 55339
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55339&action=edit
[PATCH] Check for nested FMA chains in reassoc

After testing the recent patch "Handle FMA friendly in reassoc pass" (e5405f06) on some of our aarch64 machines, I found regressions in a few SPEC2017 fprate cases.

On ampere1, the patch introduced approximately a 2% regression in 510.parest_r. Additionally, with fp_reassoc_width changed so that reassociation actually works on floating-point additions (which brings about a 1% overall benefit), there is approximately a 5% regression in 508.namd_r on ampere1, and 2.6% on neoverse-n1.

The compile options we used are "-Ofast -mcpu=ampere1 -flto=32 --param avoid-fma-max-bits=512" for ampere1, and "-Ofast -mcpu=neoverse-n1 -flto=32" for neoverse-n1. The tests are single-copy runs.

Below are my findings:

1) From the perf results, the regression in 510.parest_r occurs because the re-arranging in rank_ops_for_fma() produced 2 FMAs in a small loop, with the last FMA's result fed back into the first one through a PHI. With avoid-fma-max-bits, these candidates are dropped in widening_mul, causing a 2% regression; without the parameter there is a 1% regression.

Before the patch, the generated code looks like:

label:
        ....
        fmul    v2, v2, v3
        fmla    v2, v4, v5
        fadd    v1, v1, v2
        ...
        b.ne    label

After the patch (without avoid-fma-max-bits):

label:
        ...
        fmla    v1, v2, v3
        fmla    v1, v4, v5
        ...
        b.ne    label

2) As for 508.namd_r, there are slightly fewer FMAs generated. It seems the patch does not handle nested FMA chains like ((a * b + c) * d + e) * ... well.
For example, below is a piece of the CFG before reassoc2:

  _797 = A_788 * _796;
  fast_c = _797 + _1161;
  _815 = diffa * fast_d;
  _816 = fast_c + _815;
  _817 = diffa * _816;
  fast_dir = fast_b + _817;

Before the patch, the optimized code looks like:

  fast_c = .FNMA (B_790, _798, _334);
  _816 = .FMA (diffa, fast_d, fast_c);
  fast_dir = .FMA (diffa, _816, fast_b);

After the patch:

  _815 = diffa * fast_d;
  _5910 = .FMA (A_788, _796, _815);
  _816 = _5909 + _5910;
  _817 = diffa * _816;
  _5908 = .FMA (A_788, _801, _817);
  fast_dir = _5907 + _5908;

I came up with a patch to solve this; I'll also attach it here.