https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110279
Bug ID: 110279
Summary: Regressions on aarch64 caused by handling FMA in reassoc (510.parest_r, 508.namd_r)
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: dizhao at os dot amperecomputing.com
Target Milestone: ---

Created attachment 55339
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55339&action=edit
[PATCH] Check for nested FMA chains in reassoc

After testing the recent patch "Handle FMA friendly in reassoc pass" (e5405f06) on some of our aarch64 machines, I found regressions in a few SPEC2017 fprate cases.

On ampere1, the patch introduced approximately a 2% regression in 510.parest_r. Additionally, with fp_reassoc_width changed so that reassociation actually works on floating-point additions (which brings about a 1% overall benefit), there is approximately a 5% regression in 508.namd_r on ampere1, and 2.6% on neoverse-n1.

The compile options we used are "-Ofast -mcpu=ampere1 -flto=32 --param avoid-fma-max-bits=512" for ampere1, and "-Ofast -mcpu=neoverse-n1 -flto=32" for neoverse-n1. The tests are single-copy runs.

Below are my findings:

1) From the perf results, the regression in 510.parest_r occurs because the re-arranging in rank_ops_for_fma() produced 2 FMAs in a small loop, with the last FMA's result fed back into the first one through a PHI. With avoid-fma-max-bits, these candidates are dropped in widening_mul, causing a 2% regression; without the parameter there is a 1% regression.

Before the patch, the generated code looks like:

label:
        ....
        fmul    v2, v2, v3
        fmla    v2, v4, v5
        fadd    v1, v1, v2
        ...
        b.ne    label

After the patch (without avoid-fma-max-bits):

label:
        ...
        fmla    v1, v2, v3
        fmla    v1, v4, v5
        ...
        b.ne    label

2) As for 508.namd_r, there are slightly fewer FMAs generated. It seems the patch does not handle nested FMA chains like ((a * b + c) * d + e) * ... well.
For example, below is a piece of the CFG before reassoc2:

  _797 = A_788 * _796;
  fast_c = _797 + _1161;
  _815 = diffa * fast_d;
  _816 = fast_c + _815;
  _817 = diffa * _816;
  fast_dir = fast_b + _817;

Before the patch, the optimized code looks like:

  fast_c = .FNMA (B_790, _798, _334);
  _816 = .FMA (diffa, fast_d, fast_c);
  fast_dir = .FMA (diffa, _816, fast_b);

After the patch:

  _815 = diffa * fast_d;
  _5910 = .FMA (A_788, _796, _815);
  _816 = _5909 + _5910;
  _817 = diffa * _816;
  _5908 = .FMA (A_788, _801, _817);
  fast_dir = _5907 + _5908;

I came up with a patch to solve this; I'll also attach it here.