GCC's reassociation pass has no knowledge of FMA, so when it rewrites expression lists into parallel form it reduces the opportunities to generate FMAs. Currently there is a workaround on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114), which is to disable the parallelization with floating-point additions. However, this approach may cause regressions. For example, in the code below there are only floating-point additions when calculating "result += array[j]", and rewriting to parallel is better:

// Compile with -Ofast on aarch64
float
foo (int n, float in)
{
  float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
  float result = 0.0;
  for (int i = 0; i < n; i++)
    {
      if (i % 10)
        for (unsigned j = 0; j < 8; j++)
          array[j] *= in;

      for (unsigned j = 0; j < 8; j++)
        result += array[j];
    }
  return result;
}

To improve this, one option is to count the number of MUL_EXPRs in an operator list before rewriting to parallel, and allow the rewriting when there is none (or only 1 MUL_EXPR). This is simple and unlikely to introduce regressions. However, it lacks flexibility and cannot handle more general cases.

Here's an attempt to address the issue more generally.

1. Added an additional widening_mul pass before the original reassoc2 pass. The new pass only inserts FMAs and leaves other operations, such as convert_mult_to_widen, to the existing late widening_mul pass, in case other optimizations between the two passes could be hindered.

2. On some platforms, for a very long FMA chain, rewriting to parallel can be faster. Extended the original "deferring" logic so that all conversions to FMA can be deferred. Introduced a new parameter op-count-prefer-reassoc to control this behavior.

3. Additionally, the new widening_mul pass calls execute_reassoc first, to avoid losing opportunities such as folding constants and undistributing.

However, changing the order of FMA generation and reassociation may expose more FMA chains that are slow (see commit 4a0d0ed2). To reduce possible regressions, the handling of slow FMA chains is improved by:

1. Modified result_of_phi to support checking an additional FADD/FMUL.

2. On some CPUs, rather than removing the whole FMA chain, only skipping a few candidates may generate faster code. Added new parameter fskip-fma-heuristic to control this behavior.

This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.

Thanks,
Di Zhao
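
For readers less familiar with the trade-off discussed above, here is a minimal hand-written sketch of the two shapes involved: a single dependent multiply-add chain versus a parallel rewrite of the same sum. It is an illustration only, not output of the patch or of any GCC pass; the function names dot4_chain and dot4_parallel are hypothetical, and whether each multiply-add actually becomes one FMA instruction depends on the target and on flags such as -ffp-contract.

```c
/* Serial shape: every step depends on the previous accumulator value,
   so the four multiply-adds (FMA candidates) form one dependent chain. */
float
dot4_chain (const float *a, const float *b)
{
  float acc = 0.0f;
  for (int i = 0; i < 4; i++)
    acc = a[i] * b[i] + acc;
  return acc;
}

/* Parallel shape: two independent partial sums that can execute
   side by side, joined by one final addition.  Each partial sum
   still contains one multiply-add that may be contracted to an FMA. */
float
dot4_parallel (const float *a, const float *b)
{
  float lo = a[1] * b[1] + a[0] * b[0];
  float hi = a[3] * b[3] + a[2] * b[2];
  return lo + hi;
}
```

The chain form has the longest dependency path, while the parallel form shortens the path at the cost of an extra addition; which one is faster depends on the CPU's FMA latency and issue width.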
0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch