On Thu, May 11, 2023 at 5:20 PM Cui, Lili <lili....@intel.com> wrote: > > > -----Original Message----- > > From: Richard Biener <richard.guent...@gmail.com> > > Sent: Thursday, May 11, 2023 6:53 PM > > To: Cui, Lili <lili....@intel.com> > > Cc: gcc-patches@gcc.gnu.org > > Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of > > the chain with FMA in reassoc pass > > Hi Richard, > Thanks for helping to review the patch. > > > > > As you are not changing the number of ops you should be able to use > > quick_push here and below. You should be able to do > > > > ops->splice (ops_mult); > > ops->splice (ops_others); > > > > as well. > > > Done. > > > > + /* When enabling param_reassoc_max_chain_length_with_fma > > to > > > + keep the chain with fma, rank_ops_for_fma will > > > detect if > > > + the chain has fmas and if so it will rearrange the > > > ops. */ > > > + if (param_reassoc_max_chain_length_with_fma > 1 > > > + && direct_internal_fn_supported_p (IFN_FMA, > > > + TREE_TYPE (lhs), > > > + opt_type) > > > + && (rhs_code == PLUS_EXPR || rhs_code == > > > MINUS_EXPR)) > > > + { > > > + keep_fma_chain = rank_ops_for_fma(&ops); > > > + } > > > + > > > + int len = ops.length (); > > > /* Only rewrite the expression tree to parallel in the > > > last reassoc pass to avoid useless work > > > back-and-forth > > > with initial linearization. */ > > > > we are doing the parallel rewrite only in the last reassoc pass, i think it > > makes > > sense to do the same for reassoc-for-fma. > > I rearranged the order of ops in reassoc1 without break the chain, it > generated more vectorize during vector pass( seen in benchmark 503). So I > rewrite the ssa tree and keep the chain with function "rewrite_expr_tree" in > reassoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in > reassoc2. > > > > > Why do the existing expr rewrites not work after re-sorting the ops? > > For case https://godbolt.org/z/3x9PWE9Kb: we put "j" at first. > > j + l * m + a * b + c * d + e * f + g * h; > > GCC trunk: width = 2, ops_num = 6, old function " rewrite_expr_tree_parallel > " generates 3 FMAs. > --------------------------------------------------------------------------------------- > _1 = l_10(D) * m_11(D); > _3 = a_13(D) * b_14(D); > _4 = j_12(D) + _3; --------> Here is one FMA. > _5 = c_15(D) * d_16(D); > _8 = _1 + _5; --------> Here is one FMA and lost one. > _7 = e_17(D) * f_18(D); > _9 = g_19(D) * h_20(D); > _2 = _7 + _9; --------> Here is one FMA and lost one. > _6 = _2 + _4; > _21 = _6 + _8; > # VUSE <.MEM_22(D)> > return _21; > -------------------------------------------------------------------------------------- > width = 2, ops_num = 6, new function " rewrite_expr_tree_parallel_for_fma " > generates 4 FMAs. > ------------------------------------------------------------------------------ > _1 = a_10(D) * b_11(D); > _3 = c_13(D) * d_14(D); > _5 = e_15(D) * f_16(D); > _7 = g_17(D) * h_18(D); > _4 = _5 + _7; --------> Here is one FMA and lost one. > _8 = _4 + _1; --------> Here is one FMA. > _9 = l_19(D) * m_20(D); > _2 = _9 + j_12(D); --------> Here is one FMA. > _6 = _2 + _3; --------> Here is one FMA. > _21 = _8 + _6; > return _21; > ------------------------------------------------------------------------------------
ISTR there were no sufficient comments in the code explaining why rewrite_expr_tree_parallel_for_fma is better by design. In fact ... > > > > > > if (!reassoc_insert_powi_p > > > - && ops.length () > 3 > > > + && len > 3 > > > + && (!keep_fma_chain > > > + || (keep_fma_chain > > > + && len > > > > + param_reassoc_max_chain_length_with_fma)) > > > > in the case len < param_reassoc_max_chain_length_with_fma we have the > > chain re-sorted but fall through to non-parallel rewrite. I wonder if we do > > not want to instead adjust the reassociation width? I'd say it depends on > > the > > number of mult cases in the chain (sth the re-sorting could have computed). > > Why do we have two completely independent --params here? Can you give > > an example --param value combination that makes "sense" and show how it > > is beneficial? > > For this small case https://godbolt.org/z/Pxczrre8P > a * b + c * d + e * f + j > > GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp > cost is 4). Calculated: Width = 2. we can get 2 FMAs. > ---------------------------------- > _1 = a_6(D) * b_7(D); > _2 = c_8(D) * d_9(D); > _5 = _1 + _2; > _4 = e_10(D) * f_11(D); > _3 = _4 + j_12(D); > _13 = _3 + _5; > -------------------------------------------------------- > _2 = c_8(D) * d_9(D); > _5 = .FMA (a_6(D), b_7(D), _2); > _3 = .FMA (e_10(D), f_11(D), j_12(D)); > _13 = _3 + _5; > -------------------------------------------------------- > New patch: If just rearrange ops and fall through to parallel rewrite to > break the chain with width = 2. > > --------------------------------------------------------- > _1 = a_6(D) * b_7(D); > _2 = j + _1; -----> put j at the first. > _3 = c_8(D) * d_9(D); > _4 = e_10(D) * f_11(D); > _5 = _3 + _4; -----> break chain with width = 2. we lost a FMA here. > _13 = _2 + 5; > > ------------------------------------------------------- > _3 = c_8(D) * d_9(D); > _2 = .FMA (a_6(D), b_7(D), j); > _5 = .FMA (e_10(D), f_11(D), _3); > _13 = _2 + _5; > -------------------------------------------------------- > Sometimes break chain will lose FMA( break chain needs put two mult-ops > together, which will lose one FMA ), we can only get 2 FMAs here, if we want > to get 3 FMAs, we need to keep the chain and not break it. So I added a param > to control chain length "param_reassoc_max_chain_length_with_fma = 4" (For > the small case in Bugzilla 98350, we need to keep the chain to generate 6 > FMAs.) > ------------------------------------------------------- > _1 = a_6(D) * b_7(D); > _2 = c_8(D) * d_9(D); > _4 = e_10(D) * f_11(D); > _15 = _4 + j_12(D); > _16 = _15 + _2; > _13 = _16 + _1; > ------------------------------------------------------- > _15 = .FMA (e_10(D), f_11(D), j_12(D)); > _16 = .FMA (c_8(D), d_9(D), _15); > _13 = .FMA (a_6(D), b_7(D), _16); > ------------------------------------------------------- > In some case we want to break the chain with width, we can set > "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and > break the chain with width. ... it sounds like the problem could be fully addressed by sorting the chain with reassoc-width in mind? Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of mul and a vector of non-mul ops so it can pick from the optimal candidate? That said, I think rewrite_expr_tree_parallel_for_fma at least needs more comments. > > > > > && (width = get_reassociation_width (ops_num, > > > rhs_code, > > > - mode)) > 1) > > > + mode)) > > > > + 1) > > > { > > > - if (dump_file && (dump_flags & TDF_DETAILS)) > > > - fprintf (dump_file, > > > - "Width = %d was chosen for > > > reassociation\n", > > > - width); > > > - rewrite_expr_tree_parallel (as_a <gassign *> (stmt), > > > - width, ops); > > > + if (keep_fma_chain) > > > + { > > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > > + fprintf (dump_file, > > > + "Break chain len = %d into width for > > > FMA\n", > > > + len); > > > + rewrite_expr_tree_parallel_for_fma > > > + (as_a <gassign *> (stmt), width, ops); > > > + } > > > + else > > > + { > > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > > + fprintf (dump_file, > > > + "Width = %d was chosen for > > > reassociation\n", > > > + width); > > > + rewrite_expr_tree_parallel (as_a <gassign *> > > > (stmt), > > > + width, ops); > > > + } > > > } > > > else > > > - { > > > - /* When there are three operands left, we want > > > - to make sure the ones that get the double > > > - binary op are chosen wisely. */ > > > - int len = ops.length (); > > > - if (len >= 3) > > > + { > > > + /* When there are three operands left, we want > > > + to make sure the ones that get the double > > > + binary op are chosen wisely. */ > > > + if (len >= 3 && !keep_fma_chain) > > > swap_ops_for_binary_stmt (ops, len - 3); > > > > > > new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops, > > > powi_result != NULL > > > || negate_result, > > > len != orig_len); > > > - } > > > - > > > + } > > > /* If we combined some repeated factors into a > > > __builtin_powi call, multiply that result by the > > > reassociated operands. */ > > > -- > > > 2.25.1 > > >