> -----Original Message----- > From: Richard Biener <richard.guent...@gmail.com> > Sent: Thursday, May 11, 2023 6:53 PM > To: Cui, Lili <lili....@intel.com> > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of > the chain with FMA in reassoc pass
Hi Richard, Thanks for helping to review the patch. > > As you are not changing the number of ops you should be able to use > quick_push here and below. You should be able to do > > ops->splice (ops_mult); > ops->splice (ops_others); > > as well. > Done. > > + /* When enabling param_reassoc_max_chain_length_with_fma > to > > + keep the chain with fma, rank_ops_for_fma will detect > > if > > + the chain has fmas and if so it will rearrange the > > ops. */ > > + if (param_reassoc_max_chain_length_with_fma > 1 > > + && direct_internal_fn_supported_p (IFN_FMA, > > + TREE_TYPE (lhs), > > + opt_type) > > + && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR)) > > + { > > + keep_fma_chain = rank_ops_for_fma(&ops); > > + } > > + > > + int len = ops.length (); > > /* Only rewrite the expression tree to parallel in the > > last reassoc pass to avoid useless work back-and-forth > > with initial linearization. */ > > we are doing the parallel rewrite only in the last reassoc pass, i think it > makes > sense to do the same for reassoc-for-fma. I rearranged the order of ops in reassoc1 without break the chain, it generated more vectorize during vector pass( seen in benchmark 503). So I rewrite the ssa tree and keep the chain with function "rewrite_expr_tree" in reassoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in reassoc2. > > Why do the existing expr rewrites not work after re-sorting the ops? For case https://godbolt.org/z/3x9PWE9Kb: we put "j" at first. j + l * m + a * b + c * d + e * f + g * h; GCC trunk: width = 2, ops_num = 6, old function " rewrite_expr_tree_parallel " generates 3 FMAs. --------------------------------------------------------------------------------------- _1 = l_10(D) * m_11(D); _3 = a_13(D) * b_14(D); _4 = j_12(D) + _3; --------> Here is one FMA. _5 = c_15(D) * d_16(D); _8 = _1 + _5; --------> Here is one FMA and lost one. _7 = e_17(D) * f_18(D); _9 = g_19(D) * h_20(D); _2 = _7 + _9; --------> Here is one FMA and lost one. _6 = _2 + _4; _21 = _6 + _8; # VUSE <.MEM_22(D)> return _21; -------------------------------------------------------------------------------------- width = 2, ops_num = 6, new function " rewrite_expr_tree_parallel_for_fma " generates 4 FMAs. ------------------------------------------------------------------------------ _1 = a_10(D) * b_11(D); _3 = c_13(D) * d_14(D); _5 = e_15(D) * f_16(D); _7 = g_17(D) * h_18(D); _4 = _5 + _7; --------> Here is one FMA and lost one. _8 = _4 + _1; --------> Here is one FMA. _9 = l_19(D) * m_20(D); _2 = _9 + j_12(D); --------> Here is one FMA. _6 = _2 + _3; --------> Here is one FMA. _21 = _8 + _6; return _21; ------------------------------------------------------------------------------------ > > > if (!reassoc_insert_powi_p > > - && ops.length () > 3 > > + && len > 3 > > + && (!keep_fma_chain > > + || (keep_fma_chain > > + && len > > > + param_reassoc_max_chain_length_with_fma)) > > in the case len < param_reassoc_max_chain_length_with_fma we have the > chain re-sorted but fall through to non-parallel rewrite. I wonder if we do > not want to instead adjust the reassociation width? I'd say it depends on the > number of mult cases in the chain (sth the re-sorting could have computed). > Why do we have two completely independent --params here? Can you give > an example --param value combination that makes "sense" and show how it > is beneficial? For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e * f + j GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp cost is 4). Calculated: Width = 2. we can get 2 FMAs. ---------------------------------- _1 = a_6(D) * b_7(D); _2 = c_8(D) * d_9(D); _5 = _1 + _2; _4 = e_10(D) * f_11(D); _3 = _4 + j_12(D); _13 = _3 + _5; -------------------------------------------------------- _2 = c_8(D) * d_9(D); _5 = .FMA (a_6(D), b_7(D), _2); _3 = .FMA (e_10(D), f_11(D), j_12(D)); _13 = _3 + _5; -------------------------------------------------------- New patch: If just rearrange ops and fall through to parallel rewrite to break the chain with width = 2. --------------------------------------------------------- _1 = a_6(D) * b_7(D); _2 = j + _1; -----> put j at the first. _3 = c_8(D) * d_9(D); _4 = e_10(D) * f_11(D); _5 = _3 + _4; -----> break chain with width = 2. we lost a FMA here. _13 = _2 + 5; ------------------------------------------------------- _3 = c_8(D) * d_9(D); _2 = .FMA (a_6(D), b_7(D), j); _5 = .FMA (e_10(D), f_11(D), _3); _13 = _2 + _5; -------------------------------------------------------- Sometimes break chain will lose FMA( break chain needs put two mult-ops together, which will lose one FMA ), we can only get 2 FMAs here, if we want to get 3 FMAs, we need to keep the chain and not break it. So I added a param to control chain length "param_reassoc_max_chain_length_with_fma = 4" (For the small case in Bugzilla 98350, we need to keep the chain to generate 6 FMAs.) ------------------------------------------------------- _1 = a_6(D) * b_7(D); _2 = c_8(D) * d_9(D); _4 = e_10(D) * f_11(D); _15 = _4 + j_12(D); _16 = _15 + _2; _13 = _16 + _1; ------------------------------------------------------- _15 = .FMA (e_10(D), f_11(D), j_12(D)); _16 = .FMA (c_8(D), d_9(D), _15); _13 = .FMA (a_6(D), b_7(D), _16); ------------------------------------------------------- In some case we want to break the chain with width, we can set "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and break the chain with width. > > > && (width = get_reassociation_width (ops_num, > > rhs_code, > > - mode)) > 1) > > + mode)) > > > + 1) > > { > > - if (dump_file && (dump_flags & TDF_DETAILS)) > > - fprintf (dump_file, > > - "Width = %d was chosen for > > reassociation\n", > > - width); > > - rewrite_expr_tree_parallel (as_a <gassign *> (stmt), > > - width, ops); > > + if (keep_fma_chain) > > + { > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, > > + "Break chain len = %d into width for > > FMA\n", > > + len); > > + rewrite_expr_tree_parallel_for_fma > > + (as_a <gassign *> (stmt), width, ops); > > + } > > + else > > + { > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, > > + "Width = %d was chosen for > > reassociation\n", > > + width); > > + rewrite_expr_tree_parallel (as_a <gassign *> > > (stmt), > > + width, ops); > > + } > > } > > else > > - { > > - /* When there are three operands left, we want > > - to make sure the ones that get the double > > - binary op are chosen wisely. */ > > - int len = ops.length (); > > - if (len >= 3) > > + { > > + /* When there are three operands left, we want > > + to make sure the ones that get the double > > + binary op are chosen wisely. */ > > + if (len >= 3 && !keep_fma_chain) > > swap_ops_for_binary_stmt (ops, len - 3); > > > > new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops, > > powi_result != NULL > > || negate_result, > > len != orig_len); > > - } > > - > > + } > > /* If we combined some repeated factors into a > > __builtin_powi call, multiply that result by the > > reassociated operands. */ > > -- > > 2.25.1 > >