On Wed, 10 Jan 2018, Martin Jambor wrote:

> Hello,
> 
> I would really like to ping the FMA transformation prevention patch
> that I sent here in December and which, after incorporating a
> suggestion from Richi, re-basing and re-testing, I re-post below.  I
> really think that it should make it into GCC 8 in some form, because
> the performance wins are really big.
> 
> I am still open to all sorts of comments, of course, especially to
> suggestions about the form of the param controlling this behavior (or
> how to communicate that we want to do this on Zen in general).  It might
> even be a binary value since not forming FMAs does not seem to harm
> bigger vectors either (as far as we know, I should add).
> 
> For the record, I have not yet received any information from AMD as to
> why FMAs on 256 bit vectors do not have this problem; I assume all the
> people who could give an authoritative answer are now looking into covert
> channel mitigation stuff.  But very probably it is just that the
> internally split FMAs can be scheduled so that while one is still
> waiting for its addend, another can already execute.

OK, and sorry for the delay.

Thanks,
Richard.

> Thanks,
> 
> Martin
> 
> 
> On Fri, Dec 15 2017, Martin Jambor wrote:
> > Hello,
> >
> > the patch below prevents creation of fused-multiply-and-add instructions
> > in the widening_mul gimple pass on the Zen-based AMD CPUs and as a
> > result fixes regressions of native znver1 tuning when compared to
> > generic tuning in:
> >
> >   - the matrix.c testcase of PR 81616 (straightforward matrix
> >     multiplication) at -O2 and -O3 which is currently 60% (!),
> >
> >   - SPEC 2006 454.calculix at -O2, which is currently over 20%, and
> >
> >   - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
> >     about 20% in both cases.
> >
> > The basic idea is to detect loops in the following form:
> >
> >     <bb 6>
> >     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >     ...
> >     _65 = _14 * _16;
> >     accumulator_66 = _65 + accumulator_111;
> >
> > and prevents creating an FMA for it.  Because at least the parest and
> > calculix cases require it, it also deals with chains of more than one
> > FMA candidate, in which the result of each feeds the next one's addend:
> >
> >
> >     <bb 6>
> >     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >     ...
> >     _65 = _14 * _16;
> >     accumulator_55 = _65 + accumulator_111;
> >     _75 = _24 * _36;
> >     accumulator_66 = _75 + accumulator_55;
> >
> > Unfortunately, to really get rid of the calculix regression, the
> > algorithm cannot just look at one BB at a time but also has to work for
> > cases like the following:
> >
> >      1  void mult(void)
> >      2  {
> >      3      int i, j, k, l;
> >      4  
> >      5     for(i=0; i<SIZE; ++i)
> >      6     {
> >      7        for(j=0; j<SIZE; ++j)
> >      8        {
> >      9           for(l=0; l<SIZE; l+=10)
> >     10           {
> >     11               c[i][j] += a[i][l] * b[k][l];
> >     12               for(k=1; k<10; ++k)
> >     13               {
> >     14                   c[i][j] += a[i][l+k] * b[k][l+k];
> >     15               }
> >     16  
> >     17           }
> >     18        }
> >     19     }
> >     20  }
> >
> > where the FMA on line 14 feeds into the one on line 11 in an
> > encompassing loop.  Therefore I have changed the structure of the pass
> > to work in reverse dominance order and it keeps a hash set of results
> > of rejected FMA candidates which it checks when looking at PHI nodes of
> > the current BB.  Without this reorganization, calculix was still 8%
> > slower with native tuning than with the generic one.
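> >
> > In pseudocode, the new per-BB epilogue does roughly the following
> > (simplified from the code in the patch below):
> >
> >     if (fma_state.m_deferring_p && fma_state.m_initial_phi)
> >       {
> >         if (last_fma_candidate_feeds_initial_phi (&fma_state,
> >                                                   &m_last_result_set))
> >           m_last_result_set.add (fma_state.m_last_result);
> >         else
> >           cancel_fma_deferring (&fma_state);
> >       }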
> >
> > When the deferring mechanism realizes that in the current BB, the FMA
> > candidates do not all form one tight-loop chain like in the examples
> > above, it goes back to all the previously deferred candidates (in the
> > current BB only) and performs the transformation.
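> >
> > For illustration, in a BB like the following (a made-up sketch, not
> > taken from any benchmark), the addend of the second candidate does not
> > use the result of the first one, so deferring is cancelled and both
> > are transformed into FMAs after all:
> >
> >     <bb 6>
> >     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >     ...
> >     _65 = _14 * _16;
> >     accumulator_66 = _65 + accumulator_111;
> >     _75 = _24 * _36;
> >     other_55 = _75 + other_99;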
> >
> > The main reason for tracking only a single chain is to keep the patch
> > conservative (and also simple), but it also means that the following
> > function is not affected and is still 20% slower when compiled with
> > native march and tuning compared to the generic ones:
> >
> >      1  void mult(struct s *p1, struct s *p2)
> >      2  {
> >      3     int i, j, k;
> >      4  
> >      5     for(i=0; i<SIZE; ++i)
> >      6     {
> >      7        for(j=0; j<SIZE; ++j)
> >      8        {
> >      9           for(k=0; k<SIZE; ++k)
> >     10           {
> >     11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
> >     12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
> >     13           }
> >     14        }
> >     15     }
> >     16  }
> >
> > I suppose that the best optimization for the above would be to split
> > the loop (see the sketch below), but one could probably construct at
> > least an artificial testcase where the FMAs would keep enough locality
> > that splitting is not the best option.  The mechanism can be easily
> > extended to keep track of not just one chain but a few, preferably as
> > a followup, if people think it makes sense.
> >
> > An interesting observation is that the matrix multiplication does not
> > suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
> > Apparently the 256 bit vector processing can hide the latency penalty
> > when it is internally split into two halves.  The same goes for 512 bit
> > vectors.  That is why the patch leaves those be - well, there is a param
> > for the threshold which is set to zero for everybody but znver1.  If
> > maintainers of any other architecture suspect that their FMAs might
> > suffer a similar latency problem, they can easily try tweaking that
> > parameter and see what happens with the matrix multiplication example.
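> >
> > For example, something along the following lines (just an
> > illustration, with whatever -march is of interest) should show what
> > happens when FMAs are avoided on 256 bit and smaller vectors too:
> >
> >     gcc -O3 -march=native --param avoid-fma-max-bits=256 matrix.c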
> >
> > I have bootstrapped and tested the patch on x86_64-linux (as it is and
> > also with the param set to 256 by default to make it trigger).  I have
> > also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017
> > FPrate and the only changes are the big improvements of calculix and
> > parest.
> >
> 
> 2018-01-09  Martin Jambor  <mjam...@suse.cz>
> 
>       PR target/81616
>       * params.def (PARAM_AVOID_FMA_MAX_BITS): New parameter.
>       * tree-ssa-math-opts.c: Include domwalk.h.
>       (convert_mult_to_fma_1): New function.
>       (fma_transformation_info): New type.
>       (fma_deferring_state): Likewise.
>       (cancel_fma_deferring): New function.
>       (result_of_phi): Likewise.
>       (last_fma_candidate_feeds_initial_phi): Likewise.
>       (convert_mult_to_fma): Added deferring logic, split actual
>       transformation to convert_mult_to_fma_1.
>       (math_opts_dom_walker): New type.
>       (math_opts_dom_walker::after_dom_children): New method, body moved
>       here from pass_optimize_widening_mul::execute, added deferring logic
>       bits.
>       (pass_optimize_widening_mul::execute): Moved most of code to
>       math_opts_dom_walker::after_dom_children.
>       * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
>       * config/i386/i386.c (ix86_option_override_internal): Added
>       maybe_set_param_value call for PARAM_AVOID_FMA_MAX_BITS.
> ---
>  gcc/config/i386/i386.c       |   5 +
>  gcc/config/i386/x86-tune.def |   4 +
>  gcc/params.def               |   5 +
>  gcc/tree-ssa-math-opts.c     | 517 ++++++++++++++++++++++++++++++++-----------
>  4 files changed, 403 insertions(+), 128 deletions(-)
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 8696f931806..8200a136dd1 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -4900,6 +4900,11 @@ ix86_option_override_internal (bool main_args_p,
>       (cf_protection_level) (opts->x_flag_cf_protection | CF_SET);
>      }
>  
> +  if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])
> +    maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,
> +                        opts->x_param_values,
> +                        opts_set->x_param_values);
> +
>    return true;
>  }
>  
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 9401d51cdc1..0effce759f1 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
>  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>            m_SILVERMONT | m_INTEL)
>  
> +/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> +   smaller FMA chain.  */
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1)
> +
>  /*****************************************************************************/
>  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
>  /*****************************************************************************/
> diff --git a/gcc/params.def b/gcc/params.def
> index d9f8d8208a1..a0cd06db339 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1326,6 +1326,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL,
>        "Maximum unroll factor for the unroll-and-jam transformation.",
>        4, 0, 0)
>  
> +DEFPARAM(PARAM_AVOID_FMA_MAX_BITS,
> +      "avoid-fma-max-bits",
> +      "Maximum number of bits for which we avoid creating FMAs.",
> +      0, 0, 512)
> +
>  /*
>  
>  Local variables:
> diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
> index ea880c7b1d8..16d9399af0b 100644
> --- a/gcc/tree-ssa-math-opts.c
> +++ b/gcc/tree-ssa-math-opts.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "optabs-libfuncs.h"
>  #include "tree-eh.h"
>  #include "targhooks.h"
> +#include "domwalk.h"
>  
>  /* This structure represents one basic block that either computes a
>     division, or is a common dominator for basic block that compute a
> @@ -2639,17 +2640,214 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
>    return true;
>  }
>  
> +/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
> +   OP2 and which we know is used in statements that can be, together with the
> +   multiplication, converted to FMAs, perform the transformation.  */
> +
> +static void
> +convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
> +{
> +  tree type = TREE_TYPE (mul_result);
> +  gimple *use_stmt;
> +  imm_use_iterator imm_iter;
> +  gassign *fma_stmt;
> +
> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
> +    {
> +      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
> +      enum tree_code use_code;
> +      tree addop, mulop1 = op1, result = mul_result;
> +      bool negate_p = false;
> +
> +      if (is_gimple_debug (use_stmt))
> +     continue;
> +
> +      use_code = gimple_assign_rhs_code (use_stmt);
> +      if (use_code == NEGATE_EXPR)
> +     {
> +       result = gimple_assign_lhs (use_stmt);
> +       use_operand_p use_p;
> +       gimple *neguse_stmt;
> +       single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
> +       gsi_remove (&gsi, true);
> +       release_defs (use_stmt);
> +
> +       use_stmt = neguse_stmt;
> +       gsi = gsi_for_stmt (use_stmt);
> +       use_code = gimple_assign_rhs_code (use_stmt);
> +       negate_p = true;
> +     }
> +
> +      if (gimple_assign_rhs1 (use_stmt) == result)
> +     {
> +       addop = gimple_assign_rhs2 (use_stmt);
> +       /* a * b - c -> a * b + (-c)  */
> +       if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> +         addop = force_gimple_operand_gsi (&gsi,
> +                                           build1 (NEGATE_EXPR,
> +                                                   type, addop),
> +                                           true, NULL_TREE, true,
> +                                           GSI_SAME_STMT);
> +     }
> +      else
> +     {
> +       addop = gimple_assign_rhs1 (use_stmt);
> +       /* a - b * c -> (-b) * c + a */
> +       if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> +         negate_p = !negate_p;
> +     }
> +
> +      if (negate_p)
> +     mulop1 = force_gimple_operand_gsi (&gsi,
> +                                        build1 (NEGATE_EXPR,
> +                                                type, mulop1),
> +                                        true, NULL_TREE, true,
> +                                        GSI_SAME_STMT);
> +
> +      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
> +                                   FMA_EXPR, mulop1, op2, addop);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +     {
> +       fprintf (dump_file, "Generated FMA ");
> +       print_gimple_stmt (dump_file, fma_stmt, 0, 0);
> +       fprintf (dump_file, "\n");
> +     }
> +
> +      gsi_replace (&gsi, fma_stmt, true);
> +      widen_mul_stats.fmas_inserted++;
> +    }
> +}
> +
> +/* Data necessary to perform the actual transformation from a multiplication
> +   and an addition to an FMA after the decision to do so has been taken, and
> +   to then delete the multiplication statement from the function IL.  */
> +
> +struct fma_transformation_info
> +{
> +  gimple *mul_stmt;
> +  tree mul_result;
> +  tree op1;
> +  tree op2;
> +};
> +
> +/* Structure containing the current state of FMA deferring, i.e. whether we
> +   are deferring, whether to continue deferring, and all data necessary to
> +   come back and perform all deferred transformations.  */
> +
> +class fma_deferring_state
> +{
> +public:
> +  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
> +     do any deferring.  */
> +
> +  fma_deferring_state (bool perform_deferring)
> +    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
> +      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
> +
> +  /* List of FMA candidates for which the transformation has been determined
> +     possible but which, at this point in the BB analysis, we do not consider
> +     beneficial.  */
> +  auto_vec<fma_transformation_info, 8> m_candidates;
> +
> +  /* Set of multiplication results that are part of already deferred FMA
> +     candidates.  */
> +  hash_set<tree> m_mul_result_set;
> +
> +  /* The PHI that supposedly feeds back the result of one FMA to another over
> +     a loop boundary.  */
> +  gphi *m_initial_phi;
> +
> +  /* Result of the last produced FMA candidate or NULL if there has not been
> +     one.  */
> +  tree m_last_result;
> +
> +  /* If true, deferring might still be profitable.  If false, transform all
> +     candidates and no longer defer.  */
> +  bool m_deferring_p;
> +};
> +
> +/* Transform all deferred FMA candidates and mark STATE as no longer
> +   deferring.  */
> +
> +static void
> +cancel_fma_deferring (fma_deferring_state *state)
> +{
> +  if (!state->m_deferring_p)
> +    return;
> +
> +  for (unsigned i = 0; i < state->m_candidates.length (); i++)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +     fprintf (dump_file, "Generating deferred FMA\n");
> +
> +      const fma_transformation_info &fti = state->m_candidates[i];
> +      convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);
> +
> +      gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);
> +      gsi_remove (&gsi, true);
> +      release_defs (fti.mul_stmt);
> +    }
> +  state->m_deferring_p = false;
> +}
> +
> +/* If OP is an SSA name defined by a PHI node, return the PHI statement.
> +   Otherwise return NULL.  */
> +
> +static gphi *
> +result_of_phi (tree op)
> +{
> +  if (TREE_CODE (op) != SSA_NAME)
> +    return NULL;
> +
> +  return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));
> +}
> +
> +/* After processing statements of a BB and recording STATE, return true if the
> +   initial phi is fed by the last FMA candidate result or one such result from
> +   previously processed BBs marked in LAST_RESULT_SET.  */
> +
> +static bool
> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
> +                                   hash_set<tree> *last_result_set)
> +{
> +  ssa_op_iter iter;
> +  use_operand_p use;
> +  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)
> +    {
> +      tree t = USE_FROM_PTR (use);
> +      if (t == state->m_last_result
> +       || last_result_set->contains (t))
> +     return true;
> +    }
> +
> +  return false;
> +}
> +
>  /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2
>     with uses in additions and subtractions to form fused multiply-add
> -   operations.  Returns true if successful and MUL_STMT should be removed.  */
> +   operations.  Returns true if successful and MUL_STMT should be removed.
> +
> +   If STATE indicates that we are deferring FMA transformation, that means
> +   that we do not produce FMAs for basic blocks which look like:
> +
> +    <bb 6>
> +    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> +    _65 = _14 * _16;
> +    accumulator_66 = _65 + accumulator_111;
> +
> +  or its unrolled version, i.e. with several FMA candidates that feed the
> +  result of one into the addend of another.  Instead, we add them to a list
> +  in STATE and if we later discover an FMA candidate that is not part of such
> +  a chain, we go back and perform all the previously deferred candidates.  */
>  
>  static bool
> -convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
> +convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
> +                  fma_deferring_state *state)
>  {
>    tree mul_result = gimple_get_lhs (mul_stmt);
>    tree type = TREE_TYPE (mul_result);
>    gimple *use_stmt, *neguse_stmt;
> -  gassign *fma_stmt;
>    use_operand_p use_p;
>    imm_use_iterator imm_iter;
>  
> @@ -2673,6 +2871,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
>    if (has_zero_uses (mul_result))
>      return false;
>  
> +  bool check_defer
> +    = (state->m_deferring_p
> +       && (tree_to_shwi (TYPE_SIZE (type))
> +        <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS)));
> +  bool defer = check_defer;
>    /* Make sure that the multiplication statement becomes dead after
>       the transformation, thus that all uses are transformed to FMAs.
>       This means we assume that an FMA operation has the same cost
> @@ -2770,77 +2973,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
>           }
>       }
>  
> +      tree use_rhs1 = gimple_assign_rhs1 (use_stmt);
> +      tree use_rhs2 = gimple_assign_rhs2 (use_stmt);
>        /* We can't handle a * b + a * b.  */
> -      if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt))
> +      if (use_rhs1 == use_rhs2)
> +     return false;
> +      /* If deferring, make sure we are not looking at an instruction that
> +      wouldn't have existed if we were not.  */
> +      if (state->m_deferring_p
> +       && (state->m_mul_result_set.contains (use_rhs1)
> +           || state->m_mul_result_set.contains (use_rhs2)))
>       return false;
>  
> -      /* While it is possible to validate whether or not the exact form
> -      that we've recognized is available in the backend, the assumption
> -      is that the transformation is never a loss.  For instance, suppose
> -      the target only has the plain FMA pattern available.  Consider
> -      a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which
> -      is still two operations.  Consider -(a*b)-c -> fma(-a,b,-c): we
> -      still have 3 operations, but in the FMA form the two NEGs are
> -      independent and could be run in parallel.  */
> -    }
> -
> -  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
> -    {
> -      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
> -      enum tree_code use_code;
> -      tree addop, mulop1 = op1, result = mul_result;
> -      bool negate_p = false;
> -
> -      if (is_gimple_debug (use_stmt))
> -     continue;
> -
> -      use_code = gimple_assign_rhs_code (use_stmt);
> -      if (use_code == NEGATE_EXPR)
> +      if (check_defer)
>       {
> -       result = gimple_assign_lhs (use_stmt);
> -       single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
> -       gsi_remove (&gsi, true);
> -       release_defs (use_stmt);
> +       tree use_lhs = gimple_assign_lhs (use_stmt);
> +       if (state->m_last_result)
> +         {
> +           if (use_rhs2 == state->m_last_result
> +               || use_rhs1 == state->m_last_result)
> +             defer = true;
> +           else
> +             defer = false;
> +         }
> +       else
> +         {
> +           gcc_checking_assert (!state->m_initial_phi);
> +           gphi *phi;
> +           if (use_rhs1 == result)
> +             phi = result_of_phi (use_rhs2);
> +           else
> +             {
> +               gcc_assert (use_rhs2 == result);
> +               phi = result_of_phi (use_rhs1);
> +             }
>  
> -       use_stmt = neguse_stmt;
> -       gsi = gsi_for_stmt (use_stmt);
> -       use_code = gimple_assign_rhs_code (use_stmt);
> -       negate_p = true;
> -     }
> +           if (phi)
> +             {
> +               state->m_initial_phi = phi;
> +               defer = true;
> +             }
> +           else
> +             defer = false;
> +         }
>  
> -      if (gimple_assign_rhs1 (use_stmt) == result)
> -     {
> -       addop = gimple_assign_rhs2 (use_stmt);
> -       /* a * b - c -> a * b + (-c)  */
> -       if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> -         addop = force_gimple_operand_gsi (&gsi,
> -                                           build1 (NEGATE_EXPR,
> -                                                   type, addop),
> -                                           true, NULL_TREE, true,
> -                                           GSI_SAME_STMT);
> +       state->m_last_result = use_lhs;
> +       check_defer = false;
>       }
>        else
> +     defer = false;
> +
> +      /* While it is possible to validate whether or not the exact form that
> +      we've recognized is available in the backend, the assumption is that
> +      if the deferring logic above did not trigger, the transformation is
> +      never a loss.  For instance, suppose the target only has the plain FMA
> +      pattern available.  Consider a*b-c -> fma(a,b,-c): we've exchanged
> +      MUL+SUB for FMA+NEG, which is still two operations.  Consider
> +         -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA
> +      form the two NEGs are independent and could be run in parallel.  */
> +    }
> +
> +  if (defer)
> +    {
> +      fma_transformation_info fti;
> +      fti.mul_stmt = mul_stmt;
> +      fti.mul_result = mul_result;
> +      fti.op1 = op1;
> +      fti.op2 = op2;
> +      state->m_candidates.safe_push (fti);
> +      state->m_mul_result_set.add (mul_result);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
>       {
> -       addop = gimple_assign_rhs1 (use_stmt);
> -       /* a - b * c -> (-b) * c + a */
> -       if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> -         negate_p = !negate_p;
> +       fprintf (dump_file, "Deferred generating FMA for multiplication ");
> +       print_gimple_stmt (dump_file, mul_stmt, 0, 0);
> +       fprintf (dump_file, "\n");
>       }
>  
> -      if (negate_p)
> -     mulop1 = force_gimple_operand_gsi (&gsi,
> -                                        build1 (NEGATE_EXPR,
> -                                                type, mulop1),
> -                                        true, NULL_TREE, true,
> -                                        GSI_SAME_STMT);
> -
> -      fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
> -                                   FMA_EXPR, mulop1, op2, addop);
> -      gsi_replace (&gsi, fma_stmt, true);
> -      widen_mul_stats.fmas_inserted++;
> +      return false;
> +    }
> +  else
> +    {
> +      if (state->m_deferring_p)
> +     cancel_fma_deferring (state);
> +      convert_mult_to_fma_1 (mul_result, op1, op2);
> +      return true;
>      }
> -
> -  return true;
>  }
>  
>  
> @@ -3270,92 +3488,135 @@ public:
>  
>  }; // class pass_optimize_widening_mul
>  
> -unsigned int
> -pass_optimize_widening_mul::execute (function *fun)
> +/* Walker class to perform the transformation in reverse dominance order.  */
> +
> +class math_opts_dom_walker : public dom_walker
>  {
> -  basic_block bb;
> -  bool cfg_changed = false;
> +public:
> +  /* Constructor, CFG_CHANGED_P is a pointer to a boolean flag that will be
> +     set if walking modifies the CFG.  */
>  
> -  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
> -  calculate_dominance_info (CDI_DOMINATORS);
> -  renumber_gimple_stmt_uids ();
> +  math_opts_dom_walker (bool *cfg_changed_p)
> +    : dom_walker (CDI_DOMINATORS), m_last_result_set (),
> +      m_cfg_changed_p (cfg_changed_p) {}
>  
> -  FOR_EACH_BB_FN (bb, fun)
> +  /* The actual actions performed in the walk.  */
> +
> +  virtual void after_dom_children (basic_block);
> +
> +  /* Set of results of chains of multiply and add statement combinations that
> +     were not transformed into FMAs because of active deferring.  */
> +  hash_set<tree> m_last_result_set;
> +
> +  /* Pointer to a caller-provided flag that needs to be set if the CFG has
> +     been modified.  */
> +  bool *m_cfg_changed_p;
> +};
> +
> +void
> +math_opts_dom_walker::after_dom_children (basic_block bb)
> +{
> +  gimple_stmt_iterator gsi;
> +
> +  fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0);
> +
> +  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
>      {
> -      gimple_stmt_iterator gsi;
> +      gimple *stmt = gsi_stmt (gsi);
> +      enum tree_code code;
>  
> -      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
> -        {
> -       gimple *stmt = gsi_stmt (gsi);
> -       enum tree_code code;
> +      if (is_gimple_assign (stmt))
> +     {
> +       code = gimple_assign_rhs_code (stmt);
> +       switch (code)
> +         {
> +         case MULT_EXPR:
> +           if (!convert_mult_to_widen (stmt, &gsi)
> +               && !convert_expand_mult_copysign (stmt, &gsi)
> +               && convert_mult_to_fma (stmt,
> +                                       gimple_assign_rhs1 (stmt),
> +                                       gimple_assign_rhs2 (stmt),
> +                                       &fma_state))
> +             {
> +               gsi_remove (&gsi, true);
> +               release_defs (stmt);
> +               continue;
> +             }
> +           break;
> +
> +         case PLUS_EXPR:
> +         case MINUS_EXPR:
> +           if (!convert_plusminus_to_widen (&gsi, stmt, code))
> +             match_uaddsub_overflow (&gsi, stmt, code);
> +           break;
>  
> -       if (is_gimple_assign (stmt))
> +         case TRUNC_MOD_EXPR:
> +           convert_to_divmod (as_a<gassign *> (stmt));
> +           break;
> +
> +         default:;
> +         }
> +     }
> +      else if (is_gimple_call (stmt))
> +     {
> +       tree fndecl = gimple_call_fndecl (stmt);
> +       if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
>           {
> -           code = gimple_assign_rhs_code (stmt);
> -           switch (code)
> +           switch (DECL_FUNCTION_CODE (fndecl))
>               {
> -             case MULT_EXPR:
> -               if (!convert_mult_to_widen (stmt, &gsi)
> -                   && !convert_expand_mult_copysign (stmt, &gsi)
> +             case BUILT_IN_POWF:
> +             case BUILT_IN_POW:
> +             case BUILT_IN_POWL:
> +               if (gimple_call_lhs (stmt)
> +                   && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
> +                   && real_equal
> +                   (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
> +                    &dconst2)
>                     && convert_mult_to_fma (stmt,
> -                                           gimple_assign_rhs1 (stmt),
> -                                           gimple_assign_rhs2 (stmt)))
> +                                           gimple_call_arg (stmt, 0),
> +                                           gimple_call_arg (stmt, 0),
> +                                           &fma_state))
>                   {
> -                   gsi_remove (&gsi, true);
> +                   unlink_stmt_vdef (stmt);
> +                   if (gsi_remove (&gsi, true)
> +                       && gimple_purge_dead_eh_edges (bb))
> +                     *m_cfg_changed_p = true;
>                     release_defs (stmt);
>                     continue;
>                   }
>                 break;
>  
> -             case PLUS_EXPR:
> -             case MINUS_EXPR:
> -               if (!convert_plusminus_to_widen (&gsi, stmt, code))
> -                 match_uaddsub_overflow (&gsi, stmt, code);
> -               break;
> -
> -             case TRUNC_MOD_EXPR:
> -               convert_to_divmod (as_a<gassign *> (stmt));
> -               break;
> -
>               default:;
>               }
>           }
> -       else if (is_gimple_call (stmt)
> -                && gimple_call_lhs (stmt))
> -         {
> -           tree fndecl = gimple_call_fndecl (stmt);
> -           if (fndecl
> -               && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> -             {
> -               switch (DECL_FUNCTION_CODE (fndecl))
> -                 {
> -                   case BUILT_IN_POWF:
> -                   case BUILT_IN_POW:
> -                   case BUILT_IN_POWL:
> -                     if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
> -                         && real_equal
> -                              (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
> -                               &dconst2)
> -                         && convert_mult_to_fma (stmt,
> -                                                 gimple_call_arg (stmt, 0),
> -                                                 gimple_call_arg (stmt, 0)))
> -                       {
> -                         unlink_stmt_vdef (stmt);
> -                         if (gsi_remove (&gsi, true)
> -                             && gimple_purge_dead_eh_edges (bb))
> -                           cfg_changed = true;
> -                         release_defs (stmt);
> -                         continue;
> -                       }
> -                       break;
> -
> -                   default:;
> -                 }
> -             }
> -         }
> -       gsi_next (&gsi);
> +       else
> +         cancel_fma_deferring (&fma_state);
>       }
> +      gsi_next (&gsi);
>      }
> +  if (fma_state.m_deferring_p
> +      && fma_state.m_initial_phi)
> +    {
> +      gcc_checking_assert (fma_state.m_last_result);
> +      if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> +                                              &m_last_result_set))
> +     cancel_fma_deferring (&fma_state);
> +      else
> +     m_last_result_set.add (fma_state.m_last_result);
> +    }
> +}
> +
> +
> +unsigned int
> +pass_optimize_widening_mul::execute (function *fun)
> +{
> +  bool cfg_changed = false;
> +
> +  memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  renumber_gimple_stmt_uids ();
> +
> +  math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>  
>    statistics_counter_event (fun, "widening multiplications inserted",
>                           widen_mul_stats.widen_mults_inserted);
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)
