Re: [PATCH] Optimise sqrt reciprocal multiplications

H.J. Lu Tue, 11 Sep 2018 10:05:11 -0700

On Wed, Sep 5, 2018 at 4:38 AM, Kyrill  Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
> On 04/09/18 17:52, Kyrill Tkachov wrote:
>>
>> On 04/09/18 15:31, Richard Biener wrote:
>>> On Tue, 4 Sep 2018, Kyrill Tkachov wrote:
>>>> Hi Richard,
>>>>
>>>> On 31/08/18 12:07, Richard Biener wrote:
>>>>> On Thu, 30 Aug 2018, Kyrill Tkachov wrote:
>>>>>> Ping.
>>>>>>
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html
>>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>> On 23/08/18 18:09, Kyrill Tkachov wrote:
>>>>>>> Hi Richard,
>>>>>>>
>>>>>>> On 23/08/18 11:13, Richard Sandiford wrote:
>>>>>>>> Kyrill  Tkachov <kyrylo.tkac...@foss.arm.com> writes:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> This patch aims to optimise sequences involving uses of 1.0 / sqrt
>>>>>>>>> (a)
>>>>>>>>> under -freciprocal-math and -funsafe-math-optimizations.
>>>>>>>>> In particular consider:
>>>>>>>>>
>>>>>>>>> x = 1.0 / sqrt (a);
>>>>>>>>> r1 = x * x;  // same as 1.0 / a
>>>>>>>>> r2 = a * x; // same as sqrt (a)
>>>>>>>>>
>>>>>>>>> If x, r1 and r2 are all used further on in the code, this can be
>>>>>>>>> transformed into:
>>>>>>>>> tmp1 = 1.0 / a
>>>>>>>>> tmp2 = sqrt (a)
>>>>>>>>> tmp3 = tmp1 * tmp2
>>>>>>>>> x = tmp3
>>>>>>>>> r1 = tmp1
>>>>>>>>> r2 = tmp2
>>>>>>>> Nice optimisation :-)  Someone who knows the pass better should
>>>>>>>> review,
>>>>>>>> but:
>>>>>>> Thanks for the review.
>>>>>>>> There seems to be an implicit assumption that this is a win even
>>>>>>>> when the r1 and r2 assignments are only conditionally executed.
>>>>>>>> That's probably true, but it might be worth saying explicitly.
>>>>>>> I'll admit I had not considered that case.
>>>>>>> I think it won't make a difference in practice, as the really
>>>>>>> expensive
>>>>>>> operations here
>>>>>>> are the sqrt and the division and they are on the executed path in
>>>>>>> either
>>>>>>> case and them
>>>>>>> becoming independent should be a benefit of its own.
>>>>>>>>> +/* Return TRUE if USE_STMT is a multiplication of DEF by A.  */
>>>>>>>>> +
>>>>>>>>> +static inline bool
>>>>>>>>> +is_mult_by (gimple *use_stmt, tree def, tree a)
>>>>>>>>> +{
>>>>>>>>> +  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
>>>>>>>>> +      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
>>>>>>>>> +    {
>>>>>>>>> +      tree op0 = gimple_assign_rhs1 (use_stmt);
>>>>>>>>> +      tree op1 = gimple_assign_rhs2 (use_stmt);
>>>>>>>>> +
>>>>>>>>> +      return (op0 == def && op1 == a)
>>>>>>>>> +          || (op0 == a && op1 == def);
>>>>>>>>> +    }
>>>>>>>>> +  return 0;
>>>>>>>>> +}
>>>>>>>> Seems like is_square_of could now be a light-weight wrapper around
>>>>>>>> this.
>>>>>>> Indeed, I've done the wrapping now.
>>>>>>>>> @@ -652,6 +669,180 @@ execute_cse_reciprocals_1
>>>>>>>>> (gimple_stmt_iterator
>>>>>>>>> *def_gsi, tree def)
>>>>>>>>>      occ_head = NULL;
>>>>>>>>>    }
>>>>>>>>>    +/* Transform sequences like
>>>>>>>>> +   x = 1.0 / sqrt (a);
>>>>>>>>> +   r1 = x * x;
>>>>>>>>> +   r2 = a * x;
>>>>>>>>> +   into:
>>>>>>>>> +   tmp1 = 1.0 / a;
>>>>>>>>> +   tmp2 = sqrt (a);
>>>>>>>>> +   tmp3 = tmp1 * tmp2;
>>>>>>>>> +   x = tmp3;
>>>>>>>>> +   r1 = tmp1;
>>>>>>>>> +   r2 = tmp2;
>>>>>>>>> +   depending on the uses of x, r1, r2.  This removes one
>>>>>>>>> multiplication
>>>>>>>>> and
>>>>>>>>> +   allows the sqrt and division operations to execute in parallel.
>>>>>>>>> +   DEF_GSI is the gsi of the initial division by sqrt that defines
>>>>>>>>> +   DEF (x in the example abovs). */
>>>>>>>>> +
>>>>>>>>> +static void
>>>>>>>>> +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
>>>>>>>>> +{
>>>>>>>>> +  use_operand_p use_p;
>>>>>>>>> +  imm_use_iterator use_iter;
>>>>>>>>> +  gimple *stmt = gsi_stmt (*def_gsi);
>>>>>>>>> +  tree x = def;
>>>>>>>>> +  tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
>>>>>>>>> +  tree div_rhs1 = gimple_assign_rhs1 (stmt);
>>>>>>>>> +
>>>>>>>>> +  if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
>>>>>>>>> +      || TREE_CODE (div_rhs1) != REAL_CST
>>>>>>>>> +      || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
>>>>>>>>> +    return;
>>>>>>>>> +
>>>>>>>>> +  gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name);
>>>>>>>>> +  if (!is_gimple_call (sqrt_stmt)
>>>>>>>>> +      || !gimple_call_lhs (sqrt_stmt))
>>>>>>>>> +    return;
>>>>>>>>> +
>>>>>>>>> +  gcall *call = as_a <gcall *> (sqrt_stmt);
>>>>>>>> Very minor, but:
>>>>>>>>
>>>>>>>>     gcall *sqrt_stmt
>>>>>>>>       = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
>>>>>>>>     if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
>>>>>>>>       return;
>>>>>>>>
>>>>>>>> would avoid the need for the separate as_a<>, and would mean that
>>>>>>>> we only call gimple_call_* on gcalls.
>>>>>>> Ok.
>>>>>>>>> +  if (has_other_use)
>>>>>>>>> +    {
>>>>>>>>> +      /* Using the two temporaries tmp1, tmp2 from above
>>>>>>>>> +     the original x is now:
>>>>>>>>> +     x = tmp1 * tmp2.  */
>>>>>>>>> +      gcc_assert (mult_ssa_name);
>>>>>>>>> +      gcc_assert (sqr_ssa_name);
>>>>>>>>> +      gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt);
>>>>>>>>> +
>>>>>>>>> +      tree new_ssa_name
>>>>>>>>> +    = make_temp_ssa_name (TREE_TYPE (a), NULL,
>>>>>>>>> "recip_sqrt_transformed");
>>>>>>>>> +      gimple *new_stmt
>>>>>>>>> +    = gimple_build_assign (new_ssa_name, MULT_EXPR,
>>>>>>>>> +                   mult_ssa_name, sqr_ssa_name);
>>>>>>>>> +      gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT);
>>>>>>>>> +      gcc_assert (gsi_stmt (gsi2) == stmt);
>>>>>>>>> + gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name);
>>>>>>>>> +      fold_stmt (&gsi2);
>>>>>>>>> +      update_stmt (stmt);
>>>>>>>> In this case we're replacing the statement in its original position,
>>>>>>>> so there's no real need to use a temporary.  It seems better to
>>>>>>>> change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same
>>>>>>>> lhs as before.
>>>>>>> Yes, that's cleaner.
>>>>>>>>> @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun)
>>>>>>>>>          if (optimize_bb_for_size_p (bb))
>>>>>>>>>            continue;
>>>>>>>> Seems unnecessary to skip the new optimisation when optimising for
>>>>>>>> size.
>>>>>>>> Like you say, it saves a multiplication overall. Also:
>>>>>>> Indeed.
>>>>>>>>> +      if (flag_unsafe_math_optimizations)
>>>>>>>>> +    {
>>>>>>>>> +      for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
>>>>>>>>> +           !gsi_end_p (gsi);
>>>>>>>>> +           gsi_next (&gsi))
>>>>>>>>> +        {
>>>>>>>>> +          gimple *stmt = gsi_stmt (gsi);
>>>>>>>>> +
>>>>>>>>> +          if (gimple_has_lhs (stmt)
>>>>>>>>> +          && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) !=
>>>>>>>>> NULL
>>>>>>>>> +          && FLOAT_TYPE_P (TREE_TYPE (def))
>>>>>>>>> +          && TREE_CODE (def) == SSA_NAME
>>>>>>>>> +          && is_gimple_assign (stmt)
>>>>>>>>> +          && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
>>>>>>>>> +        optimize_recip_sqrt (&gsi, def);
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>> It looks like this could safely be done in one of the existing walks
>>>>>>>> (e.g. the execute_cse_reciprocals_1 one, if we do this when
>>>>>>>> optimising
>>>>>>>> for size).
>>>>>>> You're right. I've moved this into one of the walks above this.
>>>>>>>
>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu and
>>>>>>> x86_64-unknown-linux-gnu.
>>>>>>> CC'ing richi as he's reviewed these kinds of patches in the past.
>>>>>>>
>>>>>>> Is this ok for trunk?
>>>>> I wonder how it interacts with execute_cse_reciprocals_1 given it
>>>>> introduces a division 1.0 / a which will be not processed by
>>>>> execute_cse_reciprocals_1 given that operates by walking uses of 'a'.
>>>>> That may be just a missed optimization of course.
>>>> Hmm, I believe right now it doesn't interact with
>>>> execute_cse_reciprocals_1 as
>>>> it's either one or the other.
>>>> I've left it as it is for now, but would you like us to call
>>>> execute_cse_reciprocals_1 on the potentially transformed division?
>>> I think that wouldn't "fix" it given execute_cse_reciprocals_1 works by
>>> seeing all reciprocal uses of 'a' so it may have alrady seen two and
>>> you introduced the third.  But we can leave "solving" this for
>>> future enhacements (I'd avoid doing two passes over the IL just because
>>> of said possibility right now).
>>>>> +  if (has_other_use)
>>>>> +    {
>>>>> +      /* Using the two temporaries tmp1, tmp2 from above
>>>>> +        the original x is now:
>>>>> +        x = tmp1 * tmp2.  */
>>>>> +      gcc_assert (mult_ssa_name);
>>>>> +      gcc_assert (sqr_ssa_name);
>>>>> +
>>>>> +      gimple_assign_set_rhs1 (stmt, mult_ssa_name);
>>>>> +      gimple_assign_set_rhs2 (stmt, sqr_ssa_name);
>>>>> +      gimple_assign_set_rhs_code (stmt, MULT_EXPR);
>>>>> +      fold_stmt_inplace (def_gsi);
>>>>> +      update_stmt (stmt);
>>>>> +    }
>>>>>
>>>>> so you are leaving the original stmt in place unchanged even if it is
>>>>> not used?  Why?  Note that with -fno-call-exceptions this stmt may
>>>>> throw, so you should arrange to code-generate 1./a in place of the
>>>>> original division to preserve EH behavior.  Watch out because then
>>>>> with other uses you have to find a place to insert its computation.
>>>> Ok. These are oversights on my part. I've updated the patch to modify
>>>> the
>>>> original division in place. The multiplication is placed after it.
>>> If the division throws internally you can't insert after it, you
>>> have to insert on the single non-EH edge outgoing from the BB
>>> with the division.  Just try a C++ testcase with -fnon-call-exceptions
>>> and a try {} catch(...) {} around.
>>>
>>> Not sure if it's worth to handle so bailing out if
>>> stmt_can_throw_internal (stmt) would be an option as well.
>>>>> +      if (is_square_of (stmt2, x))
>>>>> +       {
>>>>> +         if (!sqr_stmts.contains (stmt2))
>>>>> +           sqr_stmts.safe_push (stmt2);
>>>>> +       }
>>>>>
>>>>> this is quadratic in the number of square stmts... please consider
>>>>> making sqr_stmts a bitmap of SSA defs (so the stmt you have now
>>>>> is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))).
>>>> Done. In practice I didn't see there being more than one such use, I
>>>> expect
>>>> them
>>>> to be CSE'd, but maybe if they're in different basic blocks...
>>> Oh, it just occured to me you could use FOR_EACH_IMM_USE_STMT ()
>>> which will only get you each stmt once... ;)  Sorry for the misleading
>>> bitmap suggestion.
>>>>> You do not seem to restrict placement of the use stmts but insert
>>>>> before the 1/sqrt(a) stmt.  That possibly hoists the multiplications
>>>>> where they are not needed.  Consider
>>>>>
>>>>>    x = 1./sqrt(a);
>>>>>    if (l)
>>>>>      l1 = x * 3.;
>>>>>    else if (l2)
>>>>>      l1 = x * x;
>>>>>    else if (l3)
>>>>>      l1 = a * x;
>>>>>
>>>>> or similar where on the path to x * 3. you now perform two extra
>>>>> multiplications.
>>>> Ok, I've restricted the optimisation somewhat.
>>>> Now if there's an other use it won't perform the transformation unless
>>>> there
>>>> is already
>>>> a multiplication present on the main path. That way it won't introduce
>>>> multiplications
>>>> on paths where there aren't any.
>>> OK.
>>>>> Your testcases do not cover the case of other uses at all.  Or of
>>>>> EH.
>>>> gcc.dg/recip_sqrt_mult_1.c should handle the other uses case as tmp is a
>>>> global
>>>> being written to. I've added more testcases involving uses in different
>>>> basic
>>>> blocks
>>>> and a g++.dg testcase with -fnon-call-exceptions.
>>>> I'm not sure what functionality to test though apart from that it
>>>> doesn't ICE
>>>> and does
>>>> the transformations I expect.
>>> That's good enough I guess.  I see you do have a testcase with
>>> try/catch so I wonder why foo4 doesn't ICE when you insert after
>>> the throwing stmt...  maybe it doesn't throw?  Ah - they probably
>>> fail your new mult_on_main_path test?
>>
>> Yeah. I did manage to reproduce an ICE eventually by removing that check
>> and enabling -ftrapping-math.
>>
>> I think I'll just not enter this at all if stmt_can_throw_internal (stmt)
>> like you suggested.
>
>
> And here is the version that does that. It also reverts to using a vector
> for the sqr_stmts for consistency
> and uses FOR_EACH_IMM_USE_STMT to iterate over the use statements.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Ok?
>
> Thanks,
> Kyrill
>
> 2018-09-03  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>     * tree-ssa-math-opts.c (is_mult_by): New function.
>     (is_square_of): Use the above.
>     (optimize_recip_sqrt): New function.
>     (pass_cse_reciprocals::execute): Use the above.
>
> 2018-09-03  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>     * gcc.dg/recip_sqrt_mult_1.c: New test.
>     * gcc.dg/recip_sqrt_mult_2.c: Likewise.
>     * gcc.dg/recip_sqrt_mult_3.c: Likewise.
>     * gcc.dg/recip_sqrt_mult_4.c: Likewise.
>     * gcc.dg/recip_sqrt_mult_5.c: Likewise.
>     * g++.dg/recip_sqrt_mult_1.C: Likewise.
>     * g++.dg/recip_sqrt_mult_2.C: Likewise.


This caused:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87283

-- 
H.J.

Re: [PATCH] Optimise sqrt reciprocal multiplications

Reply via email to