On Wed, Sep 5, 2018 at 4:38 AM, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote: > On 04/09/18 17:52, Kyrill Tkachov wrote: >> >> On 04/09/18 15:31, Richard Biener wrote: >>> On Tue, 4 Sep 2018, Kyrill Tkachov wrote: >>>> Hi Richard, >>>> >>>> On 31/08/18 12:07, Richard Biener wrote: >>>>> On Thu, 30 Aug 2018, Kyrill Tkachov wrote: >>>>>> Ping. >>>>>> >>>>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html >>>>>> >>>>>> Thanks, >>>>>> Kyrill >>>>>> >>>>>> On 23/08/18 18:09, Kyrill Tkachov wrote: >>>>>>> Hi Richard, >>>>>>> >>>>>>> On 23/08/18 11:13, Richard Sandiford wrote: >>>>>>>> Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> writes: >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> This patch aims to optimise sequences involving uses of 1.0 / sqrt >>>>>>>>> (a) >>>>>>>>> under -freciprocal-math and -funsafe-math-optimizations. >>>>>>>>> In particular consider: >>>>>>>>> >>>>>>>>> x = 1.0 / sqrt (a); >>>>>>>>> r1 = x * x; // same as 1.0 / a >>>>>>>>> r2 = a * x; // same as sqrt (a) >>>>>>>>> >>>>>>>>> If x, r1 and r2 are all used further on in the code, this can be >>>>>>>>> transformed into: >>>>>>>>> tmp1 = 1.0 / a >>>>>>>>> tmp2 = sqrt (a) >>>>>>>>> tmp3 = tmp1 * tmp2 >>>>>>>>> x = tmp3 >>>>>>>>> r1 = tmp1 >>>>>>>>> r2 = tmp2 >>>>>>>> Nice optimisation :-) Someone who knows the pass better should >>>>>>>> review, >>>>>>>> but: >>>>>>> Thanks for the review. >>>>>>>> There seems to be an implicit assumption that this is a win even >>>>>>>> when the r1 and r2 assignments are only conditionally executed. >>>>>>>> That's probably true, but it might be worth saying explicitly. >>>>>>> I'll admit I had not considered that case. >>>>>>> I think it won't make a difference in practice, as the really >>>>>>> expensive >>>>>>> operations here >>>>>>> are the sqrt and the division and they are on the executed path in >>>>>>> either >>>>>>> case and them >>>>>>> becoming independent should be a benefit of its own. >>>>>>>>> +/* Return TRUE if USE_STMT is a multiplication of DEF by A. */ >>>>>>>>> + >>>>>>>>> +static inline bool >>>>>>>>> +is_mult_by (gimple *use_stmt, tree def, tree a) >>>>>>>>> +{ >>>>>>>>> + if (gimple_code (use_stmt) == GIMPLE_ASSIGN >>>>>>>>> + && gimple_assign_rhs_code (use_stmt) == MULT_EXPR) >>>>>>>>> + { >>>>>>>>> + tree op0 = gimple_assign_rhs1 (use_stmt); >>>>>>>>> + tree op1 = gimple_assign_rhs2 (use_stmt); >>>>>>>>> + >>>>>>>>> + return (op0 == def && op1 == a) >>>>>>>>> + || (op0 == a && op1 == def); >>>>>>>>> + } >>>>>>>>> + return 0; >>>>>>>>> +} >>>>>>>> Seems like is_square_of could now be a light-weight wrapper around >>>>>>>> this. >>>>>>> Indeed, I've done the wrapping now. >>>>>>>>> @@ -652,6 +669,180 @@ execute_cse_reciprocals_1 >>>>>>>>> (gimple_stmt_iterator >>>>>>>>> *def_gsi, tree def) >>>>>>>>> occ_head = NULL; >>>>>>>>> } >>>>>>>>> +/* Transform sequences like >>>>>>>>> + x = 1.0 / sqrt (a); >>>>>>>>> + r1 = x * x; >>>>>>>>> + r2 = a * x; >>>>>>>>> + into: >>>>>>>>> + tmp1 = 1.0 / a; >>>>>>>>> + tmp2 = sqrt (a); >>>>>>>>> + tmp3 = tmp1 * tmp2; >>>>>>>>> + x = tmp3; >>>>>>>>> + r1 = tmp1; >>>>>>>>> + r2 = tmp2; >>>>>>>>> + depending on the uses of x, r1, r2. This removes one >>>>>>>>> multiplication >>>>>>>>> and >>>>>>>>> + allows the sqrt and division operations to execute in parallel. >>>>>>>>> + DEF_GSI is the gsi of the initial division by sqrt that defines >>>>>>>>> + DEF (x in the example abovs). */ >>>>>>>>> + >>>>>>>>> +static void >>>>>>>>> +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def) >>>>>>>>> +{ >>>>>>>>> + use_operand_p use_p; >>>>>>>>> + imm_use_iterator use_iter; >>>>>>>>> + gimple *stmt = gsi_stmt (*def_gsi); >>>>>>>>> + tree x = def; >>>>>>>>> + tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt); >>>>>>>>> + tree div_rhs1 = gimple_assign_rhs1 (stmt); >>>>>>>>> + >>>>>>>>> + if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME >>>>>>>>> + || TREE_CODE (div_rhs1) != REAL_CST >>>>>>>>> + || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1)) >>>>>>>>> + return; >>>>>>>>> + >>>>>>>>> + gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name); >>>>>>>>> + if (!is_gimple_call (sqrt_stmt) >>>>>>>>> + || !gimple_call_lhs (sqrt_stmt)) >>>>>>>>> + return; >>>>>>>>> + >>>>>>>>> + gcall *call = as_a <gcall *> (sqrt_stmt); >>>>>>>> Very minor, but: >>>>>>>> >>>>>>>> gcall *sqrt_stmt >>>>>>>> = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name)); >>>>>>>> if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt)) >>>>>>>> return; >>>>>>>> >>>>>>>> would avoid the need for the separate as_a<>, and would mean that >>>>>>>> we only call gimple_call_* on gcalls. >>>>>>> Ok. >>>>>>>>> + if (has_other_use) >>>>>>>>> + { >>>>>>>>> + /* Using the two temporaries tmp1, tmp2 from above >>>>>>>>> + the original x is now: >>>>>>>>> + x = tmp1 * tmp2. */ >>>>>>>>> + gcc_assert (mult_ssa_name); >>>>>>>>> + gcc_assert (sqr_ssa_name); >>>>>>>>> + gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt); >>>>>>>>> + >>>>>>>>> + tree new_ssa_name >>>>>>>>> + = make_temp_ssa_name (TREE_TYPE (a), NULL, >>>>>>>>> "recip_sqrt_transformed"); >>>>>>>>> + gimple *new_stmt >>>>>>>>> + = gimple_build_assign (new_ssa_name, MULT_EXPR, >>>>>>>>> + mult_ssa_name, sqr_ssa_name); >>>>>>>>> + gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT); >>>>>>>>> + gcc_assert (gsi_stmt (gsi2) == stmt); >>>>>>>>> + gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name); >>>>>>>>> + fold_stmt (&gsi2); >>>>>>>>> + update_stmt (stmt); >>>>>>>> In this case we're replacing the statement in its original position, >>>>>>>> so there's no real need to use a temporary. It seems better to >>>>>>>> change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same >>>>>>>> lhs as before. >>>>>>> Yes, that's cleaner. >>>>>>>>> @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun) >>>>>>>>> if (optimize_bb_for_size_p (bb)) >>>>>>>>> continue; >>>>>>>> Seems unnecessary to skip the new optimisation when optimising for >>>>>>>> size. >>>>>>>> Like you say, it saves a multiplication overall. Also: >>>>>>> Indeed. >>>>>>>>> + if (flag_unsafe_math_optimizations) >>>>>>>>> + { >>>>>>>>> + for (gimple_stmt_iterator gsi = gsi_after_labels (bb); >>>>>>>>> + !gsi_end_p (gsi); >>>>>>>>> + gsi_next (&gsi)) >>>>>>>>> + { >>>>>>>>> + gimple *stmt = gsi_stmt (gsi); >>>>>>>>> + >>>>>>>>> + if (gimple_has_lhs (stmt) >>>>>>>>> + && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != >>>>>>>>> NULL >>>>>>>>> + && FLOAT_TYPE_P (TREE_TYPE (def)) >>>>>>>>> + && TREE_CODE (def) == SSA_NAME >>>>>>>>> + && is_gimple_assign (stmt) >>>>>>>>> + && gimple_assign_rhs_code (stmt) == RDIV_EXPR) >>>>>>>>> + optimize_recip_sqrt (&gsi, def); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>> It looks like this could safely be done in one of the existing walks >>>>>>>> (e.g. the execute_cse_reciprocals_1 one, if we do this when >>>>>>>> optimising >>>>>>>> for size). >>>>>>> You're right. I've moved this into one of the walks above this. >>>>>>> >>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu and >>>>>>> x86_64-unknown-linux-gnu. >>>>>>> CC'ing richi as he's reviewed these kinds of patches in the past. >>>>>>> >>>>>>> Is this ok for trunk? >>>>> I wonder how it interacts with execute_cse_reciprocals_1 given it >>>>> introduces a division 1.0 / a which will be not processed by >>>>> execute_cse_reciprocals_1 given that operates by walking uses of 'a'. >>>>> That may be just a missed optimization of course. >>>> Hmm, I believe right now it doesn't interact with >>>> execute_cse_reciprocals_1 as >>>> it's either one or the other. >>>> I've left it as it is for now, but would you like us to call >>>> execute_cse_reciprocals_1 on the potentially transformed division? >>> I think that wouldn't "fix" it given execute_cse_reciprocals_1 works by >>> seeing all reciprocal uses of 'a' so it may have alrady seen two and >>> you introduced the third. But we can leave "solving" this for >>> future enhacements (I'd avoid doing two passes over the IL just because >>> of said possibility right now). >>>>> + if (has_other_use) >>>>> + { >>>>> + /* Using the two temporaries tmp1, tmp2 from above >>>>> + the original x is now: >>>>> + x = tmp1 * tmp2. */ >>>>> + gcc_assert (mult_ssa_name); >>>>> + gcc_assert (sqr_ssa_name); >>>>> + >>>>> + gimple_assign_set_rhs1 (stmt, mult_ssa_name); >>>>> + gimple_assign_set_rhs2 (stmt, sqr_ssa_name); >>>>> + gimple_assign_set_rhs_code (stmt, MULT_EXPR); >>>>> + fold_stmt_inplace (def_gsi); >>>>> + update_stmt (stmt); >>>>> + } >>>>> >>>>> so you are leaving the original stmt in place unchanged even if it is >>>>> not used? Why? Note that with -fno-call-exceptions this stmt may >>>>> throw, so you should arrange to code-generate 1./a in place of the >>>>> original division to preserve EH behavior. Watch out because then >>>>> with other uses you have to find a place to insert its computation. >>>> Ok. These are oversights on my part. I've updated the patch to modify >>>> the >>>> original division in place. The multiplication is placed after it. >>> If the division throws internally you can't insert after it, you >>> have to insert on the single non-EH edge outgoing from the BB >>> with the division. Just try a C++ testcase with -fnon-call-exceptions >>> and a try {} catch(...) {} around. >>> >>> Not sure if it's worth to handle so bailing out if >>> stmt_can_throw_internal (stmt) would be an option as well. >>>>> + if (is_square_of (stmt2, x)) >>>>> + { >>>>> + if (!sqr_stmts.contains (stmt2)) >>>>> + sqr_stmts.safe_push (stmt2); >>>>> + } >>>>> >>>>> this is quadratic in the number of square stmts... please consider >>>>> making sqr_stmts a bitmap of SSA defs (so the stmt you have now >>>>> is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))). >>>> Done. In practice I didn't see there being more than one such use, I >>>> expect >>>> them >>>> to be CSE'd, but maybe if they're in different basic blocks... >>> Oh, it just occured to me you could use FOR_EACH_IMM_USE_STMT () >>> which will only get you each stmt once... ;) Sorry for the misleading >>> bitmap suggestion. >>>>> You do not seem to restrict placement of the use stmts but insert >>>>> before the 1/sqrt(a) stmt. That possibly hoists the multiplications >>>>> where they are not needed. Consider >>>>> >>>>> x = 1./sqrt(a); >>>>> if (l) >>>>> l1 = x * 3.; >>>>> else if (l2) >>>>> l1 = x * x; >>>>> else if (l3) >>>>> l1 = a * x; >>>>> >>>>> or similar where on the path to x * 3. you now perform two extra >>>>> multiplications. >>>> Ok, I've restricted the optimisation somewhat. >>>> Now if there's an other use it won't perform the transformation unless >>>> there >>>> is already >>>> a multiplication present on the main path. That way it won't introduce >>>> multiplications >>>> on paths where there aren't any. >>> OK. >>>>> Your testcases do not cover the case of other uses at all. Or of >>>>> EH. >>>> gcc.dg/recip_sqrt_mult_1.c should handle the other uses case as tmp is a >>>> global >>>> being written to. I've added more testcases involving uses in different >>>> basic >>>> blocks >>>> and a g++.dg testcase with -fnon-call-exceptions. >>>> I'm not sure what functionality to test though apart from that it >>>> doesn't ICE >>>> and does >>>> the transformations I expect. >>> That's good enough I guess. I see you do have a testcase with >>> try/catch so I wonder why foo4 doesn't ICE when you insert after >>> the throwing stmt... maybe it doesn't throw? Ah - they probably >>> fail your new mult_on_main_path test? >> >> Yeah. I did manage to reproduce an ICE eventually by removing that check >> and enabling -ftrapping-math. >> >> I think I'll just not enter this at all if stmt_can_throw_internal (stmt) >> like you suggested. > > > And here is the version that does that. It also reverts to using a vector > for the sqr_stmts for consistency > and uses FOR_EACH_IMM_USE_STMT to iterate over the use statements. > > Bootstrapped and tested on aarch64-none-linux-gnu. > > Ok? > > Thanks, > Kyrill > > 2018-09-03 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * tree-ssa-math-opts.c (is_mult_by): New function. > (is_square_of): Use the above. > (optimize_recip_sqrt): New function. > (pass_cse_reciprocals::execute): Use the above. > > 2018-09-03 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * gcc.dg/recip_sqrt_mult_1.c: New test. > * gcc.dg/recip_sqrt_mult_2.c: Likewise. > * gcc.dg/recip_sqrt_mult_3.c: Likewise. > * gcc.dg/recip_sqrt_mult_4.c: Likewise. > * gcc.dg/recip_sqrt_mult_5.c: Likewise. > * g++.dg/recip_sqrt_mult_1.C: Likewise. > * g++.dg/recip_sqrt_mult_2.C: Likewise.
This caused: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87283 -- H.J.