Ping. Teresa On Fri, May 18, 2012 at 7:21 AM, Teresa Johnson <tejohn...@google.com> wrote: > Ping? > Teresa > > On Fri, May 11, 2012 at 6:11 AM, Teresa Johnson <tejohn...@google.com> wrote: >> Ping? >> Teresa >> >> On Fri, May 4, 2012 at 3:41 PM, Teresa Johnson <tejohn...@google.com> wrote: >>> >>> On David's suggestion, I have removed the changes that rename niter_desc >>> to >>> loop_desc from this patch to focus the patch on the unrolling changes. I >>> can >>> submit a cleanup patch to do the renaming as soon as this one goes in. >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Thanks, >>> Teresa >>> >>> Here is the new description of improvements from the original patch: >>> >>> Improved patch based on feedback. Main changes are: >>> >>> 1) Improve efficiency by caching loop analysis results in the loop >>> auxiliary >>> info structure hanging off the loop structure. Added a new routine, >>> analyze_loop_insns, to fill in information about the average and total >>> number >>> of branches, as well as whether there are any floating point set and call >>> instructions in the loop. The new routine is invoked when we first create >>> a >>> loop's niter_desc struct, and the caller (get_simple_loop_desc) has been >>> modified to handle creating a niter_desc for the fake outermost loop. >>> >>> 2) Improvements to max_unroll_with_branches: >>> - Treat the fake outermost loop (the procedure body) as we would a hot >>> outer >>> loop, i.e. compute the max unroll looking at its nested branches, instead >>> of >>> shutting off unrolling when we reach the fake outermost loop. >>> - Pull the checks previously done in the caller into the routine (e.g. >>> whether the loop iterates frequently or contains fp instructions). >>> - Fix a bug in the previous version that sometimes caused overflow in the >>> new unroll factor. >>> >>> 3) Remove float variables, and use integer computation to compute the >>> average number of branches in the loop. >>> >>> 4) Detect more types of floating point computations in the loop by walking >>> all set instructions, not just single sets. >>> >>> 2012-05-04 Teresa Johnson <tejohn...@google.com> >>> >>> * doc/invoke.texi: Update the documentation with new params. >>> * loop-unroll.c (max_unroll_with_branches): New function. >>> (decide_unroll_constant_iterations, >>> decide_unroll_runtime_iterations): >>> Add heuristic to avoid increasing branch mispredicts when >>> unrolling. >>> (decide_peel_simple, decide_unroll_stupid): Retrieve number of >>> branches from niter_desc instead of via function that walks loop. >>> * loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >>> function, and add guards to enable this function to work for the >>> outermost loop. >>> * cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >>> (num_loop_branches): Remove. >>> * cfgloop.h (struct loop_desc): Added new fields to cache >>> additional >>> loop analysis information. >>> (num_loop_branches): Remove. >>> (analyze_loop_insns): Declare. >>> * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param. >>> (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto. >>> >>> Index: doc/invoke.texi >>> =================================================================== >>> --- doc/invoke.texi (revision 187013) >>> +++ doc/invoke.texi (working copy) >>> @@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop. >>> @item max-unswitch-level >>> The maximum number of branches unswitched in a single loop. >>> >>> +@item min-iter-unroll-with-branches >>> +Minimum iteration count to ignore branch effects when unrolling. >>> + >>> +@item unroll-outer-loop-branch-budget >>> +Maximum number of branches allowed in hot outer loop region after unroll. >>> + >>> @item lim-expensive >>> The minimum cost of an expensive expression in the loop invariant motion. >>> >>> Index: loop-unroll.c >>> =================================================================== >>> --- loop-unroll.c (revision 187013) >>> +++ loop-unroll.c (working copy) >>> @@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc >>> basic_block); >>> static rtx get_expansion (struct var_to_expand *); >>> >>> +/* Compute the maximum number of times LOOP can be unrolled without >>> exceeding >>> + a branch budget, which can increase branch mispredictions. The number >>> of >>> + branches is computed by weighting each branch with its expected >>> execution >>> + probability through the loop based on profile data. If no profile >>> feedback >>> + data exists, simply return the current NUNROLL factor. */ >>> + >>> +static unsigned >>> +max_unroll_with_branches(struct loop *loop, unsigned nunroll) >>> +{ >>> + struct loop *outer; >>> + struct niter_desc *outer_desc = 0; >>> + int outer_niters = 1; >>> + int frequent_iteration_threshold; >>> + unsigned branch_budget; >>> + struct niter_desc *desc = get_simple_loop_desc (loop); >>> + >>> + /* Ignore loops with FP computation as these tend to benefit much more >>> + consistently from unrolling. */ >>> + if (desc->has_fp) >>> + return nunroll; >>> + >>> + frequent_iteration_threshold = PARAM_VALUE >>> (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES); >>> + if (expected_loop_iterations (loop) >= (unsigned) >>> frequent_iteration_threshold) >>> + return nunroll; >>> + >>> + /* If there was no profile feedback data, av_num_branches will be 0 >>> + and we won't limit unrolling. If the av_num_branches is at most 1, >>> + also don't limit unrolling as the back-edge branch will not be >>> duplicated. */ >>> + if (desc->av_num_branches <= 1) >>> + return nunroll; >>> + >>> + /* Walk up the loop tree until we find a hot outer loop in which the >>> current >>> + loop is nested. At that point we will compute the number of times >>> the >>> + current loop can be unrolled based on the number of branches in the >>> hot >>> + outer loop. */ >>> + outer = loop_outer (loop); >>> + /* The loop structure contains a fake outermost loop, so this should >>> always >>> + be non-NULL for our current loop. */ >>> + gcc_assert (outer); >>> + >>> + /* Walk up the loop tree until we either find a hot outer loop or hit >>> the >>> + fake outermost loop at the root. */ >>> + while (true) >>> + { >>> + outer_desc = get_simple_loop_desc (outer); >>> + >>> + /* Stop if we hit the fake outermost loop at the root of the tree, >>> + which includes the whole procedure. */ >>> + if (!loop_outer (outer)) >>> + break; >>> + >>> + if (outer_desc->const_iter) >>> + outer_niters *= outer_desc->niter; >>> + else if (outer->header->count) >>> + outer_niters *= expected_loop_iterations (outer); >>> + >>> + /* If the outer loop has enough iterations to be considered hot, >>> then >>> + we can stop our upwards loop tree traversal and examine the >>> current >>> + outer loop. */ >>> + if (outer_niters >= frequent_iteration_threshold) >>> + break; >>> + >>> + outer = loop_outer (outer); >>> + } >>> + >>> + gcc_assert(outer); >>> + >>> + /* Assume that any call will cause the branch budget to be exceeded, >>> + and that we can't unroll the current loop without increasing >>> + mispredicts. */ >>> + if (outer_desc->has_call) >>> + return 0; >>> + >>> + /* Otherwise, compute the maximum number of times current loop can be >>> + unrolled without exceeding our branch budget. First we subtract >>> + off the outer loop's average branch count from the budget. Note >>> + that this includes the branches in the current loop. This yields >>> + the number of branches left in the budget for the unrolled copies. >>> + We divide this by the number of branches in the current loop that >>> + must be duplicated when we unroll, which is the total average >>> + number of branches minus the back-edge branch. This yields the >>> + number of new loop body copies that can be created by unrolling >>> + without exceeding the budget, to which we add 1 to get the unroll >>> + factor. Note that the "outermost loop" may be the whole procedure >>> + if we did not find a hot enough enclosing loop. */ >>> + branch_budget = PARAM_VALUE (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET); >>> + if (outer_desc->av_num_branches > branch_budget) >>> + return 0; >>> + /* We already returned early if desc->av_num_branches <= 1. */ >>> + return (branch_budget - outer_desc->av_num_branches) >>> + / (desc->av_num_branches - 1) + 1; >>> +} >>> + >>> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >>> void >>> unroll_and_peel_loops (int flags) >>> @@ -522,6 +615,7 @@ static void >>> decide_unroll_constant_iterations (struct loop *loop, int flags) >>> { >>> unsigned nunroll, nunroll_by_av, best_copies, best_unroll = 0, n_copies, >>> i; >>> + unsigned nunroll_branches; >>> struct niter_desc *desc; >>> >>> if (!(flags & UAP_UNROLL)) >>> @@ -565,6 +659,21 @@ decide_unroll_constant_iterations (struct loop *lo >>> return; >>> } >>> >>> + /* Be careful when unrolling loops with branches inside -- it can >>> increase >>> + the number of mispredicts. */ >>> + if (desc->num_branches > 1) >>> + { >>> + nunroll_branches = max_unroll_with_branches (loop, nunroll); >>> + if (nunroll > nunroll_branches) >>> + nunroll = nunroll_branches; >>> + if (nunroll <= 1) >>> + { >>> + if (dump_file) >>> + fprintf (dump_file, ";; Not unrolling, contains branches\n"); >>> + return; >>> + } >>> + } >>> + >>> /* Check whether the loop rolls enough to consider. */ >>> if (desc->niter < 2 * nunroll) >>> { >>> @@ -802,7 +911,7 @@ unroll_loop_constant_iterations (struct loop *loop >>> static void >>> decide_unroll_runtime_iterations (struct loop *loop, int flags) >>> { >>> - unsigned nunroll, nunroll_by_av, i; >>> + unsigned nunroll, nunroll_by_av, nunroll_branches, i; >>> struct niter_desc *desc; >>> >>> if (!(flags & UAP_UNROLL)) >>> @@ -856,6 +965,21 @@ decide_unroll_runtime_iterations (struct loop *loo >>> return; >>> } >>> >>> + /* Be careful when unrolling loops with branches inside -- it can >>> increase >>> + the number of mispredicts. */ >>> + if (desc->num_branches > 1) >>> + { >>> + nunroll_branches = max_unroll_with_branches (loop, nunroll); >>> + if (nunroll > nunroll_branches) >>> + nunroll = nunroll_branches; >>> + if (nunroll <= 1) >>> + { >>> + if (dump_file) >>> + fprintf (dump_file, ";; Not unrolling, contains branches\n"); >>> + return; >>> + } >>> + } >>> + >>> /* If we have profile feedback, check whether the loop rolls. */ >>> if ((loop->header->count >>> && expected_loop_iterations (loop) < 2 * nunroll) >>> @@ -1233,7 +1357,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> >>> /* Do not simply peel loops with branches inside -- it increases number >>> of mispredicts. */ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not peeling, contains branches\n"); >>> @@ -1394,7 +1518,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> >>> /* Do not unroll loops with branches inside -- it increases number >>> of mispredicts. */ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >>> Index: loop-iv.c >>> =================================================================== >>> --- loop-iv.c (revision 187013) >>> +++ loop-iv.c (working copy) >>> @@ -2948,8 +2948,12 @@ get_simple_loop_desc (struct loop *loop) >>> /* At least desc->infinite is not always initialized by >>> find_simple_loop_exit. */ >>> desc = XCNEW (struct niter_desc); >>> - iv_analysis_loop_init (loop); >>> - find_simple_exit (loop, desc); >>> + if (loop->latch != EXIT_BLOCK_PTR) >>> + { >>> + iv_analysis_loop_init (loop); >>> + find_simple_exit (loop, desc); >>> + } >>> + analyze_loop_insns (loop, desc); >>> loop->aux = desc; >>> >>> if (desc->simple_p && (desc->assumptions || desc->infinite)) >>> Index: cfgloop.c >>> =================================================================== >>> --- cfgloop.c (revision 187013) >>> +++ cfgloop.c (working copy) >>> @@ -1156,24 +1156,98 @@ get_loop_exit_edges (const struct loop *loop) >>> return edges; >>> } >>> >>> -/* Counts the number of conditional branches inside LOOP. */ >>> +/* Determine if INSN is a floating point set. */ >>> >>> -unsigned >>> -num_loop_branches (const struct loop *loop) >>> +static bool >>> +insn_has_fp_set(rtx insn) >>> { >>> - unsigned i, n; >>> - basic_block * body; >>> + int i; >>> + rtx pat = PATTERN(insn); >>> + if (GET_CODE (pat) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >>> + else if (GET_CODE (pat) == PARALLEL) >>> + { >>> + for (i = 0; i < XVECLEN (pat, 0); i++) >>> + { >>> + rtx sub = XVECEXP (pat, 0, i); >>> + if (GET_CODE (sub) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >>> + } >>> + } >>> + return false; >>> +} >>> >>> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently >>> counts >>> + the number of conditional branch instructions, calls and fp >>> instructions, >>> + as well as the average number of branches executed per iteration. */ >>> >>> +void >>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >>> +{ >>> + unsigned i, nbranch; >>> + gcov_type weighted_nbranch; >>> + bool has_call, has_fp; >>> + basic_block * body, bb; >>> + rtx insn; >>> + gcov_type header_count = loop->header->count; >>> + >>> + nbranch = weighted_nbranch = 0; >>> + has_call = has_fp = false; >>> + >>> body = get_loop_body (loop); >>> - n = 0; >>> for (i = 0; i < loop->num_nodes; i++) >>> - if (EDGE_COUNT (body[i]->succs) >= 2) >>> - n++; >>> + { >>> + bb = body[i]; >>> + >>> + if (EDGE_COUNT (bb->succs) >= 2) >>> + { >>> + nbranch++; >>> + >>> + /* If this block is executed less frequently than the header >>> (loop >>> + entry), then it is weighted based on its execution count, >>> which >>> + will be turned into a ratio compared to the loop header >>> below. */ >>> + if (bb->count < header_count) >>> + weighted_nbranch += bb->count; >>> + >>> + /* When it is executed more frequently than the header (i.e. it >>> is >>> + in a nested inner loop), simply weight the branch the same >>> as the >>> + header execution count, so that it will contribute 1 branch >>> to >>> + the ratio computed below. */ >>> + else >>> + weighted_nbranch += header_count; >>> + } >>> + >>> + /* No need to iterate through the instructions below if >>> + both flags have already been set. */ >>> + if (has_call && has_fp) >>> + continue; >>> + >>> + FOR_BB_INSNS (bb, insn) >>> + { >>> + if (!INSN_P (insn)) >>> + continue; >>> + >>> + if (!has_call) >>> + has_call = CALL_P (insn); >>> + >>> + if (!has_fp) >>> + has_fp = insn_has_fp_set (insn); >>> + } >>> + } >>> free (body); >>> >>> - return n; >>> + desc->num_branches = nbranch; >>> + /* Now divide the weights computed above by the loop header execution >>> count, >>> + to compute the average number of branches through the loop. By >>> adding >>> + header_count/2 to the numerator we round to nearest with integer >>> + division. */ >>> + if (header_count != 0) >>> + desc->av_num_branches >>> + = (weighted_nbranch + header_count/2) / header_count; >>> + else >>> + desc->av_num_branches = 0; >>> + desc->has_call = has_call; >>> + desc->has_fp = has_fp; >>> } >>> >>> /* Adds basic block BB to LOOP. */ >>> Index: cfgloop.h >>> =================================================================== >>> --- cfgloop.h (revision 187013) >>> +++ cfgloop.h (working copy) >>> @@ -249,7 +249,6 @@ extern basic_block *get_loop_body_in_custom_order >>> >>> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >>> edge single_exit (const struct loop *); >>> -extern unsigned num_loop_branches (const struct loop *); >>> >>> extern edge loop_preheader_edge (const struct loop *); >>> extern edge loop_latch_edge (const struct loop *); >>> @@ -359,7 +358,8 @@ struct rtx_iv >>> }; >>> >>> /* The description of an exit from the loop and of the number of >>> iterations >>> - till we take the exit. */ >>> + till we take the exit. Also includes other information used primarily >>> + by the loop unroller. */ >>> >>> struct niter_desc >>> { >>> @@ -400,6 +400,18 @@ struct niter_desc >>> >>> /* The number of iterations of the loop. */ >>> rtx niter_expr; >>> + >>> + /* The number of branches in the loop. */ >>> + unsigned num_branches; >>> + >>> + /* The number of executed branches per iteration. */ >>> + unsigned av_num_branches; >>> + >>> + /* Whether the loop contains a call instruction. */ >>> + bool has_call; >>> + >>> + /* Whether the loop contains fp instructions. */ >>> + bool has_fp; >>> }; >>> >>> extern void iv_analysis_loop_init (struct loop *); >>> @@ -413,6 +425,7 @@ extern void iv_analysis_done (void); >>> >>> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >>> extern void free_simple_loop_desc (struct loop *loop); >>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >>> >>> static inline struct niter_desc * >>> simple_loop_desc (struct loop *loop) >>> Index: params.def >>> =================================================================== >>> --- params.def (revision 187013) >>> +++ params.def (working copy) >>> @@ -312,6 +312,15 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >>> "The maximum depth of a loop nest we completely peel", >>> 8, 0, 0) >>> >>> +DEFPARAM(PARAM_MIN_ITER_UNROLL_WITH_BRANCHES, >>> + "min-iter-unroll-with-branches", >>> + "Minimum iteration count to ignore branch effects when unrolling", >>> + 50, 0, 0) >>> +DEFPARAM(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET, >>> + "unroll-outer-loop-branch-budget", >>> + "Maximum number of branches allowed in hot outer loop region after >>> unroll", >>> + 25, 0, 0) >>> + >>> /* The maximum number of insns of an unswitched loop. */ >>> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >>> "max-unswitch-insns", >>> >>> -- >>> This patch is available for review at >>> http://codereview.appspot.com/6099055 >> >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
-- Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413