Ping. Teresa
On Wed, Jul 18, 2012 at 8:48 AM, Teresa Johnson <tejohn...@google.com> wrote: > Ping (retrying ping in plain text mode so that it goes through properly). > > Thanks, > Teresa > > On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohn...@google.com> wrote: >> Ports some patches related to improving FDO program summary information >> and using it to guide loop unrolling from google branches to mainline. >> The patch is enhanced to add additional summary information to aid >> in determining hot/cold decisions. >> >> The original patch description is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and further discussion about incorporating onto mainline is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >> >> Honza, can you take a look to see if this patch meets your needs? >> >> Full description: >> >> This patch adds new program summary information to the gcov >> profile files that indicate how many profiled counts compose >> the majority of the program's execution time. This is used to >> provide an indication of the overall code size of the program. >> >> The new profile summary information is then used to guide >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> This patch also pulls in dependent portions of google/main r187660 that cache >> additional loop analysis results in the niter_desc auxiliary information >> hanging off the loop structure (the optimization portions of that >> change are not included here, and have an outstanding review request >> for mainline). >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-07-11 Teresa Johnson <tejohn...@google.com> >> >> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >> (gcov_compute_cutoff_values): Ditto. >> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >> information. >> * gcc/doc/invoke.texi (roll much): Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >> (gcov_read_summary): Read new summary info. >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary >> info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/coverage.c (read_counts_file): Propagate new summary info. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >> >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189413) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. */ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) >> +{ >> + struct gcov_info *gi_ptr; >> + const struct gcov_fn_info *gfi_ptr; >> + const struct gcov_ctr_info *ci_ptr; >> + struct gcov_ctr_summary *cs_ptr; >> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >> + gcov_unsigned_t c_num; >> + gcov_type *value_array; >> + gcov_type cum, cum_cutoff; >> + char *cutoff_str; >> + unsigned cutoff_perc; >> + >> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >> + if (cutoff_str && strlen (cutoff_str)) >> + cutoff_perc = atoi (cutoff_str); >> + else >> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >> + >> + /* This currently only applies to arc counters. */ >> + t_ix = GCOV_COUNTER_ARCS; >> + >> + /* First check if there are any counts recorded for this counter. */ >> + cs_ptr = &(sum->ctrs[t_ix]); >> + if (!cs_ptr->num) >> + return; >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. >> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. */ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >> + >> + /* Next, walk through all the per-object structures and save each of >> + the count values in value_array. */ >> + index = 0; >> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. */ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_arry >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; >> + } >> + } >> + >> + /* Sort all the counter values by descending value and finally >> + accumulate the values from hottest on down until reaching >> + the cutoff value computed earlier. */ >> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >> + sort_by_reverse_gcov_value); >> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >> + cum += value_array[c_num]; >> + /* Record the number of counters required to reach the cutoff value, >> + as well as the counter value that caused cum to reach the cutoff. */ >> + cs_ptr->num_hot_counters = c_num; >> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >> + free (value_array); >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +461,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_cutoff_values (&this_prg); >> >> { >> /* Check if the level of dirs to strip off specified. */ >> @@ -598,7 +713,14 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + { >> + cs_prg->num = cs_tprg->num; >> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >> + { >> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >> + } >> + } >> cs_prg->sum_all += cs_tprg->sum_all; >> if (cs_prg->run_max < cs_tprg->run_max) >> cs_prg->run_max = cs_tprg->run_max; >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189413) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops >> -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. >> -ftree-reassoc @gol >> -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit >> @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops >> @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >> @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling >> and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. >> >> Index: gcc/gcov-io.c >> =================================================================== >> --- gcc/gcov-io.c (revision 189413) >> +++ gcc/gcov-io.c (working copy) >> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> gcov_write_unsigned (csum->num); >> + gcov_write_unsigned (csum->num_hot_counters); >> + gcov_write_counter (csum->hot_cutoff_value); >> gcov_write_unsigned (csum->runs); >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> csum->num = gcov_read_unsigned (); >> + csum->num_hot_counters = gcov_read_unsigned (); >> + csum->hot_cutoff_value = gcov_read_counter (); >> csum->runs = gcov_read_unsigned (); >> csum->sum_all = gcov_read_counter (); >> csum->run_max = gcov_read_counter (); >> Index: gcc/gcov-io.h >> =================================================================== >> --- gcc/gcov-io.h (revision 189413) >> +++ gcc/gcov-io.h (working copy) >> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete >> */ >> #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) >> #define GCOV_TAG_SUMMARY_LENGTH \ >> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >> >> /* Counters that are collected. */ >> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >> struct gcov_ctr_summary >> { >> gcov_unsigned_t num; /* number of counters. */ >> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >> + percent of sum_all. */ >> + gcov_type hot_cutoff_value; /* value of smallest counter included in >> + num_hot_counters. */ >> gcov_unsigned_t runs; /* number of program runs */ >> gcov_type sum_all; /* sum of all counters accumulated. */ >> gcov_type run_max; /* maximum value on a single run. */ >> Index: gcc/loop-unroll.c >> =================================================================== >> --- gcc/loop-unroll.c (revision 189413) >> +++ gcc/loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> optimizations on innermost loops (with single exception) because >> @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be >> constrained >> + based on code footprint estimates. Returns the codesize-based factor to >> be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For threshold >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). */ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. */ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || profile_info->num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which >> in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (profile_info->num_hot_counters < size_threshold * 2) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller >> than >> + the threshold (i.e. the loop entry is hot enough, the divisor is >> set >> + to 1 so the unroll/peel factor is not reduced. When it is bigger >> + than the ratio, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For appliations that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. >> */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL)) >> { >> @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. >> */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / >> loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL_ALL)) >> { >> @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. >> */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: gcc/coverage.c >> =================================================================== >> --- gcc/coverage.c (revision 189413) >> +++ gcc/coverage.c (working copy) >> @@ -245,6 +245,14 @@ read_counts_file (void) >> gcov_read_summary (&sum); >> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >> { >> + if (summary.ctrs[ix].num_hot_counters >> + < sum.ctrs[ix].num_hot_counters) >> + { >> + summary.ctrs[ix].num_hot_counters >> + = sum.ctrs[ix].num_hot_counters; >> + summary.ctrs[ix].hot_cutoff_value >> + = sum.ctrs[ix].hot_cutoff_value; >> + } >> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >> Index: gcc/loop-iv.c >> =================================================================== >> --- gcc/loop-iv.c (revision 189413) >> +++ gcc/loop-iv.c (working copy) >> @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. */ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: gcc/common.opt >> =================================================================== >> --- gcc/common.opt (revision 189413) >> +++ gcc/common.opt (working copy) >> @@ -1551,6 +1551,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large code >> footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2108,6 +2112,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of large code >> footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: gcc/cfgloop.c >> =================================================================== >> --- gcc/cfgloop.c (revision 189413) >> +++ gcc/cfgloop.c (working copy) >> @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. */ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently >> counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. >> */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as >> the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. */ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution >> count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: gcc/cfgloop.h >> =================================================================== >> --- gcc/cfgloop.h (revision 189413) >> +++ gcc/cfgloop.h (working copy) >> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -366,7 +365,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -407,6 +407,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. */ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: gcc/params.def >> =================================================================== >> --- gcc/params.def (revision 189413) >> +++ gcc/params.def (working copy) >> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. >> */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop >> unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when >> unrolling >> + and peeling. */ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> Index: gcc/gcov-dump.c >> =================================================================== >> --- gcc/gcov-dump.c (revision 189413) >> +++ gcc/gcov-dump.c (working copy) >> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >> { >> printf ("\n"); >> print_prefix (filename, 0, 0); >> - printf ("\t\tcounts=%u, runs=%u", >> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >> + summary.ctrs[ix].num, >> + summary.ctrs[ix].num_hot_counters, >> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >> + summary.ctrs[ix].runs); >> >> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >> >> -- >> This patch is available for review at http://codereview.appspot.com/6351086 > > > > -- > Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413