On Fri, Jul 27, 2012 at 3:14 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
>> Index: libgcc/libgcov.c
>> ===================================================================
>> --- libgcc/libgcov.c    (revision 189893)
>> +++ libgcc/libgcov.c    (working copy)
>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>>    return 1;
>>  }
>>
>> +/* Used by qsort to sort gcov values in descending order.  */
>> +
>> +static int
>> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
>> +{
>> +  const gcov_type a = *(gcov_type const *)pa;
>> +  const gcov_type b = *(gcov_type const *)pb;
>> +
>> +  if (b > a)
>> +    return 1;
>> +  else if (b == a)
>> +    return 0;
>> +  else
>> +    return -1;
>> +}
>> +
>> +/* Determines the number of counters required to cover a given percentage
>> +   of the total sum of execution counts in the summary, which is then also
>> +   recorded in SUM.  */
>> +
>> +static void
>> +gcov_compute_cutoff_values (struct gcov_summary *sum)
>
> This looks like a good idea to me to drive the hot/cold partitioning, even if
> it is not quite accurate (you have no idea how many instructions a given
> counter is guarding).

Thanks - right, I think it will be a good approximation.

>
> To reduce overhead on embedded systems, what about just doing a histogram
> with say 128 steps instead of dragging in qsort? This also avoids the need to
> produce a copy of all the counters.

I like that suggestion. I'll need 1024 buckets, though, to get the 99.9% cutoff
I am using right now (or 128 buckets plus one extra to track what would be
bucket 1024, which is roughly the 99.9% point). I can use something like a
binary search to locate the right bucket without a linear scan or a divide, and
I can keep track of the minimum value in each bucket in order to identify the
correct value for hot_cutoff_value.
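To make that concrete, here is a minimal sketch (not part of the patch) of
locating a bucket by binary search over precomputed lower bounds, so no
per-counter divide is needed, while tracking each bucket's minimum value.
NUM_BUCKETS, hist_bucket, hist_bucket_index, and hist_record are all
hypothetical names for illustration:

```c
#include <stdint.h>

typedef int64_t gcov_type;        /* matches libgcov's counter type */

#define NUM_BUCKETS 1024          /* hypothetical count from the reply above */

/* Per-bucket state; min_value supplies the eventual hot_cutoff_value.  */
struct hist_bucket
{
  unsigned num_counters;
  gcov_type min_value;
};

/* Locate the bucket whose lower bound is the largest one <= VALUE, by
   binary search over BOUNDS[0..NUM_BUCKETS-1] (ascending), avoiding a
   per-counter linear scan or divide.  */
static unsigned
hist_bucket_index (const gcov_type *bounds, gcov_type value)
{
  unsigned lo = 0, hi = NUM_BUCKETS - 1;
  while (lo < hi)
    {
      unsigned mid = lo + (hi - lo + 1) / 2;
      if (value >= bounds[mid])
        lo = mid;
      else
        hi = mid - 1;
    }
  return lo;
}

/* Record one counter VALUE into the histogram, updating the bucket's
   population and minimum.  */
static void
hist_record (struct hist_bucket *hist, const gcov_type *bounds,
             gcov_type value)
{
  unsigned ix = hist_bucket_index (bounds, value);
  if (hist[ix].num_counters++ == 0 || value < hist[ix].min_value)
    hist[ix].min_value = value;
}
```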

>> +
>> +  /* Determine the cumulative counter value at the specified cutoff
>> +     percentage and record the percentage for use by gcov consumers.
>> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
>> +     and if so, do the divide first.  */
>> +  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
>> +    /* Overflow, do the divide first.  */
>> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
>> +  else
>> +    /* Otherwise multiply first to get the correct value for small
>> +       values of sum_all.  */
>> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
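As an aside, the overflow guard in the hunk above can be exercised in
isolation. This is a self-contained sketch of the same technique, written with
unsigned arithmetic since the wraparound that the divide-back check relies on
is only well-defined for unsigned types in C (the patch itself uses gcov_type):

```c
#include <stdint.h>

/* Compute (sum_all * cutoff_perc) / 1000, falling back to dividing
   first when the multiply would wrap.  */
static uint64_t
scale_cutoff (uint64_t sum_all, uint64_t cutoff_perc)
{
  if (cutoff_perc != 0
      && (sum_all * cutoff_perc) / cutoff_perc != sum_all)
    /* Overflow: divide first, losing a little precision.  */
    return sum_all / 1000 * cutoff_perc;
  /* Otherwise multiply first so small values of sum_all scale exactly.  */
  return sum_all * cutoff_perc / 1000;
}
```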
>
> To further keep embedded systems (at least a bit) happier, I guess one could
> do this without generic 64-bit divide operations.  I guess 1000 can be bumped
> up to 1024; the small error is harmless here.
>
> Actually it may be easier to simply embed the histogram into the gcov summary
> so one can control the cutoff with --param in the compiler at --profile-use
> time.  It seems reasonable to me to trade 128 values per file for the extra
> flexibility.

Both you and David have requested more values, so I will go ahead and
implement this suggestion in this patch. I can use the 128 + 1 bucket
approach I described above to get data at roughly every 1% plus the 99.9%
point. That should be enough granularity for the optimizations (and the
smallest bucket doesn't really need to be fed back, as it can largely be
extrapolated from the others). This will require feeding back an array of
128 pairs of values (num_hot_counters and hot_cutoff_value).
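For illustration, a consumer of such a fed-back histogram could derive the hot
cutoff by walking buckets from hottest to coldest. This is a hypothetical
sketch, not the patch's code: HIST_SIZE, hist_entry, and hist_hot_cutoff are
invented names, and the divide-free scaling follows the 1024 suggestion above.

```c
#include <stdint.h>

typedef int64_t gcov_type;

#define HIST_SIZE 128   /* per the 128-bucket suggestion */

/* One fed-back bucket: counter population, summed counter value, and
   the smallest counter value seen (the hot_cutoff_value candidate).  */
struct hist_entry
{
  unsigned num_counters;
  gcov_type cum_value;
  gcov_type min_value;
};

/* Walk from the hottest bucket down, accumulating counter sums until
   CUTOFF_SCALED/1024 of SUM_ALL is covered; return the min value of the
   last bucket included.  The >>10 avoids a generic 64-bit divide.  */
static gcov_type
hist_hot_cutoff (const struct hist_entry *h, gcov_type sum_all,
                 unsigned cutoff_scaled)
{
  gcov_type target = (sum_all >> 10) * cutoff_scaled;
  gcov_type cum = 0;
  int i;
  for (i = HIST_SIZE - 1; i >= 0; i--)
    {
      cum += h[i].cum_value;
      if (cum >= target)
        return h[i].min_value;
    }
  return 0;
}
```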

>
>> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
>> +    {
>> +      if (!gi_ptr->merge[t_ix])
>> +        continue;
>> +
>> +      /* Find the appropriate index into the gcov_ctr_info array
>> +         for the counter we are currently working on based on the
>> +         existence of the merge function pointer for this object.  */
>> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
>> +        {
>> +          if (gi_ptr->merge[i])
>> +            ctr_info_ix++;
>> +        }
>> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
>> +        {
>> +          gfi_ptr = gi_ptr->functions[f_ix];
>> +
>> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
>> +            continue;
>> +
>> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
>> +          /* Sanity check that there are enough entries in value_array
>> +            for this function's counters. Gracefully handle the case when
>> +            there are not, in case something in the profile info is
>> +            corrupted.  */
>> +          c_num = ci_ptr->num;
>> +          if (index + c_num > cs_ptr->num)
>> +            c_num = cs_ptr->num - index;
>> +          /* Copy over this function's counter values.  */
>> +          memcpy (&value_array[index], ci_ptr->values,
>> +                  sizeof (gcov_type) * c_num);
>> +          index += c_num;
>
> I wonder if the loop walking all counters can't be fused into one of the other
> loops we already have.

Not with the histogram approach, as the preceding walk (in the caller,
gcov_exit) will be needed to find the min and max counter values first.
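The two-pass shape being described — one walk for min/max, a second to bucket
each counter — could look roughly like this. fill_histogram and the linear
bucketing are illustrative assumptions, not the patch's code:

```c
#include <stdint.h>

typedef int64_t gcov_type;

#define HIST_SIZE 128

/* First pass: find min/max over N counters.  Second pass: drop each
   counter into one of HIST_SIZE linear buckets spanning [min, max].
   (Linear buckets and the scaling below are illustrative only; the
   multiply could overflow for very large counter values.)  */
static void
fill_histogram (const gcov_type *counters, unsigned n,
                unsigned hist[HIST_SIZE])
{
  gcov_type min, max, range;
  unsigned i;

  for (i = 0; i < HIST_SIZE; i++)
    hist[i] = 0;
  if (n == 0)
    return;

  min = max = counters[0];
  for (i = 1; i < n; i++)
    {
      if (counters[i] < min) min = counters[i];
      if (counters[i] > max) max = counters[i];
    }
  range = max - min + 1;

  for (i = 0; i < n; i++)
    {
      unsigned ix = (unsigned) (((counters[i] - min) * HIST_SIZE) / range);
      if (ix >= HIST_SIZE)
        ix = HIST_SIZE - 1;
      hist[ix]++;
    }
}
```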

>> +        }
>> Index: gcc/doc/invoke.texi
>> ===================================================================
>> --- gcc/doc/invoke.texi (revision 189893)
>> +++ gcc/doc/invoke.texi (working copy)
>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>>  -fprefetch-loop-arrays @gol
>>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>>  -fprofile-generate=@var{path} @gol
>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>>  -ftree-switch-conversion -ftree-tail-merge @gol
>>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>>  -fwhole-program -fwpa -fuse-linker-plugin @gol
>> @@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
>>  @option{-funroll-all-loops} implies the same options as
>>  @option{-funroll-loops}.
>>
>> +@item -funroll-codesize-limit
>> +@opindex funroll-codesize-limit
>> +Limit loop unrolling of non-const non-FP loops in a profile feedback
>> +compilation under estimates of a large code footprint. Enabled by default
>> +with @option{-fprofile-use}. Code size and execution weight thresholds are
>> +controlled by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>
> Let's handle the cutoff logic independently of the loop bits, which I can not
> approve.  The patch is OK with the change suggested.

Ok, I will split this into two patches: one with the gcov changes only, and a
follow-on patch with my loop unroller changes.

Thanks!
Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
