On Mon, Apr 22, 2013 at 11:16 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
> sorry for getting back to this late.
>> >> That's a larger error than I had expected from the merging, although
>> >> as you note it is an approximation so there is going to be some amount
>> >> of error. If the error is that large then maybe there is a better
>> >> merging algorithm, since in the non-LTO case we won't be able to
>> >> recompute it exactly. For cc1, what was your test case -
>> >> profiledbootstrap or something simpler? I can try to reproduce this
>> >> and see if it is another bug or just due to the approximation.
>>
>> I've been using Rong's tool to compute the exactly merged histogram
>> to compare against the gcov-merged histogram for perlbmk. I tried a couple test
>> cases - with the 5 train inputs, and with the 3 ref inputs. In both
>> cases I am seeing up to 200% or so difference in some of the working
>> set min counter values, although the error is not as high for the
>> higher working set percentages. But large enough to cause a potential
>> performance issue nonetheless.
>
> One thing that confuses me is why the error tends to be in the positive direction.
> Since we are minimizing the counter during merging, I would expect us to
> more or less consistently underestimate the counters.  Do you have any
> intuition here?

Hi Honza,

Yes, I think I know what is going on. What to do about it is a
different story. =)

From comparing the histograms merged by libgcov to the exactly merged
histograms produced by Rong's tool at larger histogram sizes, I think
the main issue is that the hotspots are slightly different between
different runs, and this causes inaccuracies in the merged histogram
unless you have global counter information.

I tried increasing the histogram to a much larger size - 16128 buckets.
This is 256 linear subranges per bit instead of the default 4. This
increased the accuracy at the high end of the profile range, but I was
still left with a fair amount of inaccuracy in the working set (except
the 99.9% bucket, which is very close now).
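
To make the bucketing concrete, here is a rough sketch (illustrative
only, not the actual libgcov indexing code) of how a counter value
could be mapped to a bucket when each power-of-two range is split into
a fixed number of linear subranges; with 4 subranges this gives a
couple hundred buckets, while a larger subrange count like 256 gives a
much finer histogram along the lines of what I tried above:

#include <stdint.h>
#include <stdio.h>

#define SUBRANGES 4        /* linear subranges per power-of-two range */
#define LOG2_SUBRANGES 2   /* log2 (SUBRANGES) */

/* Map a counter value to a histogram bucket: the bucket is chosen by
   the position of the most significant bit, refined by the next
   LOG2_SUBRANGES bits to give SUBRANGES linear buckets per
   power-of-two range.  */
static unsigned
histo_index (uint64_t v)
{
  unsigned msb, sub;

  if (v < SUBRANGES)
    return (unsigned) v;   /* the smallest values get a bucket each */

  /* Find the position of the most significant set bit.  */
  msb = 63;
  while (!(v & (UINT64_C (1) << msb)))
    msb--;

  /* The LOG2_SUBRANGES bits just below the MSB select the linear
     subrange within [2^msb, 2^(msb+1)).  */
  sub = (unsigned) ((v >> (msb - LOG2_SUBRANGES)) & (SUBRANGES - 1));
  return (msb - LOG2_SUBRANGES + 1) * SUBRANGES + sub;
}

int
main (void)
{
  uint64_t samples[] = { 0, 3, 4, 7, 8, 1000, 1000000,
                         UINT64_C (1) << 40 };
  unsigned i;
  for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
    printf ("%llu -> bucket %u\n",
            (unsigned long long) samples[i], histo_index (samples[i]));
  return 0;
}

The important point is that the bucket width doubles with every power
of two, so at the high end a single bucket covers a very wide range of
counter values.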

I confirmed that some of the hottest counters between different
training inputs (of perlbmk) are from different functions. The hottest
2 counters are the same between all 3 runs, but after that there are
some differences. When the hotspots are different, those large counter
values actually correspond to smaller counter values in the other runs.
Therefore, at the high end of the counter range, the gcov-merged
counter values are artificially high, because we are summing large
counter values that shouldn't be correlated. Similarly, some of the
smaller counter values that shouldn't be summed together are being summed,
producing artificially low merged counters. For example, in the
libgcov-merged profile, the number of 0-valued counters will be the
min of the number of 0-valued counters in the individual runs. But in
the exactly-merged profile, there are fewer 0-valued counters, because
some of the 0 counters from individual runs were non-zero in other
runs.
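
Here is a toy example (made-up numbers, not perlbmk data) of the
effect. Exact merging sums each counter with itself across runs;
merging at the histogram level has lost the counter identities, so it
effectively pairs up the sorted values, summing hottest with hottest
even when they come from different counters:

#include <stdio.h>
#include <stdlib.h>

#define N 4

static int
cmp_desc (const void *a, const void *b)
{
  long long x = *(const long long *) a, y = *(const long long *) b;
  return (x < y) - (x > y);
}

int
main (void)
{
  /* Two runs over the same four counters; the hotspot moves from
     counter 0 to counter 2 between the runs (made-up numbers).  */
  long long run1[N] = { 1000, 10, 5, 0 };
  long long run2[N] = { 5, 10, 1000, 0 };
  long long exact[N], s1[N], s2[N], approx[N];
  int i;

  /* Exact merge: sum counter-by-counter.  */
  for (i = 0; i < N; i++)
    exact[i] = run1[i] + run2[i];

  /* Histogram-style merge: counter identities are lost, so the best
     that can be done is to pair up the sorted values.  */
  for (i = 0; i < N; i++)
    {
      s1[i] = run1[i];
      s2[i] = run2[i];
    }
  qsort (s1, N, sizeof *s1, cmp_desc);
  qsort (s2, N, sizeof *s2, cmp_desc);
  for (i = 0; i < N; i++)
    approx[i] = s1[i] + s2[i];

  qsort (exact, N, sizeof *exact, cmp_desc);
  printf ("rank  exact  histogram-style\n");
  for (i = 0; i < N; i++)
    printf ("%4d  %5lld  %15lld\n", i, exact[i], approx[i]);
  /* Exact gives 1005, 1005, 20, 0; histogram-style gives
     2000, 20, 10, 0: the hot end is inflated and the rest deflated.  */
  return 0;
}

Exact merging yields two counters of 1005, but the histogram-style
pairing yields one counter at 2000 and nothing else above 20, which is
exactly the "more very high and very low values" pattern described
below.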

I confirmed by comparing graphs of the counter values produced by each
merge that there are more very high and very low counter values in the
libgcov-merged profile. Graphing the counter values accumulated from
highest to lowest (similar to what the working set computation does)
shows that the accumulated sum grows faster in the libgcov-merged case
as a result. This causes the various working set cutoff values to be
hit earlier, resulting in higher min counter values.
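
For reference, here is a rough sketch of the working set computation
itself (made-up counter values and cutoffs, and a simplified form of
what the real computation does): sort the counters, accumulate from
hottest to coldest, and record the counter value at which each cutoff
fraction of the total sum is first reached. Anything that inflates the
hot end makes the running sum reach each cutoff sooner, i.e. at a
larger min counter value:

#include <stdio.h>
#include <stdlib.h>

static int
cmp_desc (const void *a, const void *b)
{
  long long x = *(const long long *) a, y = *(const long long *) b;
  return (x < y) - (x > y);
}

/* Accumulate the sorted counters from hottest to coldest and print the
   counter value ("min counter") at which each cutoff fraction of the
   total sum is first reached.  */
static void
working_set (long long *counters, int n, const double *cutoffs, int ncut)
{
  long long total = 0, running = 0;
  int i, c = 0;

  qsort (counters, n, sizeof *counters, cmp_desc);
  for (i = 0; i < n; i++)
    total += counters[i];

  for (i = 0; i < n && c < ncut; i++)
    {
      running += counters[i];
      while (c < ncut && (double) running >= cutoffs[c] * (double) total)
        {
          printf ("%.1f%% cutoff reached at min counter %lld\n",
                  cutoffs[c] * 100.0, counters[i]);
          c++;
        }
    }
}

int
main (void)
{
  /* Made-up counter values and cutoffs, just to show the mechanics.  */
  long long counters[] = { 2000, 20, 10, 0, 500, 300, 100, 50 };
  double cutoffs[] = { 0.5, 0.9, 0.999 };
  working_set (counters, 8, cutoffs, 3);
  return 0;
}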

I haven't come up with any great ideas on how to handle this issue
though. Possibly artificially reduce the min counter value for the
working sets when there was more than one run?

>
> Also, if you have a setup with your tool, it may be nice to double check that
> the histograms produced by the LTO pass actually match the histograms produced
> by Rong's external tool.  I am not sure if his tool takes into account the
> estimated execution times of basic blocks.  If not, it may be an interesting
> experiment by itself, since we will see how well the counts alone estimate
> the times. (I would expect it to be rather good, but it is always better
> to sanity check).

Rong sent his tool for review for the gcc google 4_7 branch for now,
but it is usable for trunk too. See:
http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02141.html

It doesn't take any estimated times into account - it just merges the
counters and recomputes the histogram based on all the counters. We
could take an exactly merged profile produced by his tool and feed it
into LTO with your patch and get the time estimate comparison your
patch dumps out. But I would expect it to be the same, since like LTO
it works from the same global set of counters as input?
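
To be concrete about what I mean by "merges the counters and recomputes
the histogram", here is a rough sketch of the idea (not Rong's actual
code; the simple MSB bucketing is just for illustration):

#include <stdio.h>
#include <stdint.h>

#define NRUNS 3
#define NCOUNTERS 4
#define NBUCKETS 64

/* Simplified bucket index: position of the most significant bit.  */
static unsigned
bucket (uint64_t v)
{
  unsigned b = 0;
  while (v >>= 1)
    b++;
  return b;
}

int
main (void)
{
  /* Made-up per-run counters (rows are runs, columns are counters).  */
  uint64_t runs[NRUNS][NCOUNTERS] = {
    { 1000, 10, 5, 0 },
    { 5, 10, 1000, 0 },
    { 800, 20, 900, 1 }
  };
  uint64_t merged[NCOUNTERS] = { 0 };
  unsigned histogram[NBUCKETS] = { 0 };
  int r, c;

  /* Exact merge: sum counter-by-counter across all runs.  */
  for (r = 0; r < NRUNS; r++)
    for (c = 0; c < NCOUNTERS; c++)
      merged[c] += runs[r][c];

  /* Recompute the histogram from the merged counters.  */
  for (c = 0; c < NCOUNTERS; c++)
    histogram[bucket (merged[c])]++;

  for (c = 0; c < NBUCKETS; c++)
    if (histogram[c])
      printf ("bucket %d: %u counters\n", c, histogram[c]);
  return 0;
}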

Teresa

>>
>> It looks like the histogram min counters are always off in the larger
>> direction with the gcov merged histogram, so one possibility is to
>> reduce the min counter value by some divisor when the number of runs
>> is >1. Unfortunately, at least in the perlbmk case, the magnitude of
>> the error doesn't seem to correlate obviously with the # runs (in the
>> 5 run case the error is actually a little less than in the 3 run case
>> for perlbench). Honza, how many runs were merged in your 10x error
>> case above?
>
> It was the standard profiledbootstrap. I am not sure how many runs exactly
> are merged, but it will be a couple hundred. There is also an issue with the
> fact that libbackend is linked into more than one binary.
>>
>> One thing I noticed in the perlbmk case was that there were a number
>> of large counter values sharing histogram buckets at the high end of
>> the histogram in some of the individual run profiles, so there would
>> be a large loss of precision when merging, since the range of counter
>> values sharing a single histogram bucket is larger at the high end of the
>> histogram. I'll experiment with increasing the size of the histogram
>> to see how much that would reduce the error.
>
> Thanks! I guess increasing the histogram to a size approximately matching the
> number of counters in the binary should more or less eliminate the precision
> errors.
>
> Honza



--
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
