On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > I can go ahead with the histogram approach. There is some roundoff
>> > error from the working set scaling approach that can affect different
>> > merging orders as you note, although I think this only really affects the
>> > small counter values. The other place where saving/merging the histogram
>
> Do you have any intuition on why simple maximization merging (which is safe
> wrt ordering) would be a bad idea?

When you say "maximalization merging" are you talking about the
histogram merging approach I mentioned a few emails back (my response
on Aug 19) where we assume the same relative order of hotness in the
counters between runs, and accumulate the counter values in the
histogram in that order?

This would be inaccurate if different runs exercised different areas
of the code, and thus the counters would be ordered in the histogram
differently.
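To make the assumption concrete, here is a rough sketch in C of that rank-order accumulation scheme. The function names and the flat counter-array representation are just for illustration (the real gcov histogram is bucketed by value, not stored as a sorted array):

```c
#include <assert.h>
#include <stdlib.h>

/* Descending comparison for qsort. */
static int cmp_desc(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x < y) - (x > y);
}

/* Illustrative rank-order merge: sort this run's counters hottest-first,
 * then accumulate element-wise into the merged array.  This encodes the
 * assumption that the i-th hottest counter is the same counter in every
 * run.  Summing two descending arrays element-wise keeps the merged
 * array descending, so the invariant is preserved across runs. */
void merge_by_rank(long long *merged, long long *run, size_t n)
{
    qsort(run, n, sizeof *run, cmp_desc);
    for (size_t i = 0; i < n; i++)
        merged[i] += run[i];
}
```

So if a counter is rank 3 in one run but rank 7 in another (different code paths exercised), its contributions land in different slots, which is exactly the inaccuracy described above.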

>
> We care only about working set size around the top of the histogram, and I
> would say

For optimizations that care about the boundary between hot and cold,
such as code layout, I think we will also care about the smaller values
in the histogram (to have a good idea of what constitutes a cold block
counter value).
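For example, the cold cutoff would be derived from the lower end of the histogram roughly like this (a sketch with an assumed sorted-array interface, not the actual summary code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: given counters sorted descending, return the
 * smallest counter value inside the "hot" region covering PCT percent
 * of total execution counts.  Any block counter below this value would
 * be treated as cold, e.g. for code layout splitting decisions. */
long long hot_cutoff(const long long *sorted_desc, size_t n, double pct)
{
    long long total = 0, cum = 0;
    for (size_t i = 0; i < n; i++)
        total += sorted_desc[i];
    long long goal = (long long)(total * pct / 100.0);
    for (size_t i = 0; i < n; i++) {
        cum += sorted_desc[i];
        if (cum >= goal)
            return sorted_desc[i];
    }
    return 0;
}
```

Note the result depends on the full shape of the histogram, not just the top, which is why the small-value entries still matter.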

> that we should sort of optimize for the largest (in the number of blocks in
> the hot area) of the train runs.  One way where things will get messed up is
> when the working set is about the same but the runs are of different sizes,
> so all the blocks get accounted into two different buckets.

I'm not sure I understand the last sentence - is this something that
would not get handled by merging the histogram entries as I described
earlier? Or do you have a different merging approach in mind?

>
> But in general I do not think there is a reasonably accurate way to merge the
> profiles without actually streaming out all counter IDs in every bucket, so
> perhaps this will work well enough.  If not, we can in the future introduce a
> per-program global summary file that will contain the counters to be merged
> accurately.

Sounds good.

>
>> > would help is when the distribution of counter values in the histogram
>> > varies between runs, say for example, the hottest loop is much hotter in a
>> > subsequent run, but the rest of the counter values stay largely consistent.
>> > Note, however, that if the hotspots are different in different runs, then
>> > merging either the histogram or the working set will have issues. The only
>> > way to detect this is to recompute the histogram/working set from scratch
>> > from the merged counters.
>> >
>> > I wonder in practice, even when there are a lot of simultaneous runs going
>> > like in a gcc bootstrap, if we could get reasonably accurate summary
>> > recomputation without global locking. The reason is that as long as the
>> > actual counter updates are safe as they are now due to the individual file
>> > locking, the inaccuracy in the recomputed summary information will not grow
>> > over time, and there is a reasonable chance that the last run to complete
>> > and merge will recompute the summary from the final merged counter values
>> > and get it right (or only be off by a little bit if there are a couple of
>> > runs completing simultaneously at the end). But this can be investigated as
>> > a follow on to this patch.
>> >
>>
>>
>> The main concern is probably the build reproducibility in gcc bootstrap
>> with FDO.
>
> Hmm, you mean in the first pass update every file with new counters, and in
> the second pass just update the summaries?

Right, that's what I had in mind (what you have described in #2 below).

> OK, so I guess we went through
>  1) two-pass updating with a race in between passes.
>  2) two-pass updating with the first pass updating counters and the second
>     having a race only for the summary update (i.e. no races for counters).
>  3) two-pass updating with flocking (and some way to handle detected
>     deadlocks).
>  4) one-pass updating with histogram merging + maximization of the working
>     set.  (We do not really need to scale the buckets; we can simply merge
>     the histograms and then multiply them by nruns before comparing to the
>     actual counters.

By merging the histograms (and accumulating the counter values stored
there as we merge), I don't think we need to multiply the counter
values by nruns, do we?
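Roughly what I have in mind, sketched in C (the bucket layout, field names, and bucket count are illustrative only, not the actual gcov histogram structure):

```c
#include <assert.h>

/* Illustrative bucket count; the real histogram size would differ. */
#define NBUCKETS 64

/* Each bucket records how many counters fell into it and the sum of
 * their values. */
struct bucket {
    unsigned long long num_counters;
    unsigned long long sum_counters;
};

/* Merge SRC into DST bucket-wise.  Because the counter sums are
 * accumulated directly during the merge, the merged histogram already
 * reflects the total profile, and no post-hoc multiplication by the
 * number of runs should be needed. */
void merge_histograms(struct bucket *dst, const struct bucket *src)
{
    for (int i = 0; i < NBUCKETS; i++) {
        dst[i].num_counters += src[i].num_counters;
        dst[i].sum_counters += src[i].sum_counters;
    }
}
```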

>      This assumes that the working sets of all runs are about the same, but
>      it should work reasonably in practice, I think.)
>
> I guess 3/4 are acceptable WRT bootstrap reproducibility.
>
> I have no experience with flocking a large number of files, or with the
> portability of this solution, i.e. to Windows.  If you think that 2) would be
> too inaccurate in practice and 3) has a chance to be portable, we could go
> for it.  It will solve the precision problems and will also work for LIPO
> summaries.  I would be curious about the effect on profiledbootstrap time if
> you implement it.

I'm hoping that 2) will be accurate enough in practice, but it will
need some investigation.
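For reference, the kind of per-file advisory locking that option 3) implies would look something like this on POSIX systems (a sketch only, with an assumed helper name; this is exactly the part whose Windows portability is the open question):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustrative helper: take an exclusive advisory lock on an open
 * .gcda file descriptor, blocking until it is granted, so the
 * read-merge-write cycle is serialized across simultaneous runs.
 * Returns 0 on success, -1 on error (like fcntl itself). */
int lock_gcda(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;    /* exclusive write lock        */
    fl.l_whence = SEEK_SET; /* l_start=0, l_len=0: lock the whole file */
    return fcntl(fd, F_SETLKW, &fl);
}
```

The deadlock handling Honza mentions would come in when a run holds one file's lock while waiting on another; F_SETLKW can return EDEADLK in that case, so the caller would need a retry or back-off path.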

Thanks,
Teresa

>
> Honza
>>
>> David
>>
>>
>>
>> >
>> > Thanks,
>> > Teresa
>> >
>> > >
>> >> > >>
>> >> > >>
>> >> > >> >  2) Do we plan to add some features in the near future that will
>> >> > >> >     anyway require global locking?
>> >> > >> >     I guess LIPO itself does not count, since it streams its data
>> >> > >> >     into an independent file as you mentioned earlier, and locking
>> >> > >> >     the LIPO file is not that hard.
>> >> > >> >     Does LIPO stream everything into that common file, or does it
>> >> > >> >     use a combination of gcda files and a common summary?
>> >> > >>
>> >> > >> Actually, LIPO module grouping information is stored in gcda files.
>> >> > >> It is also stored in a separate .imports file (one per object) ---
>> >> > >> this is primarily used by our build system for dependence
>> >> > >> information.
>> >> > >
>> >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does
>> >> > > LIPO behave on GCC bootstrap?
>> >> >
>> >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
>> >> > main problem for application build -- the link time (for debug build)
>> >> > is.
>> >>
>> >> I was primarily curious how LIPO's runtime analysis fares in the
>> >> situation where you do very many small train runs on a rather large app
>> >> (sure, GCC is small compared to Google's use case ;).
>> >> >
>> >> > > (i.e. it does a lot more work in the libgcov module per each
>> >> > > invocation, so I am curious if it is practically useful at all).
>> >> > >
>> >> > > With an LTO-based solution a lot can probably be pushed to link time?
>> >> > > Before actual GCC starts from the linker plugin, the LIPO module can
>> >> > > read gcov CFGs from gcda files and do all the merging/updating/CFG
>> >> > > construction that is currently performed at runtime, right?
>> >> >
>> >> > The dynamic cgraph build and analysis is still done at runtime.
>> >> > However, with the new implementation, the FE is no longer involved.
>> >> > The gcc driver is modified to understand module grouping, and LTO is
>> >> > used to merge the streamed output from aux modules.
>> >>
>> >> I see. Are there any fundamental reasons why it cannot be done at
>> >> link time, when all gcda files are available? Why is the grouping not
>> >> done inside the linker plugin?
>> >>
>> >> Honza
>> >> >
>> >> >
>> >> > David
>> >>
>> >
>> >
>> >
>> > --
>> > Teresa Johnson | Software Engineer |  tejohn...@google.com |  408-460-2413
>> >



-- 
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
