Hi,
> + Honza
> 
> This patch may be a candidate for trunk as well. This feature not only
> allows profile collection with much less overhead (for multi-thread
> programs with hot regions, the slow down can be significant due to
> cache ping-pong effect of counter update) without sacrificing too much
> the performance.

Hmm, it is interesting idea.  
> 
> A known limitation is that value profiling is not yet sampled, but it
> does not seem to cause problems.

For the performance alone, we probably don't need to care that much
given the fact that we value porfile only relatively expensive operations.
But if we want to have the turn off/on feature, then i gueess we need to
guard everything.  It is not much of pain to add the code generating
conditionals everywhere, after all.
> > +@item -fprofile-generate-sampling
> > +@opindex -fprofile-generate-sampling
> > +
> > +Enable sampling for instrumented binaries.  Instead of recording every 
> > event,
> > +record only every N-th event, where N (the sampling rate) can be set either
> > +at compile time using
> > +@option{--param profile-generate-sampling-rate=@var{value}}, or
> > +at execution start time through environment variable 
> > @samp{GCOV_SAMPLING_RATE}.
> > +
> > +At this time sampling applies only to branch counters.  A sampling rate of 
> > 100
> > +decreases instrumentated binary slowdown from up to 20x for heavily 
> > threaded
> > +applications down to around 2x.  @option{-fprofile-correction} is always
> > +needed with sampling.

You probably want to make -fprofile-correction to be implied in this case. 
Perhaps even drop it into the gcov infos so we don't require -fprofile-generate
and -fprofile-use values to match here.

Also --param is generally intended for GCC developers, I would go with
-fprofile-generate-sampling=num.

Note that other way of getting sampling cost down is probably to apply the
estimated profile to minimal spanning tree construction.  It should put the 
counters
on less executed paths instead of often executed and at mainline we now 
construct
estimated profile before profiling (this is kind of ordering problem as for the
statistics we need to do otherwise, but I would be happy with extra statistics
pass that would be enabled only for my collecting script).

Option is also to have counters in automatic vars and flush them upon function
return (perhaps with the sampling rate trick and optional locking for full
thread safety)...  It might end up cheaper than adding many ifs everywhere.

> > Index: gcc/profile.h
> > ===================================================================
> > --- gcc/profile.h       (revision 173136)
> > +++ gcc/profile.h       (working copy)
> > @@ -47,4 +47,10 @@ extern gcov_type sum_edge_counts (VEC (edge, gc) *
> >  extern void init_node_map (void);
> >  extern void del_node_map (void);
> >
> > +/* Implement sampling to avoid writing to edge counters very often.
> > +   Many concurrent writes to the same counters, or to counters that share
> > +   the same cache line leads to up to 30x slowdown on an application 
> > running
> > +   on 8 CPUs.  With sampling, the slowdown reduced to 2x.  */
> > +extern void add_sampling_to_edge_counters (void);

The comment belong to place function is implemented.
> >
> > +fprofile-generate-sampling
> > +Common Var(flag_profile_generate_sampling)
> > +Turn on instrumentation sampling with -fprofile-generate with rate set by 
> > --param profile-generate-sampling-rate or environment variable 
> > GCOV_SAMPLING_RATE

I think -fprofile-generate and friends need some @ shuggar.  Probably @option.
> > Index: gcc/tree-profile.c
> > ===================================================================
> > --- gcc/tree-profile.c  (revision 173136)
> > +++ gcc/tree-profile.c  (working copy)
> > @@ -31,6 +31,8 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "coretypes.h"
> >  #include "tm.h"
> >  #include "flags.h"
> > +#include "target.h"
> > +#include "output.h"
> >  #include "regs.h"
> >  #include "function.h"
> >  #include "basic-block.h"
> > @@ -44,9 +46,14 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "value-prof.h"
> >  #include "cgraph.h"
> >  #include "output.h"
> > +#include "params.h"
> > +#include "profile.h"

For mainline, don't forget about dependencies.  Do we really need all of this?
> > +/* Insert STMT_IF around given sequence of consecutive statements in the
> > +   same basic block starting with STMT_START, ending with STMT_END.  */
> > +
> > +static void
> > +insert_if_then (gimple stmt_start, gimple stmt_end, gimple stmt_if)
> > +{
> > +  gimple_stmt_iterator gsi;
> > +  basic_block bb_original, bb_before_if, bb_after_if;
> > +  edge e_if_taken, e_then_join;
> > +
> > +  gsi = gsi_for_stmt (stmt_start);
> > +  gsi_insert_before (&gsi, stmt_if, GSI_SAME_STMT);
> > +  bb_original = gsi_bb (gsi);
> > +  e_if_taken = split_block (bb_original, stmt_if);
> > +  e_if_taken->flags &= ~EDGE_FALLTHRU;
> > +  e_if_taken->flags |= EDGE_TRUE_VALUE;
> > +  e_then_join = split_block (e_if_taken->dest, stmt_end);
> > +  bb_before_if = e_if_taken->src;
> > +  bb_after_if = e_then_join->dest;

On mainline when we do profile estimation before profiling instrumentaiton, now,
you really want to update profile for performance here.
> > +  make_edge (bb_before_if, bb_after_if, EDGE_FALSE_VALUE);
> > +}
> > +
> > +/* Transform:
> > +
> > +   ORIGINAL CODE
> > +
> > +   Into:
> > +
> > +   __gcov_sample_counter++;
> > +   if (__gcov_sample_counter >= __gcov_sampling_rate)
> > +     {
> > +       __gcov_sample_counter = 0;
> > +       ORIGINAL CODE
> > +     }

Hmm, I think the probability that internal loop of program will interfere with
sampling rate is relatively high, but I see it is bit hard to do something about
it.  Can we think of some very basic randomization of sampling_rate?

Honza

Reply via email to