Re: [PATCH] AutoFDO patch for trunk

2014-10-22 Thread Rainer Orth
Dehao Chen de...@google.com writes:

 The updated patch attached. Will commit the patch in 2~3 hours if no
 objection is received.

Apart from the AIX bootstrap failure your patch introduced, it also
breaks Solaris bootstrap:

In file included from ./config.h:6:0,
 from /vol/gcc/src/hg/trunk/local/gcc/auto-profile.c:25:
./auto-host.h:1976:0: error: _FILE_OFFSET_BITS redefined [-Werror]
 #define _FILE_OFFSET_BITS 64
 ^
In file included from /usr/include/iso/string_iso.h:24:0,
 from /usr/include/string.h:11,
 from /vol/gcc/src/hg/trunk/local/gcc/auto-profile.c:21:
/var/gcc/regression/trunk/11-gcc/build/prev-gcc/include-fixed/sys/feature_tests.h:213:0:
 note: this is the location of the previous definition
 #define _FILE_OFFSET_BITS 32
 ^

As Joseph is repeating over and over again, *nothing* must be included
before config.h, and auto-profile.c violates this.

The following patch at least allows the file to compile without errors;
no idea if this the best order for the headers involved.

diff --git a/gcc/auto-profile.c b/gcc/auto-profile.c
--- a/gcc/auto-profile.c
+++ b/gcc/auto-profile.c
@@ -18,12 +18,13 @@ You should have received a copy of the G
 along with GCC; see the file COPYING3.  If not see
 http://www.gnu.org/licenses/.  */
 
+#include config.h
+#include system.h
+
 #include string.h
 #include map
 #include set
 
-#include config.h
-#include system.h
 #include coretypes.h
 #include tree.h
 #include tree-pass.h

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


Re: [PATCH] AutoFDO patch for trunk

2014-10-22 Thread David Edelsohn
 Rainer Orth writes:

 As Joseph is repeating over and over again, *nothing* must be included
 before config.h, and auto-profile.c violates this.

 The following patch at least allows the file to compile without errors;
 no idea if this the best order for the headers involved.


The patch also fixes the problem on AIX.


Thanks, David


Re: [PATCH] AutoFDO patch for trunk

2014-10-22 Thread Xinliang David Li
Can someone pre-approve the patch so that Dehao can check it in after
basic testing?

David

On Wed, Oct 22, 2014 at 10:06 AM, David Edelsohn dje@gmail.com wrote:
 Rainer Orth writes:

 As Joseph is repeating over and over again, *nothing* must be included
 before config.h, and auto-profile.c violates this.

 The following patch at least allows the file to compile without errors;
 no idea if this the best order for the headers involved.


 The patch also fixes the problem on AIX.


 Thanks, David


Re: [PATCH] AutoFDO patch for trunk

2014-10-22 Thread Dehao Chen
The patch tested OK. And I think it's a trivial patch, and already
committed it to trunk.

About the perf parser. I'm syncing the toolchain to head which should
already have newer kernel support.

Thanks,
Dehao

On Wed, Oct 22, 2014 at 10:07 AM, Xinliang David Li davi...@google.com wrote:
 Can someone pre-approve the patch so that Dehao can check it in after
 basic testing?

 David

 On Wed, Oct 22, 2014 at 10:06 AM, David Edelsohn dje@gmail.com wrote:
 Rainer Orth writes:

 As Joseph is repeating over and over again, *nothing* must be included
 before config.h, and auto-profile.c violates this.

 The following patch at least allows the file to compile without errors;
 no idea if this the best order for the headers involved.


 The patch also fixes the problem on AIX.


 Thanks, David


Re: [PATCH] AutoFDO patch for trunk

2014-10-21 Thread Markus Trippelsdorf
On 2014.10.20 at 14:21 -0700, Dehao Chen wrote:
  +If @var{path} is specified, GCC looks at the @var{path} to find
  +the profile feedback data files.
  +
  +In order to collect AutoFDO profile, you need to have:
  +
  +1. A linux system with linux perf support
  +2. An Intel processor with last branch record (LBR) support
 
  Is it really a requirement or will it only make profiles more accurate?
 
 It's optional, updated the doc.

Can you give an example on how this is supposed to work on non Intel
CPUs? It is not obvious to me.

-- 
Markus


Re: [PATCH] AutoFDO patch for trunk

2014-10-21 Thread Dehao Chen
Everything will be the same on non-intel CPUs except for the perf command:

perf record -e instructions -- your program.

i.e. you need to drop -b and use instructions as event.

Note that the current algorithm is tuned for accurate instruction
level profile, which is not available on non-Intel CPU. But you are
more than welcome to tune the propagation algorithm to get most out of
inaccurate instruction profile.

Cheers,
Dehao

On Tue, Oct 21, 2014 at 12:30 PM, Markus Trippelsdorf
mar...@trippelsdorf.de wrote:
 On 2014.10.20 at 14:21 -0700, Dehao Chen wrote:
  +If @var{path} is specified, GCC looks at the @var{path} to find
  +the profile feedback data files.
  +
  +In order to collect AutoFDO profile, you need to have:
  +
  +1. A linux system with linux perf support
  +2. An Intel processor with last branch record (LBR) support
 
  Is it really a requirement or will it only make profiles more accurate?

 It's optional, updated the doc.

 Can you give an example on how this is supposed to work on non Intel
 CPUs? It is not obvious to me.

 --
 Markus


Re: [PATCH] AutoFDO patch for trunk

2014-10-21 Thread Markus Trippelsdorf
On 2014.10.21 at 13:53 -0700, Dehao Chen wrote:
 Everything will be the same on non-intel CPUs except for the perf command:
 
 perf record -e instructions -- your program.
 
 i.e. you need to drop -b and use instructions as event.
 
 Note that the current algorithm is tuned for accurate instruction
 level profile, which is not available on non-Intel CPU. But you are
 more than welcome to tune the propagation algorithm to get most out of
 inaccurate instruction profile.

Thanks.
But unfortunately it doesn't work:

markus@x4 autofdo % g++ -O2 -g /home/markus/bench.cpp
markus@x4 autofdo % perf record -e instructions  ./a.out
...
markus@x4 autofdo % ./create_gcov --binary=a.out --profile=perf.data 
--gcov=profile.afdo
E1021 23:23:30.733412 13689 perf_reader.cc:918] Unsupported event type 10
F1021 23:23:30.733707 13689 perf_parser.cc:114] Check failed: 
ReadPerfSampleInfo(**parsed_events_[i].raw_event, sample_info) 
*** Check failure stack trace: ***
@   0x44469d  google::LogMessage::Flush()
@   0x4470b9  google::LogMessageFatal::~LogMessageFatal()
@   0x43c668  quipper::PerfParser::SortParsedEvents()
@   0x43f0cb  quipper::PerfParser::ParseRawEvents()
@   0x4221ef  autofdo::PerfDataSampleReader::Append()
@   0x421af3  autofdo::SampleReader::ReadAndSetMaxCount()
@   0x417e61  autofdo::ProfileCreator::ReadSample()
@   0x418db2  autofdo::ProfileCreator::CreateProfile()
@   0x4086a5  main
@ 0x7fa849ce8fd0  __libc_start_main
@   0x4099cb  (unknown)
@  (nil)  (unknown)
[1]13689 abort  ./create_gcov --binary=a.out --profile=perf.data 
--gcov=profile.afdo

-- 
Markus


Re: [PATCH] AutoFDO patch for trunk

2014-10-21 Thread Dehao Chen
Looks like the perf data type is incompatible with quipper (perf data
parser). Can you send me the perf.data file so that I can take a look.

Thanks,
Dehao

On Tue, Oct 21, 2014 at 2:25 PM, Markus Trippelsdorf
mar...@trippelsdorf.de wrote:
 On 2014.10.21 at 13:53 -0700, Dehao Chen wrote:
 Everything will be the same on non-intel CPUs except for the perf command:

 perf record -e instructions -- your program.

 i.e. you need to drop -b and use instructions as event.

 Note that the current algorithm is tuned for accurate instruction
 level profile, which is not available on non-Intel CPU. But you are
 more than welcome to tune the propagation algorithm to get most out of
 inaccurate instruction profile.

 Thanks.
 But unfortunately it doesn't work:

 markus@x4 autofdo % g++ -O2 -g /home/markus/bench.cpp
 markus@x4 autofdo % perf record -e instructions  ./a.out
 ...
 markus@x4 autofdo % ./create_gcov --binary=a.out --profile=perf.data 
 --gcov=profile.afdo
 E1021 23:23:30.733412 13689 perf_reader.cc:918] Unsupported event type 10
 F1021 23:23:30.733707 13689 perf_parser.cc:114] Check failed: 
 ReadPerfSampleInfo(**parsed_events_[i].raw_event, sample_info)
 *** Check failure stack trace: ***
 @   0x44469d  google::LogMessage::Flush()
 @   0x4470b9  google::LogMessageFatal::~LogMessageFatal()
 @   0x43c668  quipper::PerfParser::SortParsedEvents()
 @   0x43f0cb  quipper::PerfParser::ParseRawEvents()
 @   0x4221ef  autofdo::PerfDataSampleReader::Append()
 @   0x421af3  autofdo::SampleReader::ReadAndSetMaxCount()
 @   0x417e61  autofdo::ProfileCreator::ReadSample()
 @   0x418db2  autofdo::ProfileCreator::CreateProfile()
 @   0x4086a5  main
 @ 0x7fa849ce8fd0  __libc_start_main
 @   0x4099cb  (unknown)
 @  (nil)  (unknown)
 [1]13689 abort  ./create_gcov --binary=a.out --profile=perf.data 
 --gcov=profile.afdo

 --
 Markus


Re: [PATCH] AutoFDO patch for trunk

2014-10-21 Thread Markus Trippelsdorf
On 2014.10.21 at 15:31 -0700, Dehao Chen wrote:
 Looks like the perf data type is incompatible with quipper (perf data
 parser). Can you send me the perf.data file so that I can take a look.

PERF_RECORD_MMAP2 (aka 10) was added in Linux 3.12 (commit 13d7a2410f).
So your autofdo tool simply doesn't work on newer kernels.

-- 
Markus


Re: [PATCH] AutoFDO patch for trunk

2014-10-20 Thread Dehao Chen
The updated patch attached. Will commit the patch in 2~3 hours if no
objection is received.

Thanks,
Dehao

On Sun, Oct 19, 2014 at 2:58 AM, Jan Hubicka hubi...@ucw.cz wrote:
  +/* Member functions for string_table.  */
  +
  +string_table *
  +string_table::create ()
 
  Why this is not a constructor?

 We use static initializer because it's not suggested to put too much
 logic in constructor.

 Why not? :)

You're right. Updated to use constructor instead.

  +}
 
  The two hunks probably can be unified.
  Why get_index_by_decl does not already use dwarf name and why do you need 
  to consider both?

 get_index_by_decl is actually a wrapper of get_index. It tolerates
 error when assembler name is not emitted in the debug binary (mostly
 for functions that are fully inlined). In this case, we will first try
 to find the index of the assembler name, if not found, then bfd name.
 If still not found, then we try to find name from its abstract
 location.

 Yep, I am just somewhat concerned that you need to try different variations 
 of the name.
 I would expect the assembler name (minus the random seed mangling) to be 
 sufficient.

In theory, assembler name should be good enough. But in practice, we
still see that for some functions, assembler name is not available.
This function purely exists to tolerate faults like this. And should
be simplified once debug info becomes perfect.


 
  +  if (DECL_ABSTRACT_ORIGIN (decl))
  +return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN 
  (decl));
 
  What really happens for ipa-cp function clones and for split functions?

 Their name suffix will be stripped before matching names.

 I see and profile merged, that seems resonable.
  I am not sure it is a win to have the things represented as VPT histograms 
  when you shoot
  for quite special purpose problem.  But lets deal with these 
  incrementally, too.
  (in general I do not see why you try to share so much with the gcov based 
  VTP - it seems
  easier to do the speculation by hand)

 Basically there are 2 problem:

 * before annotation. I agree this is a special purpose problem
 * after annotation. This should be the same problem as VPT.

 Yep, I think adding VPT histograms to get devirtualization happen 
 inter-procedurally
 with LTO is a good idea.
 You will need (incrementally I guess) to translate your indexes into the 
 profile-id used
 or you will need to initialize profile-ids accordingly, so the IPA pass is 
 able to identify
 the target cross-module.

Sure, let's do this incrementally. But for the symbol look up, we
actually use func_name instead of profile-id. So I guess we don't have
the problem you mentioned here? (except that function name my
collide).

  +{
  +  if (gcov_open (auto_profile_file, 1) == 0)
  +error (Cannot open profile file %s., auto_profile_file);
  +
  +  if (gcov_read_unsigned () != GCOV_DATA_MAGIC)
  +error (AutoFDO profile magic number does not mathch.);
  +
  +  /* Skip the version number.  */
  +  gcov_read_unsigned ();
  +
  +  /* Skip the empty integer.  */
  +  gcov_read_unsigned ();
 
  Perhaps we can just check the values?

 What do you mean? Check the value against 0?

 Yes, if you have version in there it seems to make sense to check it ;)

Done

 I guess it is there so you can add extra stuff into the file format that will 
 make
 it incompatible with current one, so probably we should reject all versions 
 greater
 than 0. (especially because the tool is off-tree)
 
  Does this have chance to work inter-module?

 Yes, it works fine with LIPO.

 Important (for me) is to make it work with LTO, because LIPO is apparently 
 not landing to 5.0.

I think we only need to handle function name colliding case. This
could be simply be resolved by coupling function name with file name.


 Yes, this could be done.

 After a second thought, I think we actually need to expose einliner
 interface to autofdo (instead of doing vpt in early inliner). This is
 because we need to mark icall as promoted, so that later vpt passes
 will not try to promote it again. Currently in AutoFDO, we use a
 self-contained way (use a set to mark all promoted stmts). If we do
 vpt in einline, then we will need some flag attached to gimple to
 indicate whether it's already promoted.

 If we manage to remove the iteration loop that repeativly applies the inline 
 plan, you only
 need way to mark edge as promoted.  I guess you can just test the speculative 
 flag on the edge
 to skip ones you dealt with earlier.

 Lets do that incrementally though.

Agree, let's do it incrementally.

 +
 +@item -fauto-profile
 +@itemx -fauto-profile=@var{path}
 +@opindex fauto-profile
 +Enable sampling based feedback directed optimizations, and optimizations
 +generally profitable only with profile feedback available.
 +
 +The following options are enabled: @code{-fbranch-probabilities}, 
 @code{-fvpt},
 +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, 
 @code{-ftree-vectorize},
 

Re: [PATCH] AutoFDO patch for trunk

2014-10-19 Thread Jan Hubicka
  +/* Member functions for string_table.  */
  +
  +string_table *
  +string_table::create ()
 
  Why this is not a constructor?
 
 We use static initializer because it's not suggested to put too much
 logic in constructor.

Why not? :)
  +}
 
  The two hunks probably can be unified.
  Why get_index_by_decl does not already use dwarf name and why do you need 
  to consider both?
 
 get_index_by_decl is actually a wrapper of get_index. It tolerates
 error when assembler name is not emitted in the debug binary (mostly
 for functions that are fully inlined). In this case, we will first try
 to find the index of the assembler name, if not found, then bfd name.
 If still not found, then we try to find name from its abstract
 location.

Yep, I am just somewhat concerned that you need to try different variations of 
the name.
I would expect the assembler name (minus the random seed mangling) to be 
sufficient.
 
 
  +  if (DECL_ABSTRACT_ORIGIN (decl))
  +return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN 
  (decl));
 
  What really happens for ipa-cp function clones and for split functions?
 
 Their name suffix will be stripped before matching names.

I see and profile merged, that seems resonable.
  I am not sure it is a win to have the things represented as VPT histograms 
  when you shoot
  for quite special purpose problem.  But lets deal with these incrementally, 
  too.
  (in general I do not see why you try to share so much with the gcov based 
  VTP - it seems
  easier to do the speculation by hand)
 
 Basically there are 2 problem:
 
 * before annotation. I agree this is a special purpose problem
 * after annotation. This should be the same problem as VPT.

Yep, I think adding VPT histograms to get devirtualization happen 
inter-procedurally
with LTO is a good idea.
You will need (incrementally I guess) to translate your indexes into the 
profile-id used
or you will need to initialize profile-ids accordingly, so the IPA pass is able 
to identify
the target cross-module.
  +{
  +  if (gcov_open (auto_profile_file, 1) == 0)
  +error (Cannot open profile file %s., auto_profile_file);
  +
  +  if (gcov_read_unsigned () != GCOV_DATA_MAGIC)
  +error (AutoFDO profile magic number does not mathch.);
  +
  +  /* Skip the version number.  */
  +  gcov_read_unsigned ();
  +
  +  /* Skip the empty integer.  */
  +  gcov_read_unsigned ();
 
  Perhaps we can just check the values?
 
 What do you mean? Check the value against 0?

Yes, if you have version in there it seems to make sense to check it ;)
I guess it is there so you can add extra stuff into the file format that will 
make
it incompatible with current one, so probably we should reject all versions 
greater
than 0. (especially because the tool is off-tree)
 
  Does this have chance to work inter-module?
 
 Yes, it works fine with LIPO.

Important (for me) is to make it work with LTO, because LIPO is apparently not 
landing to 5.0.
 
 Yes, this could be done.
 
 After a second thought, I think we actually need to expose einliner
 interface to autofdo (instead of doing vpt in early inliner). This is
 because we need to mark icall as promoted, so that later vpt passes
 will not try to promote it again. Currently in AutoFDO, we use a
 self-contained way (use a set to mark all promoted stmts). If we do
 vpt in einline, then we will need some flag attached to gimple to
 indicate whether it's already promoted.

If we manage to remove the iteration loop that repeativly applies the inline 
plan, you only
need way to mark edge as promoted.  I guess you can just test the speculative 
flag on the edge
to skip ones you dealt with earlier.

Lets do that incrementally though.
 +
 +@item -fauto-profile
 +@itemx -fauto-profile=@var{path}
 +@opindex fauto-profile
 +Enable sampling based feedback directed optimizations, and optimizations
 +generally profitable only with profile feedback available.
 +
 +The following options are enabled: @code{-fbranch-probabilities}, 
 @code{-fvpt},
 +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, 
 @code{-ftree-vectorize},
 +@code{ftree-loop-distribute-patterns}

I wonder about loop peeling - we do not enale it by default because we do not 
believe
our static branch predictor can well identify loops that have low expected trip 
count
(except ones that can be fully inlined).
I am not sure auto-FDO is much better here - how reliable the info about number 
of iteratoins
is?

But this is again something that can be handled separately.

Documentation seems out of date though - it seems to enable also
-finline, -fipa-cp and cloning, predictive commoning, unswitching, 
gcse-after-reload,

You do not want to enable reorder-functions because autoFDO does not include 
the time profiler.
Also I think speculative devirtualization should not be disabled.
 +
 +If @var{path} is specified, GCC looks at the @var{path} to find
 +the profile feedback data files.
 +
 +In order to collect AutoFDO profile, you need to 

Re: [PATCH] AutoFDO patch for trunk

2014-10-16 Thread Dehao Chen
Hi, Honza,

I've integrated all your comments to the patch. New patch attached.

Thanks,
Dehao



On Wed, Oct 15, 2014 at 7:28 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Index: gcc/cfgloop.c
 ===
 --- gcc/cfgloop.c (revision 215826)
 +++ gcc/cfgloop.c (working copy)
 @@ -1802,7 +1802,7 @@ record_niter_bound (struct loop *loop, const wides
  }
if (realistic
 (!loop-any_estimate
 -   || wi::ltu_p (i_bound, loop-nb_iterations_estimate)))
 +   || (!flag_auto_profile  wi::ltu_p (i_bound, 
 loop-nb_iterations_estimate

 I do not remember this hunk from previous patches.  So you make autofdo to 
 drop
 the conditional that new estimate must be smaller than preivously derived.  I
 suppose it is because you want autofdo estimates to be overwritten by anything
 determined later.

 Lets drop this hunk now, too.  I think we really want to have way to track if
 the estimate is from autofdo and thus not too reliable but better than
 nothing and we may want to introduce some kind of reliability metric on way
 estimates are derived.

 Lets handle this incrementally and drop this change from this patch.

Done.


 Index: gcc/doc/invoke.texi
 ===
 --- gcc/doc/invoke.texi   (revision 215826)
 +++ gcc/doc/invoke.texi   (working copy)
 @@ -365,7 +365,8 @@ Objective-C and Objective-C++ Dialects}.
  @gccoptlist{-faggressive-loop-optimizations -falign-functions[=@var{n}] @gol
  -falign-jumps[=@var{n}] @gol
  -falign-labels[=@var{n}] -falign-loops[=@var{n}] @gol
 --fassociative-math -fauto-inc-dec -fbranch-probabilities @gol
 +-fassociative-math -fauto-profile -fauto-profile[=@var{path}] @gol
 +-fauto-inc-dec -fbranch-probabilities @gol
  -fbranch-target-load-optimize -fbranch-target-load-optimize2 @gol
  -fbtr-bb-exclusive -fcaller-saves @gol
  -fcheck-data-deps -fcombine-stack-adjustments -fconserve-stack @gol
 @@ -9164,6 +9165,19 @@ code.

  If @var{path} is specified, GCC looks at the @var{path} to find
  the profile feedback data files. See @option{-fprofile-dir}.
 +
 +@item -fauto-profile
 +@itemx -fauto-profile=@var{path}
 +@opindex fauto-profile
 +Enable sampling based feedback directed optimizations, and optimizations
 +generally profitable only with profile feedback available.
 +
 +The following options are enabled: @code{-fbranch-probabilities}, 
 @code{-fvpt},
 +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, 
 @code{-ftree-vectorize},
 +@code{ftree-loop-distribute-patterns}
 +
 +If @var{path} is specified, GCC looks at the @var{path} to find
 +the profile feedback data files.

 I agree with Andi that you need to write more here - basically it should be 
 clear how to
 profile the application and use the data.

 We do use external tools here (oprofile/gperf/non-Linux tools I suppose) plus 
 an
 conversion tool. So you need to provide some references to those.

 Actually I think it would be really nice to add tutorial-like chapter into GCC
 manual about using FDO/autoFDO to improve your code or even more general aimed
 covering advanced topics like correct LTO use, profile feedback and other fun.

Done, added some little example to demonstrate how to collect profile.

 Index: gcc/value-prof.c
 ===
 --- gcc/value-prof.c  (revision 215826)
 +++ gcc/value-prof.c  (working copy)
 @@ -134,11 +134,10 @@ static bool gimple_divmod_fixed_value_transform (g
  static bool gimple_mod_pow2_value_transform (gimple_stmt_iterator *);
  static bool gimple_mod_subtract_transform (gimple_stmt_iterator *);
  static bool gimple_stringops_transform (gimple_stmt_iterator *);
 -static bool gimple_ic_transform (gimple_stmt_iterator *);

  /* Allocate histogram value.  */

 -static histogram_value
 +histogram_value
  gimple_alloc_histogram_value (struct function *fun ATTRIBUTE_UNUSED,
 enum hist_type type, gimple stmt, tree value)

 Do we still need to expor those?

Yes, as soon as we are faking histogram.


  {
 @@ -1307,6 +1306,9 @@ del_node_map (void)
  struct cgraph_node*
  find_func_by_profile_id (int profile_id)
  {
 +  if (flag_auto_profile)
 +return cgraph_node::get_for_asmname (
 + get_identifier ((const char *)(size_t)profile_id));

 This definitly needs an comment.  Why you need to use find_func_by_profile_id 
 on both
 gcov profiling and auto profiling paths?
 Perhaps you can use your own fnuction here?

 Or perhaps this is not necessary if you implement directly the indirect call 
 optimiization?

Done


 We may try to head towards of having way to read both coverage and auto 
 profile into compiler
 eventually. It may be useful to get extra precision from gcov profile and to 
 measure data
 like cache misses with auto profile.
cgraph_node **val = cgraph_node_map-get (profile_id);
if (val)
  return *val;
 @@ -1320,7 +1322,7 @@ 

Re: [PATCH] AutoFDO patch for trunk

2014-10-15 Thread Jan Hubicka
 Index: gcc/cfgloop.c
 ===
 --- gcc/cfgloop.c (revision 215826)
 +++ gcc/cfgloop.c (working copy)
 @@ -1802,7 +1802,7 @@ record_niter_bound (struct loop *loop, const wides
  }
if (realistic
 (!loop-any_estimate
 -   || wi::ltu_p (i_bound, loop-nb_iterations_estimate)))
 +   || (!flag_auto_profile  wi::ltu_p (i_bound, 
 loop-nb_iterations_estimate

I do not remember this hunk from previous patches.  So you make autofdo to drop
the conditional that new estimate must be smaller than preivously derived.  I
suppose it is because you want autofdo estimates to be overwritten by anything
determined later.

Lets drop this hunk now, too.  I think we really want to have way to track if
the estimate is from autofdo and thus not too reliable but better than
nothing and we may want to introduce some kind of reliability metric on way
estimates are derived.

Lets handle this incrementally and drop this change from this patch.

 Index: gcc/doc/invoke.texi
 ===
 --- gcc/doc/invoke.texi   (revision 215826)
 +++ gcc/doc/invoke.texi   (working copy)
 @@ -365,7 +365,8 @@ Objective-C and Objective-C++ Dialects}.
  @gccoptlist{-faggressive-loop-optimizations -falign-functions[=@var{n}] @gol
  -falign-jumps[=@var{n}] @gol
  -falign-labels[=@var{n}] -falign-loops[=@var{n}] @gol
 --fassociative-math -fauto-inc-dec -fbranch-probabilities @gol
 +-fassociative-math -fauto-profile -fauto-profile[=@var{path}] @gol
 +-fauto-inc-dec -fbranch-probabilities @gol
  -fbranch-target-load-optimize -fbranch-target-load-optimize2 @gol
  -fbtr-bb-exclusive -fcaller-saves @gol
  -fcheck-data-deps -fcombine-stack-adjustments -fconserve-stack @gol
 @@ -9164,6 +9165,19 @@ code.
  
  If @var{path} is specified, GCC looks at the @var{path} to find
  the profile feedback data files. See @option{-fprofile-dir}.
 +
 +@item -fauto-profile
 +@itemx -fauto-profile=@var{path}
 +@opindex fauto-profile
 +Enable sampling based feedback directed optimizations, and optimizations
 +generally profitable only with profile feedback available.
 +
 +The following options are enabled: @code{-fbranch-probabilities}, 
 @code{-fvpt},
 +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, 
 @code{-ftree-vectorize},
 +@code{ftree-loop-distribute-patterns}
 +
 +If @var{path} is specified, GCC looks at the @var{path} to find
 +the profile feedback data files.

I agree with Andi that you need to write more here - basically it should be 
clear how to
profile the application and use the data.

We do use external tools here (oprofile/gperf/non-Linux tools I suppose) plus an
conversion tool. So you need to provide some references to those.

Actually I think it would be really nice to add tutorial-like chapter into GCC
manual about using FDO/autoFDO to improve your code or even more general aimed
covering advanced topics like correct LTO use, profile feedback and other fun.
 Index: gcc/value-prof.c
 ===
 --- gcc/value-prof.c  (revision 215826)
 +++ gcc/value-prof.c  (working copy)
 @@ -134,11 +134,10 @@ static bool gimple_divmod_fixed_value_transform (g
  static bool gimple_mod_pow2_value_transform (gimple_stmt_iterator *);
  static bool gimple_mod_subtract_transform (gimple_stmt_iterator *);
  static bool gimple_stringops_transform (gimple_stmt_iterator *);
 -static bool gimple_ic_transform (gimple_stmt_iterator *);
  
  /* Allocate histogram value.  */
  
 -static histogram_value
 +histogram_value
  gimple_alloc_histogram_value (struct function *fun ATTRIBUTE_UNUSED,
 enum hist_type type, gimple stmt, tree value)

Do we still need to expor those?

  {
 @@ -1307,6 +1306,9 @@ del_node_map (void)
  struct cgraph_node*
  find_func_by_profile_id (int profile_id)
  {
 +  if (flag_auto_profile)
 +return cgraph_node::get_for_asmname (
 + get_identifier ((const char *)(size_t)profile_id));

This definitly needs an comment.  Why you need to use find_func_by_profile_id 
on both
gcov profiling and auto profiling paths? 
Perhaps you can use your own fnuction here?

Or perhaps this is not necessary if you implement directly the indirect call 
optimiization?

We may try to head towards of having way to read both coverage and auto profile 
into compiler
eventually. It may be useful to get extra precision from gcov profile and to 
measure data
like cache misses with auto profile.
cgraph_node **val = cgraph_node_map-get (profile_id);
if (val)
  return *val;
 @@ -1320,7 +1322,7 @@ find_func_by_profile_id (int profile_id)
 may ICE. Here we only do very minimal sanity check just to make compiler 
 happy.
 Returns true if TARGET is considered ok for call CALL_STMT.  */
  
 -static bool
 +bool
  check_ic_target (gimple call_stmt, struct cgraph_node *target)
  {
 location_t locus;
 @@ -1483,7 +1485,7 @@ 

Re: [PATCH] AutoFDO patch for trunk

2014-10-14 Thread Jan Hubicka
 Index: gcc/cgraphclones.c
 ===
 --- gcc/cgraphclones.c(revision 215826)
 +++ gcc/cgraphclones.c(working copy)
 @@ -453,6 +453,11 @@
  }
else
  count_scale = 0;
 +  /* In AutoFDO, if edge count is larger than callee's entry block
 + count, we will not update the original callee because it may
 + mistakenly mark some hot function as cold.  */
 +  if (flag_auto_profile  gcov_count = count)
 +update_original = false;

lets drop this from initial patch.

 Index: gcc/bb-reorder.c
 ===
 --- gcc/bb-reorder.c  (revision 215826)
 +++ gcc/bb-reorder.c  (working copy)
 @@ -1569,15 +1569,14 @@
/* Mark which partition (hot/cold) each basic block belongs in.  */
FOR_EACH_BB_FN (bb, cfun)
  {
 -  bool cold_bb = false;
 +  bool cold_bb = probably_never_executed_bb_p (cfun, bb);

and this too
(basically all the tweaks should IMO go in independently and ideally in
a way that does not need flag_auto_profile test).
 +/* Return true if BB contains indirect call.  */
 +
 +static bool
 +has_indirect_call (basic_block bb)
 +{
 +  gimple_stmt_iterator gsi;
 +
 +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (gsi))
 +{
 +  gimple stmt = gsi_stmt (gsi);
 +  if (gimple_code (stmt) == GIMPLE_CALL
 +(gimple_call_fn (stmt) == NULL
 +   || TREE_CODE (gimple_call_fn (stmt)) != FUNCTION_DECL))

You probably want to skip gimple_call_internal_p calls here.
 +
 +/* From AutoFDO profiles, find values inside STMT for that we want to measure
 +   histograms for indirect-call optimization.  */
 +
 +static void
 +afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map,
 + bool transform)
 +{
 +  gimple stmt = gsi_stmt (*gsi);
 +  tree callee;
 +
 +  if (map.size() == 0 || gimple_code (stmt) != GIMPLE_CALL
 +  || gimple_call_fndecl (stmt) != NULL_TREE)
 +return;
 +
 +  callee = gimple_call_fn (stmt);
 +
 +  histogram_value hist = gimple_alloc_histogram_value (
 +  cfun, HIST_TYPE_INDIR_CALL, stmt, callee);
 +  hist-n_counters = 3;
 +  hist-hvalue.counters =  XNEWVEC (gcov_type, hist-n_counters);
 +  gimple_add_histogram_value (cfun, stmt, hist);
 +
 +  gcov_type total = 0;
 +  icall_target_map::const_iterator max_iter = map.end();
 +
 +  for (icall_target_map::const_iterator iter = map.begin();
 +   iter != map.end(); ++iter)
 +{
 +  total += iter-second;
 +  if (max_iter == map.end() || max_iter-second  iter-second)
 + max_iter = iter;
 +}
 +
 +  hist-hvalue.counters[0] = (unsigned long long)
 +  afdo_string_table-get_name (max_iter-first);
 +  hist-hvalue.counters[1] = max_iter-second;
 +  hist-hvalue.counters[2] = total;
 +
 +  if (!transform)
 +return;
 +
 +  if (gimple_ic_transform (gsi))
 +{
 +  struct cgraph_edge *indirect_edge =
 +   cgraph_node::get (current_function_decl)-get_edge (stmt);
 +  struct cgraph_node *direct_call =
 +   find_func_by_profile_id ((int)hist-hvalue.counters[0]);
 +  if (DECL_STRUCT_FUNCTION (direct_call-decl) == NULL)
 + return;
 +  struct cgraph_edge *new_edge =
 +   indirect_edge-make_speculative (direct_call, 0, 0);
 +  new_edge-redirect_call_stmt_to_callee ();
 +  gimple_remove_histogram_value (cfun, stmt, hist);
 +  inline_call (new_edge, true, NULL, NULL, false);
 +  return;
 +}
 +  return;

Is it necessary to go via histogram and gimple_ic_transform here?  I would 
expect that all you
need is to make the speculative edge and inline it. (bypassing the work of 
producing fake
histogram value and calling igmple_ic_transofrm on it)

Also it seems to me that you want to set direct_count nad frequency argument of
make_speculative so the resulting function profile is not off.

The rest of interfaces seems quite sane now.  Can you please look into
using speculative edges directly instead of hooking into the vpt infrastructure
and fixing the formatting issues of the new pass?

I will try to make another pass over the actual streaming logic that I find bit 
difficult
to read, but I quite trust you it does the right thing ;)

Honza


Re: [PATCH] AutoFDO patch for trunk

2014-10-14 Thread Dehao Chen
On Tue, Oct 14, 2014 at 8:02 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Index: gcc/cgraphclones.c
 ===
 --- gcc/cgraphclones.c(revision 215826)
 +++ gcc/cgraphclones.c(working copy)
 @@ -453,6 +453,11 @@
  }
else
  count_scale = 0;
 +  /* In AutoFDO, if edge count is larger than callee's entry block
 + count, we will not update the original callee because it may
 + mistakenly mark some hot function as cold.  */
 +  if (flag_auto_profile  gcov_count = count)
 +update_original = false;

 lets drop this from initial patch.

Done

 Index: gcc/bb-reorder.c
 ===
 --- gcc/bb-reorder.c  (revision 215826)
 +++ gcc/bb-reorder.c  (working copy)
 @@ -1569,15 +1569,14 @@
/* Mark which partition (hot/cold) each basic block belongs in.  */
FOR_EACH_BB_FN (bb, cfun)
  {
 -  bool cold_bb = false;
 +  bool cold_bb = probably_never_executed_bb_p (cfun, bb);

 and this too
 (basically all the tweaks should IMO go in independently and ideally in
 a way that does not need flag_auto_profile test).

Done.

 +/* Return true if BB contains indirect call.  */
 +
 +static bool
 +has_indirect_call (basic_block bb)
 +{
 +  gimple_stmt_iterator gsi;
 +
 +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (gsi))
 +{
 +  gimple stmt = gsi_stmt (gsi);
 +  if (gimple_code (stmt) == GIMPLE_CALL
 +(gimple_call_fn (stmt) == NULL
 +   || TREE_CODE (gimple_call_fn (stmt)) != FUNCTION_DECL))

 You probably want to skip gimple_call_internal_p calls here.

Done

 +
 +/* From AutoFDO profiles, find values inside STMT for that we want to 
 measure
 +   histograms for indirect-call optimization.  */
 +
 +static void
 +afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map,
 + bool transform)
 +{
 +  gimple stmt = gsi_stmt (*gsi);
 +  tree callee;
 +
 +  if (map.size() == 0 || gimple_code (stmt) != GIMPLE_CALL
 +  || gimple_call_fndecl (stmt) != NULL_TREE)
 +return;
 +
 +  callee = gimple_call_fn (stmt);
 +
 +  histogram_value hist = gimple_alloc_histogram_value (
 +  cfun, HIST_TYPE_INDIR_CALL, stmt, callee);
 +  hist-n_counters = 3;
 +  hist-hvalue.counters =  XNEWVEC (gcov_type, hist-n_counters);
 +  gimple_add_histogram_value (cfun, stmt, hist);
 +
 +  gcov_type total = 0;
 +  icall_target_map::const_iterator max_iter = map.end();
 +
 +  for (icall_target_map::const_iterator iter = map.begin();
 +   iter != map.end(); ++iter)
 +{
 +  total += iter-second;
 +  if (max_iter == map.end() || max_iter-second  iter-second)
 + max_iter = iter;
 +}
 +
 +  hist-hvalue.counters[0] = (unsigned long long)
 +  afdo_string_table-get_name (max_iter-first);
 +  hist-hvalue.counters[1] = max_iter-second;
 +  hist-hvalue.counters[2] = total;
 +
 +  if (!transform)
 +return;
 +
 +  if (gimple_ic_transform (gsi))
 +{
 +  struct cgraph_edge *indirect_edge =
 +   cgraph_node::get (current_function_decl)-get_edge (stmt);
 +  struct cgraph_node *direct_call =
 +   find_func_by_profile_id ((int)hist-hvalue.counters[0]);
 +  if (DECL_STRUCT_FUNCTION (direct_call-decl) == NULL)
 + return;
 +  struct cgraph_edge *new_edge =
 +   indirect_edge-make_speculative (direct_call, 0, 0);
 +  new_edge-redirect_call_stmt_to_callee ();
 +  gimple_remove_histogram_value (cfun, stmt, hist);
 +  inline_call (new_edge, true, NULL, NULL, false);
 +  return;
 +}
 +  return;

 Is it necessary to go via histogram and gimple_ic_transform here?  I would 
 expect that all you
 need is to make the speculative edge and inline it. (bypassing the work of 
 producing fake
 histogram value and calling igmple_ic_transofrm on it)

 Also it seems to me that you want to set direct_count nad frequency argument 
 of
 make_speculative so the resulting function profile is not off.

This function is actually served for 2 purposes:

* before annotation, we need to mark histogram, promote and inline
* after annotation, we just need to mark, and let follow-up logic to
decide if it needs to promote and inline.

And you are right, for the before annotation case, we can simply
call mark speculative and inline. But we still need the logic to
fake histogram for after annotation case. As a result, I unified two
cases into one function to reuse code as much as possible. Shall I
separate it into two functions instead?


 The rest of interfaces seems quite sane now.  Can you please look into
 using speculative edges directly instead of hooking into the vpt 
 infrastructure
 and fixing the formatting issues of the new pass?

I'll work on the formatting issues now (need to learn the format first
;-). The attached patch is up-to-date except for formatting changes.
I'll upload the patch again once the format change is in.

Thanks,
Dehao


 I will try to make another pass over 

Re: [PATCH] AutoFDO patch for trunk

2014-10-14 Thread Dehao Chen
The new patch is attached. I used clang-format for format auto-profile.{c|h}

Thanks,
Dehao

On Tue, Oct 14, 2014 at 2:05 PM, Dehao Chen de...@google.com wrote:
 On Tue, Oct 14, 2014 at 8:02 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Index: gcc/cgraphclones.c
 ===
 --- gcc/cgraphclones.c(revision 215826)
 +++ gcc/cgraphclones.c(working copy)
 @@ -453,6 +453,11 @@
  }
else
  count_scale = 0;
 +  /* In AutoFDO, if edge count is larger than callee's entry block
 + count, we will not update the original callee because it may
 + mistakenly mark some hot function as cold.  */
 +  if (flag_auto_profile  gcov_count = count)
 +update_original = false;

 lets drop this from initial patch.

 Done

 Index: gcc/bb-reorder.c
 ===
 --- gcc/bb-reorder.c  (revision 215826)
 +++ gcc/bb-reorder.c  (working copy)
 @@ -1569,15 +1569,14 @@
/* Mark which partition (hot/cold) each basic block belongs in.  */
FOR_EACH_BB_FN (bb, cfun)
  {
 -  bool cold_bb = false;
 +  bool cold_bb = probably_never_executed_bb_p (cfun, bb);

 and this too
 (basically all the tweaks should IMO go in independently and ideally in
 a way that does not need flag_auto_profile test).

 Done.

 +/* Return true if BB contains indirect call.  */
 +
 +static bool
 +has_indirect_call (basic_block bb)
 +{
 +  gimple_stmt_iterator gsi;
 +
 +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (gsi))
 +{
 +  gimple stmt = gsi_stmt (gsi);
 +  if (gimple_code (stmt) == GIMPLE_CALL
 +(gimple_call_fn (stmt) == NULL
 +   || TREE_CODE (gimple_call_fn (stmt)) != FUNCTION_DECL))

 You probably want to skip gimple_call_internal_p calls here.

 Done

 +
 +/* From AutoFDO profiles, find values inside STMT for that we want to 
 measure
 +   histograms for indirect-call optimization.  */
 +
 +static void
 +afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map,
 + bool transform)
 +{
 +  gimple stmt = gsi_stmt (*gsi);
 +  tree callee;
 +
 +  if (map.size() == 0 || gimple_code (stmt) != GIMPLE_CALL
 +  || gimple_call_fndecl (stmt) != NULL_TREE)
 +return;
 +
 +  callee = gimple_call_fn (stmt);
 +
 +  histogram_value hist = gimple_alloc_histogram_value (
 +  cfun, HIST_TYPE_INDIR_CALL, stmt, callee);
 +  hist-n_counters = 3;
 +  hist-hvalue.counters =  XNEWVEC (gcov_type, hist-n_counters);
 +  gimple_add_histogram_value (cfun, stmt, hist);
 +
 +  gcov_type total = 0;
 +  icall_target_map::const_iterator max_iter = map.end();
 +
 +  for (icall_target_map::const_iterator iter = map.begin();
 +   iter != map.end(); ++iter)
 +{
 +  total += iter-second;
 +  if (max_iter == map.end() || max_iter-second  iter-second)
 + max_iter = iter;
 +}
 +
 +  hist-hvalue.counters[0] = (unsigned long long)
 +  afdo_string_table-get_name (max_iter-first);
 +  hist-hvalue.counters[1] = max_iter-second;
 +  hist-hvalue.counters[2] = total;
 +
 +  if (!transform)
 +return;
 +
 +  if (gimple_ic_transform (gsi))
 +{
 +  struct cgraph_edge *indirect_edge =
 +   cgraph_node::get (current_function_decl)-get_edge (stmt);
 +  struct cgraph_node *direct_call =
 +   find_func_by_profile_id ((int)hist-hvalue.counters[0]);
 +  if (DECL_STRUCT_FUNCTION (direct_call-decl) == NULL)
 + return;
 +  struct cgraph_edge *new_edge =
 +   indirect_edge-make_speculative (direct_call, 0, 0);
 +  new_edge-redirect_call_stmt_to_callee ();
 +  gimple_remove_histogram_value (cfun, stmt, hist);
 +  inline_call (new_edge, true, NULL, NULL, false);
 +  return;
 +}
 +  return;

 Is it necessary to go via histogram and gimple_ic_transform here?  I would 
 expect that all you
 need is to make the speculative edge and inline it. (bypassing the work of 
 producing fake
 histogram value and calling igmple_ic_transofrm on it)

 Also it seems to me that you want to set direct_count nad frequency argument 
 of
 make_speculative so the resulting function profile is not off.

 This function is actually served for 2 purposes:

 * before annotation, we need to mark histogram, promote and inline
 * after annotation, we just need to mark, and let follow-up logic to
 decide if it needs to promote and inline.

 And you are right, for the before annotation case, we can simply
 call mark speculative and inline. But we still need the logic to
 fake histogram for after annotation case. As a result, I unified two
 cases into one function to reuse code as much as possible. Shall I
 separate it into two functions instead?


 The rest of interfaces seems quite sane now.  Can you please look into
 using speculative edges directly instead of hooking into the vpt 
 infrastructure
 and fixing the formatting issues of the new pass?

 I'll work on the formatting issues now (need to learn the format first
 ;-). The 

Re: [PATCH] AutoFDO patch for trunk

2014-10-14 Thread Andi Kleen
Dehao Chen de...@google.com writes:
 +
 +@item -fauto-profile
 +@itemx -fauto-profile=@var{path}
 +@opindex fauto-profile
 +Enable sampling based feedback directed optimizations, and optimizations
 +generally profitable only with profile feedback available.
 +
 +The following options are enabled: @code{-fbranch-probabilities}, 
 @code{-fvpt},
 +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, 
 @code{-ftree-vectorize},
 +@code{ftree-loop-distribute-patterns}

This needs more description aimed end-users, what it is good for and why,
and a pointer to the needed utilities and a short summary
what steps they need to take.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] AutoFDO patch for trunk

2014-10-09 Thread Jan Hubicka
  Index: gcc/bb-reorder.c
  ===
  --- gcc/bb-reorder.c  (revision 210180)
  +++ gcc/bb-reorder.c  (working copy)
  @@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun)
user defined section attributes.  Don't call it if either case
arises.  */
 return (flag_reorder_blocks_and_partition
  -optimize
  +optimize !flag_auto_profile
 
  Formatting error. Why we want !flag_auto_profile? Can't we just arrange
  flag_reorder_blocks_and_partition to default to false?
 
  In fact I would like to see flag_reorder_blocks_and_partition to offload 
  obviously
  cold code (in partiuclar the EH cleanups) even without profile. That should 
  work
  well with autofdo.
 
 This will cause bzip2 performance to degrade 6%. I haven't had time to
 triage the problem. Will investigate this later.

Still I would preffer to make this by default flag_reorder_blocks_and_partition
to false with auto_profile.  We could do that incrementally, lets just drop
this from initial patch.
  -  if (flag_branch_probabilities)
  +  if (flag_auto_profile)
  +init_auto_profile ();
 
  Perhaps initialization can happen from toplev.c instead?
 
 This is actually reading of afdo profile. I've renamed the function to
 make it consistent.

OK.

 
  Index: gcc/cgraphclones.c
  ===
  --- gcc/cgraphclones.c(revision 210180)
  +++ gcc/cgraphclones.c(working copy)
  @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
   }
 else
   count_scale = 0;
  +  /* In AutoFDO, if edge count is larger than callee's entry block
  + count, we will not update the original callee because it may
  + mistakenly mark some hot function as cold.  */
  +  if (flag_auto_profile  count = n-count)
  +update_original = false;
 
  This looks like a hack. Lets go with it later - I see what you are shooting 
  for here
  and non-autoFDO has kind of same problem, too.  Counts are not at all that 
  reliable
  after some inlining.
 
  I wonder how this hack would fare for -fprofile-use
 
 It does not affect -fprofile-use performance in my testing. However,
 for AutoFDO, this happens much more often.

Yep, lets however drop it from the initial patch and make the tweaks go in 
later.
 
  @@ -1286,6 +1285,9 @@ del_node_map (void)
   struct cgraph_node*
   find_func_by_profile_id (int profile_id)
   {
  +  if (flag_auto_profile)
  +return cgraph_node_for_asm (
  + get_identifier ((const char *)(size_t)profile_id));
 
  I think with LTO we will need to make this machinery safe for symbol 
  mangling.  Any plans on this?
 
 No plan yet. Let's get AutoFDO in first, and spend some separate
 effort to make AutoFDO+LTO work?

OK.
 
  Index: gcc/ipa-inline.h
  ===
  --- gcc/ipa-inline.h  (revision 210180)
  +++ gcc/ipa-inline.h  (working copy)
  @@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge)
 edge_growth_cache[edge-uid] = zero;
   }
   }
  +
  +unsigned int early_inliner (function *fun);
 
  Can't this be handled by scheduling early inliner as subpass of autofdo or 
  something similar?
  I would rather make this explicit in PM.
 
 See my comment below in auto-profile.c
 
  Index: gcc/auto-profile.c
  ===
  --- gcc/auto-profile.c(revision 0)
  +++ gcc/auto-profile.c(revision 0)
  +   After the above 3 phases, all profile is readily annotated on the GCC 
  IR.
  +   AutoFDO tries to reuse all FDO infrastructure as much as possible to 
  make
  +   use of the profile. E.g. it uses existing mechanism to calculate the 
  basic
  +   block/edge frequency, as well as the cgraph node/edge count.
  +*/
 
  I suppose the phases can be made explicit to the PM, so early inliner can 
  stay
  a pass.
 
 
 Unfortunately, this is more complicated:
 
 while(changed) {
   vpt();
   einline();
   annotate();
 }

Hmm, PM do not support iteration (though it wouldmake sense in some cases at 
least for
expensive optimization).  Why you need to iterate einline with the other two?
 
 
  +/* Store a string array, indexed by string position in the array.  */
  +class string_table {
  +public:
  +  static string_table *create ();
  +
  +  /* For a given string, returns its index.  */
  +  int get_index (const char *name) const;
  +
  +  /* For a given decl, returns the index of the decl name.  */
  +  int get_index_by_decl (tree decl) const;
  +
  +  /* For a given index, returns the string.  */
  +  const char *get_name (int index) const;
 
  I wonder how these should work with LTO.  I suppose we can tell user to not 
  LTO
  for train run, we should teach the table to ignore LTO generated suffixes as
  well as random seeds (like coverage code already does).
 
  What happens when two static functions 

Re: [PATCH] AutoFDO patch for trunk

2014-10-09 Thread Dehao Chen

 This will cause bzip2 performance to degrade 6%. I haven't had time to
 triage the problem. Will investigate this later.

 Still I would preffer to make this by default 
 flag_reorder_blocks_and_partition
 to false with auto_profile.  We could do that incrementally, lets just drop
 this from initial patch.

Sure, I'll remove that for now.

  Index: gcc/cgraphclones.c
  ===
  --- gcc/cgraphclones.c(revision 210180)
  +++ gcc/cgraphclones.c(working copy)
  @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
   }
 else
   count_scale = 0;
  +  /* In AutoFDO, if edge count is larger than callee's entry block
  + count, we will not update the original callee because it may
  + mistakenly mark some hot function as cold.  */
  +  if (flag_auto_profile  count = n-count)
  +update_original = false;
 
  This looks like a hack. Lets go with it later - I see what you are 
  shooting for here
  and non-autoFDO has kind of same problem, too.  Counts are not at all that 
  reliable
  after some inlining.
 
  I wonder how this hack would fare for -fprofile-use

 It does not affect -fprofile-use performance in my testing. However,
 for AutoFDO, this happens much more often.

 Yep, lets however drop it from the initial patch and make the tweaks go in 
 later.

OK

  Index: gcc/auto-profile.c
  ===
  --- gcc/auto-profile.c(revision 0)
  +++ gcc/auto-profile.c(revision 0)
  +   After the above 3 phases, all profile is readily annotated on the GCC 
  IR.
  +   AutoFDO tries to reuse all FDO infrastructure as much as possible to 
  make
  +   use of the profile. E.g. it uses existing mechanism to calculate the 
  basic
  +   block/edge frequency, as well as the cgraph node/edge count.
  +*/
 
  I suppose the phases can be made explicit to the PM, so early inliner can 
  stay
  a pass.


 Unfortunately, this is more complicated:

 while(changed) {
   vpt();
   einline();
   annotate();
 }

 Hmm, PM do not support iteration (though it wouldmake sense in some cases at 
 least for
 expensive optimization).  Why you need to iterate einline with the other two?

It's for iterative AFDO. i.e. the profile could be collected from
AFDO-optimized binary.

E.g.

main() ind_call foo()
foo() ind_call bar()

First, profile is collected from O2, thus foo will be promoted and
inlined into main.

Then we profile on this AFDO-optimized binary, before annotation, we
need to make sure that foo is promoted and inlined main.

And in the next iteration, before annotation, we need to make sure bar
is promoted and inlined into foo, which is promoted and inlined into
main.



  +/* Store a string array, indexed by string position in the array.  */
  +class string_table {
  +public:
  +  static string_table *create ();
  +
  +  /* For a given string, returns its index.  */
  +  int get_index (const char *name) const;
  +
  +  /* For a given decl, returns the index of the decl name.  */
  +  int get_index_by_decl (tree decl) const;
  +
  +  /* For a given index, returns the string.  */
  +  const char *get_name (int index) const;
 
  I wonder how these should work with LTO.  I suppose we can tell user to 
  not LTO
  for train run, we should teach the table to ignore LTO generated suffixes 
  as
  well as random seeds (like coverage code already does).
 
  What happens when two static functions have same name?

 Shall we first focus on non-LTO version. And there could be separate
 effort to make AutoFDO+LTO work?

 For static functions, there will only be one copy in the profile. the
 profile will be merged. I agree that this could potentially cause
 problem. But OTOH, in a good practice, these functions should *not* be
 hot functions. At least in our practice, we didn't see any hot
 functions like this.

 You mean that static functions in differnt units that have same name tends to 
 be
 not hot? I don't really follow it...

I mean that if two static functions are both hot, it's better for them
to have different names. Otherwise performance tuning/debuging would
be super hard.


 I agree. My original intention is to make the function name handling
 as simple as possible. The current impl is based on the asumption that
 all important (hot) functions deserve a separate name (not just suffix
 difference). We could definitely refine the function name handling in
 the future patches.

  I would definitely merge logic here with logic in gcov that already knows 
  how to skip
  random seeds and other cases that makes filenames to diverge...

 Could you point me to the gcov logic that handles skipping function name 
 suffix?

 See coverage_checksum_string in coverage.c

Looks like that function is filling in randomly generated numbers by
0. This was not what I want because compiler may not even generate
these random numbers. E.g. for a given function, compiler 

Re: [PATCH] AutoFDO patch for trunk

2014-10-08 Thread Dehao Chen
Hi, Honza,

Sorry for the delay. I just picked up the original patch, and updated
it with your comments.

I've addressed most of your comments. Something else to discuss inlined.

I had refactored the patch to make it much less intrusive. New patch
is attached (ChangeLog will be added in the final version of the
patch). The performance of the new patch is the same as the original
patch.

Thanks,
Dehao

 Index: gcc/bb-reorder.c
 ===
 --- gcc/bb-reorder.c  (revision 210180)
 +++ gcc/bb-reorder.c  (working copy)
 @@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun)
   user defined section attributes.  Don't call it if either case
   arises.  */
return (flag_reorder_blocks_and_partition
 -optimize
 +optimize !flag_auto_profile

 Formatting error. Why we want !flag_auto_profile? Can't we just arrange
 flag_reorder_blocks_and_partition to default to false?

 In fact I would like to see flag_reorder_blocks_and_partition to offload 
 obviously
 cold code (in partiuclar the EH cleanups) even without profile. That should 
 work
 well with autofdo.

This will cause bzip2 performance to degrade 6%. I haven't had time to
triage the problem. Will investigate this later.

 Index: gcc/coverage.c
 ===
 --- gcc/coverage.c(revision 210180)
 +++ gcc/coverage.c(working copy)
 @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
  #include intl.h
  #include filenames.h
  #include target.h
 +#include auto-profile.h

  #include gcov-io.h
  #include gcov-io.c
 @@ -1175,7 +1176,9 @@ coverage_init (const char *filename)

bbg_file_stamp = local_tick;

 -  if (flag_branch_probabilities)
 +  if (flag_auto_profile)
 +init_auto_profile ();

 Perhaps initialization can happen from toplev.c instead?

This is actually reading of afdo profile. I've renamed the function to
make it consistent.

 Index: gcc/cgraphclones.c
 ===
 --- gcc/cgraphclones.c(revision 210180)
 +++ gcc/cgraphclones.c(working copy)
 @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
  }
else
  count_scale = 0;
 +  /* In AutoFDO, if edge count is larger than callee's entry block
 + count, we will not update the original callee because it may
 + mistakenly mark some hot function as cold.  */
 +  if (flag_auto_profile  count = n-count)
 +update_original = false;

 This looks like a hack. Lets go with it later - I see what you are shooting 
 for here
 and non-autoFDO has kind of same problem, too.  Counts are not at all that 
 reliable
 after some inlining.

 I wonder how this hack would fare for -fprofile-use

It does not affect -fprofile-use performance in my testing. However,
for AutoFDO, this happens much more often.

 @@ -1286,6 +1285,9 @@ del_node_map (void)
  struct cgraph_node*
  find_func_by_profile_id (int profile_id)
  {
 +  if (flag_auto_profile)
 +return cgraph_node_for_asm (
 + get_identifier ((const char *)(size_t)profile_id));

 I think with LTO we will need to make this machinery safe for symbol 
 mangling.  Any plans on this?

No plan yet. Let's get AutoFDO in first, and spend some separate
effort to make AutoFDO+LTO work?

 Index: gcc/ipa-inline.h
 ===
 --- gcc/ipa-inline.h  (revision 210180)
 +++ gcc/ipa-inline.h  (working copy)
 @@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge)
edge_growth_cache[edge-uid] = zero;
  }
  }
 +
 +unsigned int early_inliner (function *fun);

 Can't this be handled by scheduling early inliner as subpass of autofdo or 
 something similar?
 I would rather make this explicit in PM.

See my comment below in auto-profile.c

 Index: gcc/auto-profile.c
 ===
 --- gcc/auto-profile.c(revision 0)
 +++ gcc/auto-profile.c(revision 0)
 +   After the above 3 phases, all profile is readily annotated on the GCC IR.
 +   AutoFDO tries to reuse all FDO infrastructure as much as possible to make
 +   use of the profile. E.g. it uses existing mechanism to calculate the 
 basic
 +   block/edge frequency, as well as the cgraph node/edge count.
 +*/

 I suppose the phases can be made explicit to the PM, so early inliner can stay
 a pass.


Unfortunately, this is more complicated:

while(changed) {
  vpt();
  einline();
  annotate();
}


 +/* Store a string array, indexed by string position in the array.  */
 +class string_table {
 +public:
 +  static string_table *create ();
 +
 +  /* For a given string, returns its index.  */
 +  int get_index (const char *name) const;
 +
 +  /* For a given decl, returns the index of the decl name.  */
 +  int get_index_by_decl (tree decl) const;
 +
 +  /* For a given index, returns the string.  */
 +  const char 

Re: [PATCH] AutoFDO patch for trunk

2014-05-15 Thread Jan Hubicka
 Hi,
 
 I'm planning to port the AutoFDO patch upstream. Attached is the
 prepared patch. You can also find the patch in
 http://codereview.appspot.com/99010043
 
 I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks,
 the speedup comparison between O2, FDO and AutoFDO is as follows:
 
 Reference: o2
 (1): auto_fdo
 (2): fdo
 
Benchmark Base:Reference(1)  (2)
 -
 spec/2006/int/C++/471.omnetpp 23.18   +3.11%   +5.09%
 spec/2006/int/C++/473.astar   21.15   +6.79%   +9.80%
 spec/2006/int/C++/483.xalancbmk   36.68  +11.56%  +14.47%
 spec/2006/int/C/400.perlbench 34.57   +6.59%  +18.56%
 spec/2006/int/C/401.bzip2 23.17   +0.95%   +2.49%
 spec/2006/int/C/403.gcc   32.33   +8.27%   +9.76%
 spec/2006/int/C/429.mcf   42.13   +4.72%   +5.23%
 spec/2006/int/C/445.gobmk 26.53   -1.39%   +0.05%
 spec/2006/int/C/456.hmmer 23.72   +7.12%   +7.87%
 spec/2006/int/C/458.sjeng 26.17   +4.65%   +6.04%
 spec/2006/int/C/462.libquantum57.23   +4.04%   +1.42%
 spec/2006/int/C/464.h264ref46.3   +1.07%   +8.97%
 
 geometric mean+4.73%   +7.36%

The results are very nice indeed. So basically it adds another 50% to
static estimation, that is not bad.

The patch is long and missing a changelog, I will add my comments and
I think it would be great to break it up where possible - for some
infrastructure preparation patches and the actual reading infrastructure.

 Index: gcc/cfghooks.c
 ===
 --- gcc/cfghooks.c(revision 210180)
 +++ gcc/cfghooks.c(working copy)
 @@ -833,6 +833,9 @@ make_forwarder_block (basic_block bb, bool (*redir
  
fallthru = split_block_after_labels (bb);
dummy = fallthru-src;
 +  dummy-count = 0;
 +  dummy-frequency = 0;
 +  fallthru-count = 0;
bb = fallthru-dest;
  
/* Redirect back edges we want to keep.  */
 @@ -842,20 +845,13 @@ make_forwarder_block (basic_block bb, bool (*redir
  
if (redirect_edge_p (e))
   {
 +   dummy-frequency += EDGE_FREQUENCY (e);
 +   dummy-count += e-count;
 +   fallthru-count += e-count;
 ei_next (ei);
 continue;
   }
  
 -  dummy-frequency -= EDGE_FREQUENCY (e);
 -  dummy-count -= e-count;
 -  if (dummy-frequency  0)
 - dummy-frequency = 0;
 -  if (dummy-count  0)
 - dummy-count = 0;
 -  fallthru-count -= e-count;
 -  if (fallthru-count  0)
 - fallthru-count = 0;
 -

Can you elaborate why these changes are needed?  They look unrelated,
so it is a bug in forwarder block profile updating?
I do not see why source of the edge needs to have profile cleaned.
 Index: gcc/loop-unroll.c
 ===
 --- gcc/loop-unroll.c (revision 210180)
 +++ gcc/loop-unroll.c (working copy)
 @@ -998,6 +998,13 @@ decide_unroll_runtime_iterations (struct loop *loo
return;
  }
  
 +  /* In AutoFDO, the profile is not accurate. If the calculated trip count
 + is larger than the header count, then the profile is not accurate
 + enough to make correct unroll decisions. */
 +  if (flag_auto_profile
 +   expected_loop_iterations (loop)  loop-header-count)
 +return;

This is another independent change.  It does make sense, but I would
preffer to have it in a separate function (expected_loop_iterations_reliable_p)
rather than inline in the code.  

There are other loop optimizations that are driven by this, and they should
consistently use this predicate.

In fact current -fprofile-use enabled -funroll-loops flag that enables all
unrolling independently of reliablility of the profile.  I suppose we need to
imporove this scheme and have something liek unroll-loop-with-reliable-profile
path that would default for profile feedback optimization.

I also do not think your check makes a lot of sense.  You want to compare loop
header count and loop preheader or so, or this is not reliable enough for
auto-FDO profiles?

In any case, lets handle this separately of the main autoFDO patches (and not
forget on it).

 Index: gcc/dwarf2out.c
 ===
 --- gcc/dwarf2out.c   (revision 210180)
 +++ gcc/dwarf2out.c   (working copy)
 @@ -2481,6 +2481,40 @@ const struct gcc_debug_hooks dwarf2_debug_hooks =
1,/* start_end_main_source_file */
TYPE_SYMTAB_IS_DIE/* tree_type_symtab_field */
  };
 +
 +const struct gcc_debug_hooks auto_profile_debug_hooks =
 +{
 +  debug_nothing_charstar,
 +  debug_nothing_charstar,
 +  debug_nothing_void,
 +  debug_nothing_int_charstar,
 +  debug_nothing_int_charstar,
 +  debug_nothing_int_charstar,
 +  debug_nothing_int,
 +  

Re: [PATCH] AutoFDO patch for trunk

2014-05-15 Thread Jan Hubicka
 Index: gcc/auto-profile.c
 ===
 --- gcc/auto-profile.c(revision 0)
 +++ gcc/auto-profile.c(revision 0)
 @@ -0,0 +1,1584 @@
 +/* Calculate branch probabilities, and basic block execution counts.

Update the toplevel comment, too.

 +
 +/* The following routines implements AutoFDO optimization.
 +
 +   This optimization uses sampling profiles to annotate basic block counts
 +   and uses heuristics to estimate branch probabilities.
 +
 +   There are three phases in AutoFDO:
 +
 +   Phase 1: Read profile from the profile data file.
 + The following info is read from the profile datafile:
 + * string_table: a map between function name and its index.
 + * autofdo_source_profile: a map from function_instance name to
 +   function_instance. This is represented as a forest of
 +   function_instances.
 + * WorkingSet: a histogram of how many instructions are covered for a
 + given percentage of total cycles.
 +
 +   Phase 2: Early inline.
 + Early inline uses autofdo_source_profile to find if a callsite is:
 + * inlined in the profiled binary.
 + * callee body is hot in the profiling run.
 + If both condition satisfies, early inline will inline the callsite
 + regardless of the code growth.
 +
 +   Phase 3: Annotate control flow graph.
 + AutoFDO uses a separate pass to:
 + * Annotate basic block count
 + * Estimate branch probability
 +
 +   After the above 3 phases, all profile is readily annotated on the GCC IR.
 +   AutoFDO tries to reuse all FDO infrastructure as much as possible to make
 +   use of the profile. E.g. it uses existing mechanism to calculate the basic
 +   block/edge frequency, as well as the cgraph node/edge count.
 +*/

I suppose the phases can be made explicit to the PM, so early inliner can stay
a pass.
 +/* Store a string array, indexed by string position in the array.  */
 +class string_table {
 +public:
 +  static string_table *create ();
 +
 +  /* For a given string, returns its index.  */
 +  int get_index (const char *name) const;
 +
 +  /* For a given decl, returns the index of the decl name.  */
 +  int get_index_by_decl (tree decl) const;
 +
 +  /* For a given index, returns the string.  */
 +  const char *get_name (int index) const;

I wonder how these should work with LTO.  I suppose we can tell user to not LTO
for train run, we should teach the table to ignore LTO generated suffixes as
well as random seeds (like coverage code already does).

What happens when two static functions have same name?

 +/* Profile for all functions.  */
 +class autofdo_source_profile {
 +public:
 +  static autofdo_source_profile *create ()
 +{
 +  autofdo_source_profile *map = new autofdo_source_profile ();

I think we still want empty lines here.  Otherwise I trust you on following GCC 
C++ coding standards :)
 +  if (map-read ())
 + return map;
 +  delete map;
 +  return NULL;
 +}

 +/* Helper functions.  */
 +
 +/* Return the original name of NAME: strip the suffix that starts
 +   with '.'  */
 +
 +static const char *get_original_name (const char *name)
 +{
 +  char *ret = xstrdup (name);
 +  char *find = strchr (ret, '.');
 +  if (find != NULL)
 +*find = 0;
 +  return ret;
 +}

Stripping all suffixes will likely make conflicts on static functions, right?
Example like this will get you a conflict
m()
{ 
void t(void)
{
}
t();
}
m2()
{
  void t2(void)
{
}
t2();
}

that is of course not that important given that nested functions are not top 
priority
but it would be nice to handle it gratefuly.
I would definitely merge logic here with logic in gcov that already knows how 
to skip
random seeds and other cases that makes filenames to diverge...
 +
 +/* Return the combined location, which is a 32bit integer in which
 +   higher 16 bits stores the line offset of LOC to the start lineno
 +   of DECL, The lower 16 bits stores the discrimnator.  */
 +
 +static unsigned
 +get_combined_location (location_t loc, tree decl)
 +{
 +  return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl))  16);
 +}

What happens on overflows here? I guess they are possible - 65536 lines is not 
infinity
these days.
The discriminator is added later?
 +
 +int
 +string_table::get_index (const char *name) const
 +{
 +  if (name == NULL)
 +return -1;
 +  string_index_map::const_iterator iter = map_.find (name);
 +  if (iter == map_.end())
 +return -1;
 +  else
 +return iter-second;
 +}
 +
 +int
 +string_table::get_index_by_decl (tree decl) const

Comments missing here.
 +
 +/* Read the profile and create a function_instance with head count as
 +   HEAD_COUNT. Recursively read callsites to create nested function_instances
 +   too. STACK is used to track the recursive creation process.  */
 +
 +function_instance *
 +function_instance::read_function_instance (
 +function_instance_stack *stack, gcov_type head_count)
 +{
 +  unsigned name = 

Re: [PATCH] AutoFDO patch for trunk

2014-05-07 Thread Xinliang David Li
Have you announced the autofdo profile tool to gcc list?

David

On Wed, May 7, 2014 at 2:24 PM, Dehao Chen de...@google.com wrote:
 Hi,

 I'm planning to port the AutoFDO patch upstream. Attached is the
 prepared patch. You can also find the patch in
 http://codereview.appspot.com/99010043

 I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks,
 the speedup comparison between O2, FDO and AutoFDO is as follows:

 Reference: o2
 (1): auto_fdo
 (2): fdo

Benchmark Base:Reference(1)  (2)
 -
 spec/2006/int/C++/471.omnetpp 23.18   +3.11%   +5.09%
 spec/2006/int/C++/473.astar   21.15   +6.79%   +9.80%
 spec/2006/int/C++/483.xalancbmk   36.68  +11.56%  +14.47%
 spec/2006/int/C/400.perlbench 34.57   +6.59%  +18.56%
 spec/2006/int/C/401.bzip2 23.17   +0.95%   +2.49%
 spec/2006/int/C/403.gcc   32.33   +8.27%   +9.76%
 spec/2006/int/C/429.mcf   42.13   +4.72%   +5.23%
 spec/2006/int/C/445.gobmk 26.53   -1.39%   +0.05%
 spec/2006/int/C/456.hmmer 23.72   +7.12%   +7.87%
 spec/2006/int/C/458.sjeng 26.17   +4.65%   +6.04%
 spec/2006/int/C/462.libquantum57.23   +4.04%   +1.42%
 spec/2006/int/C/464.h264ref46.3   +1.07%   +8.97%

 geometric mean+4.73%   +7.36%

 The majority of the performance difference between AutoFDO and FDO
 comes from the lack of instruction level discriminator support. Cary
 Coutant is planning to port that patch upstream too.

 Please let me know if you have any question about this patch, and
 thanks in advance for reviewing such a huge patch.

 Dehao