Re: [PATCH] AutoFDO patch for trunk
Dehao Chen de...@google.com writes: The updated patch attached. Will commit the patch in 2~3 hours if no objection is received. Apart from the AIX bootstrap failure your patch introduced, it also breaks Solaris bootstrap: In file included from ./config.h:6:0, from /vol/gcc/src/hg/trunk/local/gcc/auto-profile.c:25: ./auto-host.h:1976:0: error: _FILE_OFFSET_BITS redefined [-Werror] #define _FILE_OFFSET_BITS 64 ^ In file included from /usr/include/iso/string_iso.h:24:0, from /usr/include/string.h:11, from /vol/gcc/src/hg/trunk/local/gcc/auto-profile.c:21: /var/gcc/regression/trunk/11-gcc/build/prev-gcc/include-fixed/sys/feature_tests.h:213:0: note: this is the location of the previous definition #define _FILE_OFFSET_BITS 32 ^ As Joseph is repeating over and over again, *nothing* must be included before config.h, and auto-profile.c violates this. The following patch at least allows the file to compile without errors; no idea if this the best order for the headers involved. diff --git a/gcc/auto-profile.c b/gcc/auto-profile.c --- a/gcc/auto-profile.c +++ b/gcc/auto-profile.c @@ -18,12 +18,13 @@ You should have received a copy of the G along with GCC; see the file COPYING3. If not see http://www.gnu.org/licenses/. */ +#include config.h +#include system.h + #include string.h #include map #include set -#include config.h -#include system.h #include coretypes.h #include tree.h #include tree-pass.h Rainer -- - Rainer Orth, Center for Biotechnology, Bielefeld University
Re: [PATCH] AutoFDO patch for trunk
Rainer Orth writes: As Joseph is repeating over and over again, *nothing* must be included before config.h, and auto-profile.c violates this. The following patch at least allows the file to compile without errors; no idea if this the best order for the headers involved. The patch also fixes the problem on AIX. Thanks, David
Re: [PATCH] AutoFDO patch for trunk
Can someone pre-approve the patch so that Dehao can check it in after basic testing? David On Wed, Oct 22, 2014 at 10:06 AM, David Edelsohn dje@gmail.com wrote: Rainer Orth writes: As Joseph is repeating over and over again, *nothing* must be included before config.h, and auto-profile.c violates this. The following patch at least allows the file to compile without errors; no idea if this the best order for the headers involved. The patch also fixes the problem on AIX. Thanks, David
Re: [PATCH] AutoFDO patch for trunk
The patch tested OK. And I think it's a trivial patch, and already committed it to trunk. About the perf parser. I'm syncing the toolchain to head which should already have newer kernel support. Thanks, Dehao On Wed, Oct 22, 2014 at 10:07 AM, Xinliang David Li davi...@google.com wrote: Can someone pre-approve the patch so that Dehao can check it in after basic testing? David On Wed, Oct 22, 2014 at 10:06 AM, David Edelsohn dje@gmail.com wrote: Rainer Orth writes: As Joseph is repeating over and over again, *nothing* must be included before config.h, and auto-profile.c violates this. The following patch at least allows the file to compile without errors; no idea if this the best order for the headers involved. The patch also fixes the problem on AIX. Thanks, David
Re: [PATCH] AutoFDO patch for trunk
On 2014.10.20 at 14:21 -0700, Dehao Chen wrote: +If @var{path} is specified, GCC looks at the @var{path} to find +the profile feedback data files. + +In order to collect AutoFDO profile, you need to have: + +1. A linux system with linux perf support +2. An Intel processor with last branch record (LBR) support Is it really a requirement or will it only make profiles more accurate? It's optional, updated the doc. Can you give an example on how this is supposed to work on non Intel CPUs? It is not obvious to me. -- Markus
Re: [PATCH] AutoFDO patch for trunk
Everything will be the same on non-intel CPUs except for the perf command: perf record -e instructions -- your program. i.e. you need to drop -b and use instructions as event. Note that the current algorithm is tuned for accurate instruction level profile, which is not available on non-Intel CPU. But you are more than welcome to tune the propagation algorithm to get most out of inaccurate instruction profile. Cheers, Dehao On Tue, Oct 21, 2014 at 12:30 PM, Markus Trippelsdorf mar...@trippelsdorf.de wrote: On 2014.10.20 at 14:21 -0700, Dehao Chen wrote: +If @var{path} is specified, GCC looks at the @var{path} to find +the profile feedback data files. + +In order to collect AutoFDO profile, you need to have: + +1. A linux system with linux perf support +2. An Intel processor with last branch record (LBR) support Is it really a requirement or will it only make profiles more accurate? It's optional, updated the doc. Can you give an example on how this is supposed to work on non Intel CPUs? It is not obvious to me. -- Markus
Re: [PATCH] AutoFDO patch for trunk
On 2014.10.21 at 13:53 -0700, Dehao Chen wrote: Everything will be the same on non-intel CPUs except for the perf command: perf record -e instructions -- your program. i.e. you need to drop -b and use instructions as event. Note that the current algorithm is tuned for accurate instruction level profile, which is not available on non-Intel CPU. But you are more than welcome to tune the propagation algorithm to get most out of inaccurate instruction profile. Thanks. But unfortunately it doesn't work: markus@x4 autofdo % g++ -O2 -g /home/markus/bench.cpp markus@x4 autofdo % perf record -e instructions ./a.out ... markus@x4 autofdo % ./create_gcov --binary=a.out --profile=perf.data --gcov=profile.afdo E1021 23:23:30.733412 13689 perf_reader.cc:918] Unsupported event type 10 F1021 23:23:30.733707 13689 perf_parser.cc:114] Check failed: ReadPerfSampleInfo(**parsed_events_[i].raw_event, sample_info) *** Check failure stack trace: *** @ 0x44469d google::LogMessage::Flush() @ 0x4470b9 google::LogMessageFatal::~LogMessageFatal() @ 0x43c668 quipper::PerfParser::SortParsedEvents() @ 0x43f0cb quipper::PerfParser::ParseRawEvents() @ 0x4221ef autofdo::PerfDataSampleReader::Append() @ 0x421af3 autofdo::SampleReader::ReadAndSetMaxCount() @ 0x417e61 autofdo::ProfileCreator::ReadSample() @ 0x418db2 autofdo::ProfileCreator::CreateProfile() @ 0x4086a5 main @ 0x7fa849ce8fd0 __libc_start_main @ 0x4099cb (unknown) @ (nil) (unknown) [1]13689 abort ./create_gcov --binary=a.out --profile=perf.data --gcov=profile.afdo -- Markus
Re: [PATCH] AutoFDO patch for trunk
Looks like the perf data type is incompatible with quipper (perf data parser). Can you send me the perf.data file so that I can take a look. Thanks, Dehao On Tue, Oct 21, 2014 at 2:25 PM, Markus Trippelsdorf mar...@trippelsdorf.de wrote: On 2014.10.21 at 13:53 -0700, Dehao Chen wrote: Everything will be the same on non-intel CPUs except for the perf command: perf record -e instructions -- your program. i.e. you need to drop -b and use instructions as event. Note that the current algorithm is tuned for accurate instruction level profile, which is not available on non-Intel CPU. But you are more than welcome to tune the propagation algorithm to get most out of inaccurate instruction profile. Thanks. But unfortunately it doesn't work: markus@x4 autofdo % g++ -O2 -g /home/markus/bench.cpp markus@x4 autofdo % perf record -e instructions ./a.out ... markus@x4 autofdo % ./create_gcov --binary=a.out --profile=perf.data --gcov=profile.afdo E1021 23:23:30.733412 13689 perf_reader.cc:918] Unsupported event type 10 F1021 23:23:30.733707 13689 perf_parser.cc:114] Check failed: ReadPerfSampleInfo(**parsed_events_[i].raw_event, sample_info) *** Check failure stack trace: *** @ 0x44469d google::LogMessage::Flush() @ 0x4470b9 google::LogMessageFatal::~LogMessageFatal() @ 0x43c668 quipper::PerfParser::SortParsedEvents() @ 0x43f0cb quipper::PerfParser::ParseRawEvents() @ 0x4221ef autofdo::PerfDataSampleReader::Append() @ 0x421af3 autofdo::SampleReader::ReadAndSetMaxCount() @ 0x417e61 autofdo::ProfileCreator::ReadSample() @ 0x418db2 autofdo::ProfileCreator::CreateProfile() @ 0x4086a5 main @ 0x7fa849ce8fd0 __libc_start_main @ 0x4099cb (unknown) @ (nil) (unknown) [1]13689 abort ./create_gcov --binary=a.out --profile=perf.data --gcov=profile.afdo -- Markus
Re: [PATCH] AutoFDO patch for trunk
On 2014.10.21 at 15:31 -0700, Dehao Chen wrote: Looks like the perf data type is incompatible with quipper (perf data parser). Can you send me the perf.data file so that I can take a look. PERF_RECORD_MMAP2 (aka 10) was added in Linux 3.12 (commit 13d7a2410f). So your autofdo tool simply doesn't work on newer kernels. -- Markus
Re: [PATCH] AutoFDO patch for trunk
The updated patch attached. Will commit the patch in 2~3 hours if no objection is received. Thanks, Dehao On Sun, Oct 19, 2014 at 2:58 AM, Jan Hubicka hubi...@ucw.cz wrote: +/* Member functions for string_table. */ + +string_table * +string_table::create () Why this is not a constructor? We use static initializer because it's not suggested to put too much logic in constructor. Why not? :) You're right. Updated to use constructor instead. +} The two hunks probably can be unified. Why get_index_by_decl does not already use dwarf name and why do you need to consider both? get_index_by_decl is actually a wrapper of get_index. It tolerates error when assembler name is not emitted in the debug binary (mostly for functions that are fully inlined). In this case, we will first try to find the index of the assembler name, if not found, then bfd name. If still not found, then we try to find name from its abstract location. Yep, I am just somewhat concerned that you need to try different variations of the name. I would expect the assembler name (minus the random seed mangling) to be sufficient. In theory, assembler name should be good enough. But in practice, we still see that for some functions, assembler name is not available. This function purely exists to tolerate faults like this. And should be simplified once debug info becomes perfect. + if (DECL_ABSTRACT_ORIGIN (decl)) +return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN (decl)); What really happens for ipa-cp function clones and for split functions? Their name suffix will be stripped before matching names. I see and profile merged, that seems resonable. I am not sure it is a win to have the things represented as VPT histograms when you shoot for quite special purpose problem. But lets deal with these incrementally, too. (in general I do not see why you try to share so much with the gcov based VTP - it seems easier to do the speculation by hand) Basically there are 2 problem: * before annotation. I agree this is a special purpose problem * after annotation. This should be the same problem as VPT. Yep, I think adding VPT histograms to get devirtualization happen inter-procedurally with LTO is a good idea. You will need (incrementally I guess) to translate your indexes into the profile-id used or you will need to initialize profile-ids accordingly, so the IPA pass is able to identify the target cross-module. Sure, let's do this incrementally. But for the symbol look up, we actually use func_name instead of profile-id. So I guess we don't have the problem you mentioned here? (except that function name my collide). +{ + if (gcov_open (auto_profile_file, 1) == 0) +error (Cannot open profile file %s., auto_profile_file); + + if (gcov_read_unsigned () != GCOV_DATA_MAGIC) +error (AutoFDO profile magic number does not mathch.); + + /* Skip the version number. */ + gcov_read_unsigned (); + + /* Skip the empty integer. */ + gcov_read_unsigned (); Perhaps we can just check the values? What do you mean? Check the value against 0? Yes, if you have version in there it seems to make sense to check it ;) Done I guess it is there so you can add extra stuff into the file format that will make it incompatible with current one, so probably we should reject all versions greater than 0. (especially because the tool is off-tree) Does this have chance to work inter-module? Yes, it works fine with LIPO. Important (for me) is to make it work with LTO, because LIPO is apparently not landing to 5.0. I think we only need to handle function name colliding case. This could be simply be resolved by coupling function name with file name. Yes, this could be done. After a second thought, I think we actually need to expose einliner interface to autofdo (instead of doing vpt in early inliner). This is because we need to mark icall as promoted, so that later vpt passes will not try to promote it again. Currently in AutoFDO, we use a self-contained way (use a set to mark all promoted stmts). If we do vpt in einline, then we will need some flag attached to gimple to indicate whether it's already promoted. If we manage to remove the iteration loop that repeativly applies the inline plan, you only need way to mark edge as promoted. I guess you can just test the speculative flag on the edge to skip ones you dealt with earlier. Lets do that incrementally though. Agree, let's do it incrementally. + +@item -fauto-profile +@itemx -fauto-profile=@var{path} +@opindex fauto-profile +Enable sampling based feedback directed optimizations, and optimizations +generally profitable only with profile feedback available. + +The following options are enabled: @code{-fbranch-probabilities}, @code{-fvpt}, +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, @code{-ftree-vectorize},
Re: [PATCH] AutoFDO patch for trunk
+/* Member functions for string_table. */ + +string_table * +string_table::create () Why this is not a constructor? We use static initializer because it's not suggested to put too much logic in constructor. Why not? :) +} The two hunks probably can be unified. Why get_index_by_decl does not already use dwarf name and why do you need to consider both? get_index_by_decl is actually a wrapper of get_index. It tolerates error when assembler name is not emitted in the debug binary (mostly for functions that are fully inlined). In this case, we will first try to find the index of the assembler name, if not found, then bfd name. If still not found, then we try to find name from its abstract location. Yep, I am just somewhat concerned that you need to try different variations of the name. I would expect the assembler name (minus the random seed mangling) to be sufficient. + if (DECL_ABSTRACT_ORIGIN (decl)) +return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN (decl)); What really happens for ipa-cp function clones and for split functions? Their name suffix will be stripped before matching names. I see and profile merged, that seems resonable. I am not sure it is a win to have the things represented as VPT histograms when you shoot for quite special purpose problem. But lets deal with these incrementally, too. (in general I do not see why you try to share so much with the gcov based VTP - it seems easier to do the speculation by hand) Basically there are 2 problem: * before annotation. I agree this is a special purpose problem * after annotation. This should be the same problem as VPT. Yep, I think adding VPT histograms to get devirtualization happen inter-procedurally with LTO is a good idea. You will need (incrementally I guess) to translate your indexes into the profile-id used or you will need to initialize profile-ids accordingly, so the IPA pass is able to identify the target cross-module. +{ + if (gcov_open (auto_profile_file, 1) == 0) +error (Cannot open profile file %s., auto_profile_file); + + if (gcov_read_unsigned () != GCOV_DATA_MAGIC) +error (AutoFDO profile magic number does not mathch.); + + /* Skip the version number. */ + gcov_read_unsigned (); + + /* Skip the empty integer. */ + gcov_read_unsigned (); Perhaps we can just check the values? What do you mean? Check the value against 0? Yes, if you have version in there it seems to make sense to check it ;) I guess it is there so you can add extra stuff into the file format that will make it incompatible with current one, so probably we should reject all versions greater than 0. (especially because the tool is off-tree) Does this have chance to work inter-module? Yes, it works fine with LIPO. Important (for me) is to make it work with LTO, because LIPO is apparently not landing to 5.0. Yes, this could be done. After a second thought, I think we actually need to expose einliner interface to autofdo (instead of doing vpt in early inliner). This is because we need to mark icall as promoted, so that later vpt passes will not try to promote it again. Currently in AutoFDO, we use a self-contained way (use a set to mark all promoted stmts). If we do vpt in einline, then we will need some flag attached to gimple to indicate whether it's already promoted. If we manage to remove the iteration loop that repeativly applies the inline plan, you only need way to mark edge as promoted. I guess you can just test the speculative flag on the edge to skip ones you dealt with earlier. Lets do that incrementally though. + +@item -fauto-profile +@itemx -fauto-profile=@var{path} +@opindex fauto-profile +Enable sampling based feedback directed optimizations, and optimizations +generally profitable only with profile feedback available. + +The following options are enabled: @code{-fbranch-probabilities}, @code{-fvpt}, +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, @code{-ftree-vectorize}, +@code{ftree-loop-distribute-patterns} I wonder about loop peeling - we do not enale it by default because we do not believe our static branch predictor can well identify loops that have low expected trip count (except ones that can be fully inlined). I am not sure auto-FDO is much better here - how reliable the info about number of iteratoins is? But this is again something that can be handled separately. Documentation seems out of date though - it seems to enable also -finline, -fipa-cp and cloning, predictive commoning, unswitching, gcse-after-reload, You do not want to enable reorder-functions because autoFDO does not include the time profiler. Also I think speculative devirtualization should not be disabled. + +If @var{path} is specified, GCC looks at the @var{path} to find +the profile feedback data files. + +In order to collect AutoFDO profile, you need to
Re: [PATCH] AutoFDO patch for trunk
Hi, Honza, I've integrated all your comments to the patch. New patch attached. Thanks, Dehao On Wed, Oct 15, 2014 at 7:28 AM, Jan Hubicka hubi...@ucw.cz wrote: Index: gcc/cfgloop.c === --- gcc/cfgloop.c (revision 215826) +++ gcc/cfgloop.c (working copy) @@ -1802,7 +1802,7 @@ record_niter_bound (struct loop *loop, const wides } if (realistic (!loop-any_estimate - || wi::ltu_p (i_bound, loop-nb_iterations_estimate))) + || (!flag_auto_profile wi::ltu_p (i_bound, loop-nb_iterations_estimate I do not remember this hunk from previous patches. So you make autofdo to drop the conditional that new estimate must be smaller than preivously derived. I suppose it is because you want autofdo estimates to be overwritten by anything determined later. Lets drop this hunk now, too. I think we really want to have way to track if the estimate is from autofdo and thus not too reliable but better than nothing and we may want to introduce some kind of reliability metric on way estimates are derived. Lets handle this incrementally and drop this change from this patch. Done. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 215826) +++ gcc/doc/invoke.texi (working copy) @@ -365,7 +365,8 @@ Objective-C and Objective-C++ Dialects}. @gccoptlist{-faggressive-loop-optimizations -falign-functions[=@var{n}] @gol -falign-jumps[=@var{n}] @gol -falign-labels[=@var{n}] -falign-loops[=@var{n}] @gol --fassociative-math -fauto-inc-dec -fbranch-probabilities @gol +-fassociative-math -fauto-profile -fauto-profile[=@var{path}] @gol +-fauto-inc-dec -fbranch-probabilities @gol -fbranch-target-load-optimize -fbranch-target-load-optimize2 @gol -fbtr-bb-exclusive -fcaller-saves @gol -fcheck-data-deps -fcombine-stack-adjustments -fconserve-stack @gol @@ -9164,6 +9165,19 @@ code. If @var{path} is specified, GCC looks at the @var{path} to find the profile feedback data files. See @option{-fprofile-dir}. + +@item -fauto-profile +@itemx -fauto-profile=@var{path} +@opindex fauto-profile +Enable sampling based feedback directed optimizations, and optimizations +generally profitable only with profile feedback available. + +The following options are enabled: @code{-fbranch-probabilities}, @code{-fvpt}, +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, @code{-ftree-vectorize}, +@code{ftree-loop-distribute-patterns} + +If @var{path} is specified, GCC looks at the @var{path} to find +the profile feedback data files. I agree with Andi that you need to write more here - basically it should be clear how to profile the application and use the data. We do use external tools here (oprofile/gperf/non-Linux tools I suppose) plus an conversion tool. So you need to provide some references to those. Actually I think it would be really nice to add tutorial-like chapter into GCC manual about using FDO/autoFDO to improve your code or even more general aimed covering advanced topics like correct LTO use, profile feedback and other fun. Done, added some little example to demonstrate how to collect profile. Index: gcc/value-prof.c === --- gcc/value-prof.c (revision 215826) +++ gcc/value-prof.c (working copy) @@ -134,11 +134,10 @@ static bool gimple_divmod_fixed_value_transform (g static bool gimple_mod_pow2_value_transform (gimple_stmt_iterator *); static bool gimple_mod_subtract_transform (gimple_stmt_iterator *); static bool gimple_stringops_transform (gimple_stmt_iterator *); -static bool gimple_ic_transform (gimple_stmt_iterator *); /* Allocate histogram value. */ -static histogram_value +histogram_value gimple_alloc_histogram_value (struct function *fun ATTRIBUTE_UNUSED, enum hist_type type, gimple stmt, tree value) Do we still need to expor those? Yes, as soon as we are faking histogram. { @@ -1307,6 +1306,9 @@ del_node_map (void) struct cgraph_node* find_func_by_profile_id (int profile_id) { + if (flag_auto_profile) +return cgraph_node::get_for_asmname ( + get_identifier ((const char *)(size_t)profile_id)); This definitly needs an comment. Why you need to use find_func_by_profile_id on both gcov profiling and auto profiling paths? Perhaps you can use your own fnuction here? Or perhaps this is not necessary if you implement directly the indirect call optimiization? Done We may try to head towards of having way to read both coverage and auto profile into compiler eventually. It may be useful to get extra precision from gcov profile and to measure data like cache misses with auto profile. cgraph_node **val = cgraph_node_map-get (profile_id); if (val) return *val; @@ -1320,7 +1322,7 @@
Re: [PATCH] AutoFDO patch for trunk
Index: gcc/cfgloop.c === --- gcc/cfgloop.c (revision 215826) +++ gcc/cfgloop.c (working copy) @@ -1802,7 +1802,7 @@ record_niter_bound (struct loop *loop, const wides } if (realistic (!loop-any_estimate - || wi::ltu_p (i_bound, loop-nb_iterations_estimate))) + || (!flag_auto_profile wi::ltu_p (i_bound, loop-nb_iterations_estimate I do not remember this hunk from previous patches. So you make autofdo to drop the conditional that new estimate must be smaller than preivously derived. I suppose it is because you want autofdo estimates to be overwritten by anything determined later. Lets drop this hunk now, too. I think we really want to have way to track if the estimate is from autofdo and thus not too reliable but better than nothing and we may want to introduce some kind of reliability metric on way estimates are derived. Lets handle this incrementally and drop this change from this patch. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 215826) +++ gcc/doc/invoke.texi (working copy) @@ -365,7 +365,8 @@ Objective-C and Objective-C++ Dialects}. @gccoptlist{-faggressive-loop-optimizations -falign-functions[=@var{n}] @gol -falign-jumps[=@var{n}] @gol -falign-labels[=@var{n}] -falign-loops[=@var{n}] @gol --fassociative-math -fauto-inc-dec -fbranch-probabilities @gol +-fassociative-math -fauto-profile -fauto-profile[=@var{path}] @gol +-fauto-inc-dec -fbranch-probabilities @gol -fbranch-target-load-optimize -fbranch-target-load-optimize2 @gol -fbtr-bb-exclusive -fcaller-saves @gol -fcheck-data-deps -fcombine-stack-adjustments -fconserve-stack @gol @@ -9164,6 +9165,19 @@ code. If @var{path} is specified, GCC looks at the @var{path} to find the profile feedback data files. See @option{-fprofile-dir}. + +@item -fauto-profile +@itemx -fauto-profile=@var{path} +@opindex fauto-profile +Enable sampling based feedback directed optimizations, and optimizations +generally profitable only with profile feedback available. + +The following options are enabled: @code{-fbranch-probabilities}, @code{-fvpt}, +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, @code{-ftree-vectorize}, +@code{ftree-loop-distribute-patterns} + +If @var{path} is specified, GCC looks at the @var{path} to find +the profile feedback data files. I agree with Andi that you need to write more here - basically it should be clear how to profile the application and use the data. We do use external tools here (oprofile/gperf/non-Linux tools I suppose) plus an conversion tool. So you need to provide some references to those. Actually I think it would be really nice to add tutorial-like chapter into GCC manual about using FDO/autoFDO to improve your code or even more general aimed covering advanced topics like correct LTO use, profile feedback and other fun. Index: gcc/value-prof.c === --- gcc/value-prof.c (revision 215826) +++ gcc/value-prof.c (working copy) @@ -134,11 +134,10 @@ static bool gimple_divmod_fixed_value_transform (g static bool gimple_mod_pow2_value_transform (gimple_stmt_iterator *); static bool gimple_mod_subtract_transform (gimple_stmt_iterator *); static bool gimple_stringops_transform (gimple_stmt_iterator *); -static bool gimple_ic_transform (gimple_stmt_iterator *); /* Allocate histogram value. */ -static histogram_value +histogram_value gimple_alloc_histogram_value (struct function *fun ATTRIBUTE_UNUSED, enum hist_type type, gimple stmt, tree value) Do we still need to expor those? { @@ -1307,6 +1306,9 @@ del_node_map (void) struct cgraph_node* find_func_by_profile_id (int profile_id) { + if (flag_auto_profile) +return cgraph_node::get_for_asmname ( + get_identifier ((const char *)(size_t)profile_id)); This definitly needs an comment. Why you need to use find_func_by_profile_id on both gcov profiling and auto profiling paths? Perhaps you can use your own fnuction here? Or perhaps this is not necessary if you implement directly the indirect call optimiization? We may try to head towards of having way to read both coverage and auto profile into compiler eventually. It may be useful to get extra precision from gcov profile and to measure data like cache misses with auto profile. cgraph_node **val = cgraph_node_map-get (profile_id); if (val) return *val; @@ -1320,7 +1322,7 @@ find_func_by_profile_id (int profile_id) may ICE. Here we only do very minimal sanity check just to make compiler happy. Returns true if TARGET is considered ok for call CALL_STMT. */ -static bool +bool check_ic_target (gimple call_stmt, struct cgraph_node *target) { location_t locus; @@ -1483,7 +1485,7 @@
Re: [PATCH] AutoFDO patch for trunk
Index: gcc/cgraphclones.c === --- gcc/cgraphclones.c(revision 215826) +++ gcc/cgraphclones.c(working copy) @@ -453,6 +453,11 @@ } else count_scale = 0; + /* In AutoFDO, if edge count is larger than callee's entry block + count, we will not update the original callee because it may + mistakenly mark some hot function as cold. */ + if (flag_auto_profile gcov_count = count) +update_original = false; lets drop this from initial patch. Index: gcc/bb-reorder.c === --- gcc/bb-reorder.c (revision 215826) +++ gcc/bb-reorder.c (working copy) @@ -1569,15 +1569,14 @@ /* Mark which partition (hot/cold) each basic block belongs in. */ FOR_EACH_BB_FN (bb, cfun) { - bool cold_bb = false; + bool cold_bb = probably_never_executed_bb_p (cfun, bb); and this too (basically all the tweaks should IMO go in independently and ideally in a way that does not need flag_auto_profile test). +/* Return true if BB contains indirect call. */ + +static bool +has_indirect_call (basic_block bb) +{ + gimple_stmt_iterator gsi; + + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (gsi)) +{ + gimple stmt = gsi_stmt (gsi); + if (gimple_code (stmt) == GIMPLE_CALL +(gimple_call_fn (stmt) == NULL + || TREE_CODE (gimple_call_fn (stmt)) != FUNCTION_DECL)) You probably want to skip gimple_call_internal_p calls here. + +/* From AutoFDO profiles, find values inside STMT for that we want to measure + histograms for indirect-call optimization. */ + +static void +afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map, + bool transform) +{ + gimple stmt = gsi_stmt (*gsi); + tree callee; + + if (map.size() == 0 || gimple_code (stmt) != GIMPLE_CALL + || gimple_call_fndecl (stmt) != NULL_TREE) +return; + + callee = gimple_call_fn (stmt); + + histogram_value hist = gimple_alloc_histogram_value ( + cfun, HIST_TYPE_INDIR_CALL, stmt, callee); + hist-n_counters = 3; + hist-hvalue.counters = XNEWVEC (gcov_type, hist-n_counters); + gimple_add_histogram_value (cfun, stmt, hist); + + gcov_type total = 0; + icall_target_map::const_iterator max_iter = map.end(); + + for (icall_target_map::const_iterator iter = map.begin(); + iter != map.end(); ++iter) +{ + total += iter-second; + if (max_iter == map.end() || max_iter-second iter-second) + max_iter = iter; +} + + hist-hvalue.counters[0] = (unsigned long long) + afdo_string_table-get_name (max_iter-first); + hist-hvalue.counters[1] = max_iter-second; + hist-hvalue.counters[2] = total; + + if (!transform) +return; + + if (gimple_ic_transform (gsi)) +{ + struct cgraph_edge *indirect_edge = + cgraph_node::get (current_function_decl)-get_edge (stmt); + struct cgraph_node *direct_call = + find_func_by_profile_id ((int)hist-hvalue.counters[0]); + if (DECL_STRUCT_FUNCTION (direct_call-decl) == NULL) + return; + struct cgraph_edge *new_edge = + indirect_edge-make_speculative (direct_call, 0, 0); + new_edge-redirect_call_stmt_to_callee (); + gimple_remove_histogram_value (cfun, stmt, hist); + inline_call (new_edge, true, NULL, NULL, false); + return; +} + return; Is it necessary to go via histogram and gimple_ic_transform here? I would expect that all you need is to make the speculative edge and inline it. (bypassing the work of producing fake histogram value and calling igmple_ic_transofrm on it) Also it seems to me that you want to set direct_count nad frequency argument of make_speculative so the resulting function profile is not off. The rest of interfaces seems quite sane now. Can you please look into using speculative edges directly instead of hooking into the vpt infrastructure and fixing the formatting issues of the new pass? I will try to make another pass over the actual streaming logic that I find bit difficult to read, but I quite trust you it does the right thing ;) Honza
Re: [PATCH] AutoFDO patch for trunk
On Tue, Oct 14, 2014 at 8:02 AM, Jan Hubicka hubi...@ucw.cz wrote: Index: gcc/cgraphclones.c === --- gcc/cgraphclones.c(revision 215826) +++ gcc/cgraphclones.c(working copy) @@ -453,6 +453,11 @@ } else count_scale = 0; + /* In AutoFDO, if edge count is larger than callee's entry block + count, we will not update the original callee because it may + mistakenly mark some hot function as cold. */ + if (flag_auto_profile gcov_count = count) +update_original = false; lets drop this from initial patch. Done Index: gcc/bb-reorder.c === --- gcc/bb-reorder.c (revision 215826) +++ gcc/bb-reorder.c (working copy) @@ -1569,15 +1569,14 @@ /* Mark which partition (hot/cold) each basic block belongs in. */ FOR_EACH_BB_FN (bb, cfun) { - bool cold_bb = false; + bool cold_bb = probably_never_executed_bb_p (cfun, bb); and this too (basically all the tweaks should IMO go in independently and ideally in a way that does not need flag_auto_profile test). Done. +/* Return true if BB contains indirect call. */ + +static bool +has_indirect_call (basic_block bb) +{ + gimple_stmt_iterator gsi; + + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (gsi)) +{ + gimple stmt = gsi_stmt (gsi); + if (gimple_code (stmt) == GIMPLE_CALL +(gimple_call_fn (stmt) == NULL + || TREE_CODE (gimple_call_fn (stmt)) != FUNCTION_DECL)) You probably want to skip gimple_call_internal_p calls here. Done + +/* From AutoFDO profiles, find values inside STMT for that we want to measure + histograms for indirect-call optimization. */ + +static void +afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map, + bool transform) +{ + gimple stmt = gsi_stmt (*gsi); + tree callee; + + if (map.size() == 0 || gimple_code (stmt) != GIMPLE_CALL + || gimple_call_fndecl (stmt) != NULL_TREE) +return; + + callee = gimple_call_fn (stmt); + + histogram_value hist = gimple_alloc_histogram_value ( + cfun, HIST_TYPE_INDIR_CALL, stmt, callee); + hist-n_counters = 3; + hist-hvalue.counters = XNEWVEC (gcov_type, hist-n_counters); + gimple_add_histogram_value (cfun, stmt, hist); + + gcov_type total = 0; + icall_target_map::const_iterator max_iter = map.end(); + + for (icall_target_map::const_iterator iter = map.begin(); + iter != map.end(); ++iter) +{ + total += iter-second; + if (max_iter == map.end() || max_iter-second iter-second) + max_iter = iter; +} + + hist-hvalue.counters[0] = (unsigned long long) + afdo_string_table-get_name (max_iter-first); + hist-hvalue.counters[1] = max_iter-second; + hist-hvalue.counters[2] = total; + + if (!transform) +return; + + if (gimple_ic_transform (gsi)) +{ + struct cgraph_edge *indirect_edge = + cgraph_node::get (current_function_decl)-get_edge (stmt); + struct cgraph_node *direct_call = + find_func_by_profile_id ((int)hist-hvalue.counters[0]); + if (DECL_STRUCT_FUNCTION (direct_call-decl) == NULL) + return; + struct cgraph_edge *new_edge = + indirect_edge-make_speculative (direct_call, 0, 0); + new_edge-redirect_call_stmt_to_callee (); + gimple_remove_histogram_value (cfun, stmt, hist); + inline_call (new_edge, true, NULL, NULL, false); + return; +} + return; Is it necessary to go via histogram and gimple_ic_transform here? I would expect that all you need is to make the speculative edge and inline it. (bypassing the work of producing fake histogram value and calling igmple_ic_transofrm on it) Also it seems to me that you want to set direct_count nad frequency argument of make_speculative so the resulting function profile is not off. This function is actually served for 2 purposes: * before annotation, we need to mark histogram, promote and inline * after annotation, we just need to mark, and let follow-up logic to decide if it needs to promote and inline. And you are right, for the before annotation case, we can simply call mark speculative and inline. But we still need the logic to fake histogram for after annotation case. As a result, I unified two cases into one function to reuse code as much as possible. Shall I separate it into two functions instead? The rest of interfaces seems quite sane now. Can you please look into using speculative edges directly instead of hooking into the vpt infrastructure and fixing the formatting issues of the new pass? I'll work on the formatting issues now (need to learn the format first ;-). The attached patch is up-to-date except for formatting changes. I'll upload the patch again once the format change is in. Thanks, Dehao I will try to make another pass over
Re: [PATCH] AutoFDO patch for trunk
The new patch is attached. I used clang-format for format auto-profile.{c|h} Thanks, Dehao On Tue, Oct 14, 2014 at 2:05 PM, Dehao Chen de...@google.com wrote: On Tue, Oct 14, 2014 at 8:02 AM, Jan Hubicka hubi...@ucw.cz wrote: Index: gcc/cgraphclones.c === --- gcc/cgraphclones.c(revision 215826) +++ gcc/cgraphclones.c(working copy) @@ -453,6 +453,11 @@ } else count_scale = 0; + /* In AutoFDO, if edge count is larger than callee's entry block + count, we will not update the original callee because it may + mistakenly mark some hot function as cold. */ + if (flag_auto_profile gcov_count = count) +update_original = false; lets drop this from initial patch. Done Index: gcc/bb-reorder.c === --- gcc/bb-reorder.c (revision 215826) +++ gcc/bb-reorder.c (working copy) @@ -1569,15 +1569,14 @@ /* Mark which partition (hot/cold) each basic block belongs in. */ FOR_EACH_BB_FN (bb, cfun) { - bool cold_bb = false; + bool cold_bb = probably_never_executed_bb_p (cfun, bb); and this too (basically all the tweaks should IMO go in independently and ideally in a way that does not need flag_auto_profile test). Done. +/* Return true if BB contains indirect call. */ + +static bool +has_indirect_call (basic_block bb) +{ + gimple_stmt_iterator gsi; + + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (gsi)) +{ + gimple stmt = gsi_stmt (gsi); + if (gimple_code (stmt) == GIMPLE_CALL +(gimple_call_fn (stmt) == NULL + || TREE_CODE (gimple_call_fn (stmt)) != FUNCTION_DECL)) You probably want to skip gimple_call_internal_p calls here. Done + +/* From AutoFDO profiles, find values inside STMT for that we want to measure + histograms for indirect-call optimization. */ + +static void +afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map, + bool transform) +{ + gimple stmt = gsi_stmt (*gsi); + tree callee; + + if (map.size() == 0 || gimple_code (stmt) != GIMPLE_CALL + || gimple_call_fndecl (stmt) != NULL_TREE) +return; + + callee = gimple_call_fn (stmt); + + histogram_value hist = gimple_alloc_histogram_value ( + cfun, HIST_TYPE_INDIR_CALL, stmt, callee); + hist-n_counters = 3; + hist-hvalue.counters = XNEWVEC (gcov_type, hist-n_counters); + gimple_add_histogram_value (cfun, stmt, hist); + + gcov_type total = 0; + icall_target_map::const_iterator max_iter = map.end(); + + for (icall_target_map::const_iterator iter = map.begin(); + iter != map.end(); ++iter) +{ + total += iter-second; + if (max_iter == map.end() || max_iter-second iter-second) + max_iter = iter; +} + + hist-hvalue.counters[0] = (unsigned long long) + afdo_string_table-get_name (max_iter-first); + hist-hvalue.counters[1] = max_iter-second; + hist-hvalue.counters[2] = total; + + if (!transform) +return; + + if (gimple_ic_transform (gsi)) +{ + struct cgraph_edge *indirect_edge = + cgraph_node::get (current_function_decl)-get_edge (stmt); + struct cgraph_node *direct_call = + find_func_by_profile_id ((int)hist-hvalue.counters[0]); + if (DECL_STRUCT_FUNCTION (direct_call-decl) == NULL) + return; + struct cgraph_edge *new_edge = + indirect_edge-make_speculative (direct_call, 0, 0); + new_edge-redirect_call_stmt_to_callee (); + gimple_remove_histogram_value (cfun, stmt, hist); + inline_call (new_edge, true, NULL, NULL, false); + return; +} + return; Is it necessary to go via histogram and gimple_ic_transform here? I would expect that all you need is to make the speculative edge and inline it. (bypassing the work of producing fake histogram value and calling igmple_ic_transofrm on it) Also it seems to me that you want to set direct_count nad frequency argument of make_speculative so the resulting function profile is not off. This function is actually served for 2 purposes: * before annotation, we need to mark histogram, promote and inline * after annotation, we just need to mark, and let follow-up logic to decide if it needs to promote and inline. And you are right, for the before annotation case, we can simply call mark speculative and inline. But we still need the logic to fake histogram for after annotation case. As a result, I unified two cases into one function to reuse code as much as possible. Shall I separate it into two functions instead? The rest of interfaces seems quite sane now. Can you please look into using speculative edges directly instead of hooking into the vpt infrastructure and fixing the formatting issues of the new pass? I'll work on the formatting issues now (need to learn the format first ;-). The
Re: [PATCH] AutoFDO patch for trunk
Dehao Chen de...@google.com writes: + +@item -fauto-profile +@itemx -fauto-profile=@var{path} +@opindex fauto-profile +Enable sampling based feedback directed optimizations, and optimizations +generally profitable only with profile feedback available. + +The following options are enabled: @code{-fbranch-probabilities}, @code{-fvpt}, +@code{-funroll-loops}, @code{-fpeel-loops}, @code{-ftracer}, @code{-ftree-vectorize}, +@code{ftree-loop-distribute-patterns} This needs more description aimed end-users, what it is good for and why, and a pointer to the needed utilities and a short summary what steps they need to take. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] AutoFDO patch for trunk
Index: gcc/bb-reorder.c === --- gcc/bb-reorder.c (revision 210180) +++ gcc/bb-reorder.c (working copy) @@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun) user defined section attributes. Don't call it if either case arises. */ return (flag_reorder_blocks_and_partition -optimize +optimize !flag_auto_profile Formatting error. Why we want !flag_auto_profile? Can't we just arrange flag_reorder_blocks_and_partition to default to false? In fact I would like to see flag_reorder_blocks_and_partition to offload obviously cold code (in partiuclar the EH cleanups) even without profile. That should work well with autofdo. This will cause bzip2 performance to degrade 6%. I haven't had time to triage the problem. Will investigate this later. Still I would preffer to make this by default flag_reorder_blocks_and_partition to false with auto_profile. We could do that incrementally, lets just drop this from initial patch. - if (flag_branch_probabilities) + if (flag_auto_profile) +init_auto_profile (); Perhaps initialization can happen from toplev.c instead? This is actually reading of afdo profile. I've renamed the function to make it consistent. OK. Index: gcc/cgraphclones.c === --- gcc/cgraphclones.c(revision 210180) +++ gcc/cgraphclones.c(working copy) @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec } else count_scale = 0; + /* In AutoFDO, if edge count is larger than callee's entry block + count, we will not update the original callee because it may + mistakenly mark some hot function as cold. */ + if (flag_auto_profile count = n-count) +update_original = false; This looks like a hack. Lets go with it later - I see what you are shooting for here and non-autoFDO has kind of same problem, too. Counts are not at all that reliable after some inlining. I wonder how this hack would fare for -fprofile-use It does not affect -fprofile-use performance in my testing. However, for AutoFDO, this happens much more often. Yep, lets however drop it from the initial patch and make the tweaks go in later. @@ -1286,6 +1285,9 @@ del_node_map (void) struct cgraph_node* find_func_by_profile_id (int profile_id) { + if (flag_auto_profile) +return cgraph_node_for_asm ( + get_identifier ((const char *)(size_t)profile_id)); I think with LTO we will need to make this machinery safe for symbol mangling. Any plans on this? No plan yet. Let's get AutoFDO in first, and spend some separate effort to make AutoFDO+LTO work? OK. Index: gcc/ipa-inline.h === --- gcc/ipa-inline.h (revision 210180) +++ gcc/ipa-inline.h (working copy) @@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge) edge_growth_cache[edge-uid] = zero; } } + +unsigned int early_inliner (function *fun); Can't this be handled by scheduling early inliner as subpass of autofdo or something similar? I would rather make this explicit in PM. See my comment below in auto-profile.c Index: gcc/auto-profile.c === --- gcc/auto-profile.c(revision 0) +++ gcc/auto-profile.c(revision 0) + After the above 3 phases, all profile is readily annotated on the GCC IR. + AutoFDO tries to reuse all FDO infrastructure as much as possible to make + use of the profile. E.g. it uses existing mechanism to calculate the basic + block/edge frequency, as well as the cgraph node/edge count. +*/ I suppose the phases can be made explicit to the PM, so early inliner can stay a pass. Unfortunately, this is more complicated: while(changed) { vpt(); einline(); annotate(); } Hmm, PM do not support iteration (though it wouldmake sense in some cases at least for expensive optimization). Why you need to iterate einline with the other two? +/* Store a string array, indexed by string position in the array. */ +class string_table { +public: + static string_table *create (); + + /* For a given string, returns its index. */ + int get_index (const char *name) const; + + /* For a given decl, returns the index of the decl name. */ + int get_index_by_decl (tree decl) const; + + /* For a given index, returns the string. */ + const char *get_name (int index) const; I wonder how these should work with LTO. I suppose we can tell user to not LTO for train run, we should teach the table to ignore LTO generated suffixes as well as random seeds (like coverage code already does). What happens when two static functions
Re: [PATCH] AutoFDO patch for trunk
This will cause bzip2 performance to degrade 6%. I haven't had time to triage the problem. Will investigate this later. Still I would preffer to make this by default flag_reorder_blocks_and_partition to false with auto_profile. We could do that incrementally, lets just drop this from initial patch. Sure, I'll remove that for now. Index: gcc/cgraphclones.c === --- gcc/cgraphclones.c(revision 210180) +++ gcc/cgraphclones.c(working copy) @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec } else count_scale = 0; + /* In AutoFDO, if edge count is larger than callee's entry block + count, we will not update the original callee because it may + mistakenly mark some hot function as cold. */ + if (flag_auto_profile count = n-count) +update_original = false; This looks like a hack. Lets go with it later - I see what you are shooting for here and non-autoFDO has kind of same problem, too. Counts are not at all that reliable after some inlining. I wonder how this hack would fare for -fprofile-use It does not affect -fprofile-use performance in my testing. However, for AutoFDO, this happens much more often. Yep, lets however drop it from the initial patch and make the tweaks go in later. OK Index: gcc/auto-profile.c === --- gcc/auto-profile.c(revision 0) +++ gcc/auto-profile.c(revision 0) + After the above 3 phases, all profile is readily annotated on the GCC IR. + AutoFDO tries to reuse all FDO infrastructure as much as possible to make + use of the profile. E.g. it uses existing mechanism to calculate the basic + block/edge frequency, as well as the cgraph node/edge count. +*/ I suppose the phases can be made explicit to the PM, so early inliner can stay a pass. Unfortunately, this is more complicated: while(changed) { vpt(); einline(); annotate(); } Hmm, PM do not support iteration (though it wouldmake sense in some cases at least for expensive optimization). Why you need to iterate einline with the other two? It's for iterative AFDO. i.e. the profile could be collected from AFDO-optimized binary. E.g. main() ind_call foo() foo() ind_call bar() First, profile is collected from O2, thus foo will be promoted and inlined into main. Then we profile on this AFDO-optimized binary, before annotation, we need to make sure that foo is promoted and inlined main. And in the next iteration, before annotation, we need to make sure bar is promoted and inlined into foo, which is promoted and inlined into main. +/* Store a string array, indexed by string position in the array. */ +class string_table { +public: + static string_table *create (); + + /* For a given string, returns its index. */ + int get_index (const char *name) const; + + /* For a given decl, returns the index of the decl name. */ + int get_index_by_decl (tree decl) const; + + /* For a given index, returns the string. */ + const char *get_name (int index) const; I wonder how these should work with LTO. I suppose we can tell user to not LTO for train run, we should teach the table to ignore LTO generated suffixes as well as random seeds (like coverage code already does). What happens when two static functions have same name? Shall we first focus on non-LTO version. And there could be separate effort to make AutoFDO+LTO work? For static functions, there will only be one copy in the profile. the profile will be merged. I agree that this could potentially cause problem. But OTOH, in a good practice, these functions should *not* be hot functions. At least in our practice, we didn't see any hot functions like this. You mean that static functions in differnt units that have same name tends to be not hot? I don't really follow it... I mean that if two static functions are both hot, it's better for them to have different names. Otherwise performance tuning/debuging would be super hard. I agree. My original intention is to make the function name handling as simple as possible. The current impl is based on the asumption that all important (hot) functions deserve a separate name (not just suffix difference). We could definitely refine the function name handling in the future patches. I would definitely merge logic here with logic in gcov that already knows how to skip random seeds and other cases that makes filenames to diverge... Could you point me to the gcov logic that handles skipping function name suffix? See coverage_checksum_string in coverage.c Looks like that function is filling in randomly generated numbers by 0. This was not what I want because compiler may not even generate these random numbers. E.g. for a given function, compiler
Re: [PATCH] AutoFDO patch for trunk
Hi, Honza, Sorry for the delay. I just picked up the original patch, and updated it with your comments. I've addressed most of your comments. Something else to discuss inlined. I had refactored the patch to make it much less intrusive. New patch is attached (ChangeLog will be added in the final version of the patch). The performance of the new patch is the same as the original patch. Thanks, Dehao Index: gcc/bb-reorder.c === --- gcc/bb-reorder.c (revision 210180) +++ gcc/bb-reorder.c (working copy) @@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun) user defined section attributes. Don't call it if either case arises. */ return (flag_reorder_blocks_and_partition -optimize +optimize !flag_auto_profile Formatting error. Why we want !flag_auto_profile? Can't we just arrange flag_reorder_blocks_and_partition to default to false? In fact I would like to see flag_reorder_blocks_and_partition to offload obviously cold code (in partiuclar the EH cleanups) even without profile. That should work well with autofdo. This will cause bzip2 performance to degrade 6%. I haven't had time to triage the problem. Will investigate this later. Index: gcc/coverage.c === --- gcc/coverage.c(revision 210180) +++ gcc/coverage.c(working copy) @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see #include intl.h #include filenames.h #include target.h +#include auto-profile.h #include gcov-io.h #include gcov-io.c @@ -1175,7 +1176,9 @@ coverage_init (const char *filename) bbg_file_stamp = local_tick; - if (flag_branch_probabilities) + if (flag_auto_profile) +init_auto_profile (); Perhaps initialization can happen from toplev.c instead? This is actually reading of afdo profile. I've renamed the function to make it consistent. Index: gcc/cgraphclones.c === --- gcc/cgraphclones.c(revision 210180) +++ gcc/cgraphclones.c(working copy) @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec } else count_scale = 0; + /* In AutoFDO, if edge count is larger than callee's entry block + count, we will not update the original callee because it may + mistakenly mark some hot function as cold. */ + if (flag_auto_profile count = n-count) +update_original = false; This looks like a hack. Lets go with it later - I see what you are shooting for here and non-autoFDO has kind of same problem, too. Counts are not at all that reliable after some inlining. I wonder how this hack would fare for -fprofile-use It does not affect -fprofile-use performance in my testing. However, for AutoFDO, this happens much more often. @@ -1286,6 +1285,9 @@ del_node_map (void) struct cgraph_node* find_func_by_profile_id (int profile_id) { + if (flag_auto_profile) +return cgraph_node_for_asm ( + get_identifier ((const char *)(size_t)profile_id)); I think with LTO we will need to make this machinery safe for symbol mangling. Any plans on this? No plan yet. Let's get AutoFDO in first, and spend some separate effort to make AutoFDO+LTO work? Index: gcc/ipa-inline.h === --- gcc/ipa-inline.h (revision 210180) +++ gcc/ipa-inline.h (working copy) @@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge) edge_growth_cache[edge-uid] = zero; } } + +unsigned int early_inliner (function *fun); Can't this be handled by scheduling early inliner as subpass of autofdo or something similar? I would rather make this explicit in PM. See my comment below in auto-profile.c Index: gcc/auto-profile.c === --- gcc/auto-profile.c(revision 0) +++ gcc/auto-profile.c(revision 0) + After the above 3 phases, all profile is readily annotated on the GCC IR. + AutoFDO tries to reuse all FDO infrastructure as much as possible to make + use of the profile. E.g. it uses existing mechanism to calculate the basic + block/edge frequency, as well as the cgraph node/edge count. +*/ I suppose the phases can be made explicit to the PM, so early inliner can stay a pass. Unfortunately, this is more complicated: while(changed) { vpt(); einline(); annotate(); } +/* Store a string array, indexed by string position in the array. */ +class string_table { +public: + static string_table *create (); + + /* For a given string, returns its index. */ + int get_index (const char *name) const; + + /* For a given decl, returns the index of the decl name. */ + int get_index_by_decl (tree decl) const; + + /* For a given index, returns the string. */ + const char
Re: [PATCH] AutoFDO patch for trunk
Hi, I'm planning to port the AutoFDO patch upstream. Attached is the prepared patch. You can also find the patch in http://codereview.appspot.com/99010043 I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks, the speedup comparison between O2, FDO and AutoFDO is as follows: Reference: o2 (1): auto_fdo (2): fdo Benchmark Base:Reference(1) (2) - spec/2006/int/C++/471.omnetpp 23.18 +3.11% +5.09% spec/2006/int/C++/473.astar 21.15 +6.79% +9.80% spec/2006/int/C++/483.xalancbmk 36.68 +11.56% +14.47% spec/2006/int/C/400.perlbench 34.57 +6.59% +18.56% spec/2006/int/C/401.bzip2 23.17 +0.95% +2.49% spec/2006/int/C/403.gcc 32.33 +8.27% +9.76% spec/2006/int/C/429.mcf 42.13 +4.72% +5.23% spec/2006/int/C/445.gobmk 26.53 -1.39% +0.05% spec/2006/int/C/456.hmmer 23.72 +7.12% +7.87% spec/2006/int/C/458.sjeng 26.17 +4.65% +6.04% spec/2006/int/C/462.libquantum57.23 +4.04% +1.42% spec/2006/int/C/464.h264ref46.3 +1.07% +8.97% geometric mean+4.73% +7.36% The results are very nice indeed. So basically it adds another 50% to static estimation, that is not bad. The patch is long and missing a changelog, I will add my comments and I think it would be great to break it up where possible - for some infrastructure preparation patches and the actual reading infrastructure. Index: gcc/cfghooks.c === --- gcc/cfghooks.c(revision 210180) +++ gcc/cfghooks.c(working copy) @@ -833,6 +833,9 @@ make_forwarder_block (basic_block bb, bool (*redir fallthru = split_block_after_labels (bb); dummy = fallthru-src; + dummy-count = 0; + dummy-frequency = 0; + fallthru-count = 0; bb = fallthru-dest; /* Redirect back edges we want to keep. */ @@ -842,20 +845,13 @@ make_forwarder_block (basic_block bb, bool (*redir if (redirect_edge_p (e)) { + dummy-frequency += EDGE_FREQUENCY (e); + dummy-count += e-count; + fallthru-count += e-count; ei_next (ei); continue; } - dummy-frequency -= EDGE_FREQUENCY (e); - dummy-count -= e-count; - if (dummy-frequency 0) - dummy-frequency = 0; - if (dummy-count 0) - dummy-count = 0; - fallthru-count -= e-count; - if (fallthru-count 0) - fallthru-count = 0; - Can you elaborate why these changes are needed? They look unrelated, so it is a bug in forwarder block profile updating? I do not see why source of the edge needs to have profile cleaned. Index: gcc/loop-unroll.c === --- gcc/loop-unroll.c (revision 210180) +++ gcc/loop-unroll.c (working copy) @@ -998,6 +998,13 @@ decide_unroll_runtime_iterations (struct loop *loo return; } + /* In AutoFDO, the profile is not accurate. If the calculated trip count + is larger than the header count, then the profile is not accurate + enough to make correct unroll decisions. */ + if (flag_auto_profile + expected_loop_iterations (loop) loop-header-count) +return; This is another independent change. It does make sense, but I would preffer to have it in a separate function (expected_loop_iterations_reliable_p) rather than inline in the code. There are other loop optimizations that are driven by this, and they should consistently use this predicate. In fact current -fprofile-use enabled -funroll-loops flag that enables all unrolling independently of reliablility of the profile. I suppose we need to imporove this scheme and have something liek unroll-loop-with-reliable-profile path that would default for profile feedback optimization. I also do not think your check makes a lot of sense. You want to compare loop header count and loop preheader or so, or this is not reliable enough for auto-FDO profiles? In any case, lets handle this separately of the main autoFDO patches (and not forget on it). Index: gcc/dwarf2out.c === --- gcc/dwarf2out.c (revision 210180) +++ gcc/dwarf2out.c (working copy) @@ -2481,6 +2481,40 @@ const struct gcc_debug_hooks dwarf2_debug_hooks = 1,/* start_end_main_source_file */ TYPE_SYMTAB_IS_DIE/* tree_type_symtab_field */ }; + +const struct gcc_debug_hooks auto_profile_debug_hooks = +{ + debug_nothing_charstar, + debug_nothing_charstar, + debug_nothing_void, + debug_nothing_int_charstar, + debug_nothing_int_charstar, + debug_nothing_int_charstar, + debug_nothing_int, +
Re: [PATCH] AutoFDO patch for trunk
Index: gcc/auto-profile.c === --- gcc/auto-profile.c(revision 0) +++ gcc/auto-profile.c(revision 0) @@ -0,0 +1,1584 @@ +/* Calculate branch probabilities, and basic block execution counts. Update the toplevel comment, too. + +/* The following routines implements AutoFDO optimization. + + This optimization uses sampling profiles to annotate basic block counts + and uses heuristics to estimate branch probabilities. + + There are three phases in AutoFDO: + + Phase 1: Read profile from the profile data file. + The following info is read from the profile datafile: + * string_table: a map between function name and its index. + * autofdo_source_profile: a map from function_instance name to + function_instance. This is represented as a forest of + function_instances. + * WorkingSet: a histogram of how many instructions are covered for a + given percentage of total cycles. + + Phase 2: Early inline. + Early inline uses autofdo_source_profile to find if a callsite is: + * inlined in the profiled binary. + * callee body is hot in the profiling run. + If both condition satisfies, early inline will inline the callsite + regardless of the code growth. + + Phase 3: Annotate control flow graph. + AutoFDO uses a separate pass to: + * Annotate basic block count + * Estimate branch probability + + After the above 3 phases, all profile is readily annotated on the GCC IR. + AutoFDO tries to reuse all FDO infrastructure as much as possible to make + use of the profile. E.g. it uses existing mechanism to calculate the basic + block/edge frequency, as well as the cgraph node/edge count. +*/ I suppose the phases can be made explicit to the PM, so early inliner can stay a pass. +/* Store a string array, indexed by string position in the array. */ +class string_table { +public: + static string_table *create (); + + /* For a given string, returns its index. */ + int get_index (const char *name) const; + + /* For a given decl, returns the index of the decl name. */ + int get_index_by_decl (tree decl) const; + + /* For a given index, returns the string. */ + const char *get_name (int index) const; I wonder how these should work with LTO. I suppose we can tell user to not LTO for train run, we should teach the table to ignore LTO generated suffixes as well as random seeds (like coverage code already does). What happens when two static functions have same name? +/* Profile for all functions. */ +class autofdo_source_profile { +public: + static autofdo_source_profile *create () +{ + autofdo_source_profile *map = new autofdo_source_profile (); I think we still want empty lines here. Otherwise I trust you on following GCC C++ coding standards :) + if (map-read ()) + return map; + delete map; + return NULL; +} +/* Helper functions. */ + +/* Return the original name of NAME: strip the suffix that starts + with '.' */ + +static const char *get_original_name (const char *name) +{ + char *ret = xstrdup (name); + char *find = strchr (ret, '.'); + if (find != NULL) +*find = 0; + return ret; +} Stripping all suffixes will likely make conflicts on static functions, right? Example like this will get you a conflict m() { void t(void) { } t(); } m2() { void t2(void) { } t2(); } that is of course not that important given that nested functions are not top priority but it would be nice to handle it gratefuly. I would definitely merge logic here with logic in gcov that already knows how to skip random seeds and other cases that makes filenames to diverge... + +/* Return the combined location, which is a 32bit integer in which + higher 16 bits stores the line offset of LOC to the start lineno + of DECL, The lower 16 bits stores the discrimnator. */ + +static unsigned +get_combined_location (location_t loc, tree decl) +{ + return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl)) 16); +} What happens on overflows here? I guess they are possible - 65536 lines is not infinity these days. The discriminator is added later? + +int +string_table::get_index (const char *name) const +{ + if (name == NULL) +return -1; + string_index_map::const_iterator iter = map_.find (name); + if (iter == map_.end()) +return -1; + else +return iter-second; +} + +int +string_table::get_index_by_decl (tree decl) const Comments missing here. + +/* Read the profile and create a function_instance with head count as + HEAD_COUNT. Recursively read callsites to create nested function_instances + too. STACK is used to track the recursive creation process. */ + +function_instance * +function_instance::read_function_instance ( +function_instance_stack *stack, gcov_type head_count) +{ + unsigned name =
Re: [PATCH] AutoFDO patch for trunk
Have you announced the autofdo profile tool to gcc list? David On Wed, May 7, 2014 at 2:24 PM, Dehao Chen de...@google.com wrote: Hi, I'm planning to port the AutoFDO patch upstream. Attached is the prepared patch. You can also find the patch in http://codereview.appspot.com/99010043 I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks, the speedup comparison between O2, FDO and AutoFDO is as follows: Reference: o2 (1): auto_fdo (2): fdo Benchmark Base:Reference(1) (2) - spec/2006/int/C++/471.omnetpp 23.18 +3.11% +5.09% spec/2006/int/C++/473.astar 21.15 +6.79% +9.80% spec/2006/int/C++/483.xalancbmk 36.68 +11.56% +14.47% spec/2006/int/C/400.perlbench 34.57 +6.59% +18.56% spec/2006/int/C/401.bzip2 23.17 +0.95% +2.49% spec/2006/int/C/403.gcc 32.33 +8.27% +9.76% spec/2006/int/C/429.mcf 42.13 +4.72% +5.23% spec/2006/int/C/445.gobmk 26.53 -1.39% +0.05% spec/2006/int/C/456.hmmer 23.72 +7.12% +7.87% spec/2006/int/C/458.sjeng 26.17 +4.65% +6.04% spec/2006/int/C/462.libquantum57.23 +4.04% +1.42% spec/2006/int/C/464.h264ref46.3 +1.07% +8.97% geometric mean+4.73% +7.36% The majority of the performance difference between AutoFDO and FDO comes from the lack of instruction level discriminator support. Cary Coutant is planning to port that patch upstream too. Please let me know if you have any question about this patch, and thanks in advance for reviewing such a huge patch. Dehao