Re: AutoFDO profile toolchain is open-sourced
Yes, it will. But it's not well tuned at all. I will start tuning it if I have free cycles. It would be great if the open-source community can also contribute to this tuning effort. If you could outline portions of code which need tuning or rewriting, that will help get started in this effort. Optimization passes in GCC are generally designed to work with any kind of edge profile they get. There are only a few cases where they do care about what profile is around. At the moment we consider two types of profiles - static (guessed) and FDO. For the static one we shut down use of profile info for some heuristics - for example we do not expect loop trip counts to be reliable in the profiles, because they are not. You can look for code checking profile_status_for_fn. Auto-FDO does not have a special value for profile_status_for_fn and it goes through the same code paths as FDO. Dehao has some patches for Auto-FDO tuning, but my impression is that he mostly got around it by just making the optimizer a bit more robust against nonsensical profiles, which is always good, since even FDO profiles can be wrong. BTW, Dehao, do you think you can submit these changes for this stage1? I suppose in this case we have yet another kind of profile that is less reliable than FDO and we need to start by simply benchmarking and looking for cases where this profile gets worse and handle them one by one :) Honza
Passsing unfinished types to gen_type_die_with_usage
Hello, as I mentioned yesterday on IRC, adding a check that only complete types have TYPE_BINFO defined triggers on this type:

record_type 0x76d24d20 __is_integer type_5 type_6 VOID
  align 8 symtab 0 alias set -1 canonical type 0x76d24d20
  fields const_decl 0x74a89930 __value
    type enumeral_type 0x76d260a8 ._8
      type integer_type 0x76d261f8 unsigned int unsigned SI
        size integer_cst 0x76d0e0a8 constant 32
        unit size integer_cst 0x76d0e0c0 constant 4
        align 32 symtab 0 alias set -1 canonical type 0x76d260a8 precision 32
        min integer_cst 0x76d0e0d8 0 max integer_cst 0x76d0e090 4294967295
      values tree_list 0x76d2e078
        purpose identifier_node 0x76ea9c08 __value
          bindings (nil) local bindings 0x76d2a640
        value const_decl 0x74a89930 __value
    context record_type 0x76d24d20 __is_integer
    chain type_decl 0x74aa4da8 ._8
    readonly constant VOID file /aux/hubicka/trunk-7/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/cpp_type_traits.h line 127 col 14
    align 1 context enumeral_type 0x76d260a8 ._8
    initial integer_cst 0x76eea1c8 0
  chain type_decl 0x74aa4a18 __is_integer
    type record_type 0x76d24dc8 __is_integer
    external nonlocal in_system_header suppress-debug decl_4 VOID file /aux/hubicka/trunk-7/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/cpp_type_traits.h line 126 col 5
    align 8 context record_type 0x76d24d20 __is_integer
    result record_type 0x76d24d20 __is_integer
  chain type_decl 0x74aa4da8 ._8
  context namespace_decl 0x76d08098 std
  full-name "struct std::__is_integer<long double>"
  n_parents=0 use_template=1 interface-unknown
  chain type_decl 0x74aa4688 __is_integer

This type has fields but no TYPE_SIZE, so it seems odd. It is an instance, not a template, so I think the problem is that layout_type was not called on it. I debugged the reason why we get a type with no TYPE_SIZE but with BINFO. The problem is that dwarf2out visits this type before it is finished.
The BINFO is added as:

#0  xref_basetypes (ref=0x76d24d20, base_list=0x0) at ../../gcc/cp/decl.c:12525
#1  0x0078b786 in instantiate_class_template_1 (type=0x76d24d20) at ../../gcc/cp/pt.c:9256
#2  0x0078d3e7 in instantiate_class_template (type=0x76d24d20) at ../../gcc/cp/pt.c:9650
#3  0x0088772c in complete_type (type=0x76d24d20) at ../../gcc/cp/typeck.c:146
#4  0x008dba41 in lookup_member (xbasetype=0x0, name=0x76ea9c08, protect=2, want_type=false, complain=3) at ../../gcc/cp/search.c:1213
#5  0x009544db in lookup_qualified_name (scope=0x76d24d20, name=0x76ea9c08, is_type_p=false, complain=false) at ../../gcc/cp/name-lookup.c:4540
#6  0x0079ca43 in tsubst_qualified_id (qualified_id=0x76d2a4d8, args=0x76d2caf0, complain=3, in_decl=0x74ac9a00, done=true, address_p=false) at ../../gcc/cp/pt.c:12717
#7  0x007a80df in tsubst_copy_and_build (t=0x76d2a4d8, args=0x76d2caf0, complain=3, in_decl=0x74ac9a00, function_p=false, integral_constant_expression_p=true) at ../../gcc/cp/pt.c:14760
#8  0x007a67b4 in tsubst_expr (t=0x76d2a4d8, args=0x76d2caf0, complain=3, in_decl=0x74ac9a00, integral_constant_expression_p=true) at ../../gcc/cp/pt.c:14367
#9  0x0078d491 in tsubst_template_arg (t=0x76d2a4d8, args=0x76d2caf0, complain=3, in_decl=0x74ac9a00) at ../../gcc/cp/pt.c:9669
#10 0x0077f3ca in coerce_template_parms (parms=0x76d2a550, args=0x76e83960, in_decl=0x74ac9a00, complain=3, require_all_args=true, use_default_args=true) at ../../gcc/cp/pt.c:7142
#11 0x0077fa17 in coerce_innermost_template_parms (parms=0x76d2a578, args=0x76e83960, in_decl=0x74ac9a00, complain=3, require_all_args=true, use_default_args=true) at ../../gcc/cp/pt.c:7252
#12 0x00782455 in lookup_template_class_1 (d1=0x76eab6e0, arglist=0x76e83960, in_decl=0x0, context=0x76e5bed8, entering_scope=0, complain=3) at ../../gcc/cp/pt.c:7727
#13 0x007854e4 in lookup_template_class (d1=0x74ac9a00, arglist=0x76e83960, in_decl=0x0, context=0x0, entering_scope=0, complain=35) at ../../gcc/cp/pt.c:8043
#14 0x008eaefe in finish_template_type (name=0x74ac9a00, args=0x76e83960, entering_scope=0) at ../../gcc/cp/semantics.c:3011
#15 0x0084c7a4 in cp_parser_template_id (parser=0x76be8c00, template_keyword_p=false, check_dependency_p=false, tag_type=class_type, is_declaration=false) at ../../gcc/cp/parser.c:13850

and dwarf2out visits it as context when processing another type:

#0  error (gmsgid=0x1abd1a8 "TYPE_BINFO and COMPLETE_TYPE_P mismatch") at ../../gcc/diagnostic.c:1135
#1  0x012f64c7 in verify_type
Re: g++ keeps unused objects with virtual functions
which shows how the global objects' initialization keeps things live. Early optimization turns it into

(static initializers for t.C) ()
{
  bb 2:
  NotUsedObject._vptr.CObject = &MEM[(void *)&_ZTV7CObject + 16B];
  return;
}

but we don't have any pass removing stores to globals that are never read. IPA reference computes this in some way but doesn't get to export it in a reasonable way - its transform stage could DSE those, though. We have write-only variable detection, but

NotUsedObject/4 (CObject NotUsedObject) @0x7f70d41fe580
  Type: variable definition analyzed
  Visibility: public
  References:
  Referring: _Z41__static_initialization_and_destruction_0ii/9 (addr)
  Availability: not-ready
  Varpool flags:

it stops on the fact that it believes the address of NotUsedObject is taken. Why is it not considered to be a load? Honza
Re: g++ keeps unused objects with virtual functions
which shows how the global objects' initialization keeps things live. Early optimization turns it into

(static initializers for t.C) ()
{
  bb 2:
  NotUsedObject._vptr.CObject = &MEM[(void *)&_ZTV7CObject + 16B];
  return;
}

but we don't have any pass removing stores to globals that are never read. IPA reference computes this in some way but doesn't get to export it in a reasonable way - its transform stage could DSE those, though. We have write-only variable detection, but

NotUsedObject/4 (CObject NotUsedObject) @0x7f70d41fe580
  Type: variable definition analyzed
  Visibility: public
  References:
  Referring: _Z41__static_initialization_and_destruction_0ii/9 (addr)
  Availability: not-ready
  Varpool flags:

it stops on the fact that it believes the address of NotUsedObject is taken. Why is it not considered to be a load?

Too early a dump, I guess. At mainline I get an empty constructor:

	.type	_GLOBAL__sub_I_NotUsedObject, @function
_GLOBAL__sub_I_NotUsedObject:
.LFB6:
	rep ret

not ideal, but still an improvement ;) The problem is that the write-only var is removed only at late optimization and then it is too late to remove the unreachable function. Honza
Re: AutoFDO profile toolchain is open-sourced
On Tue, Apr 7, 2015 at 9:45 AM, Ilya Palachev i.palac...@samsung.com wrote: In the mentioned README file it is said that In order to collect this profile, you will need to have an Intel CPU that have last branch record (LBR) support. Is this information obsolete? Chrome Canary builds use AutoFDO for ARMv7l (https://code.google.com/p/chromium/issues/detail?id=434587) It does not mean that the profile was recorded on an ARM system: they can gather perf.data on x86 and then produce a coverage file that is then used in ARM compiles. I tried it and it seems to work well. I must say I did not even try running AutoFDO myself (so I am happy to hear it works). My understanding is that you need LBR only to get indirect call profiling working (i.e. you want to know from where the indirect function is called). Depending on your application this may not be the most important thing to record (either you don't have indirect calls in hot paths or they are handled reasonably by speculative devirtualization). Some ARMs also have support for tracing jump pairs, right? Honza Sebastian
Re: AutoFDO profile toolchain is open-sourced
LBR is used for both CFG edge profiling and indirect call target value profiling. I see, that makes sense ;) I guess if we want to support profile collection on targets w/o this feature we could still use one of the algorithms that try to guess the edge profile from the BB profile. Honza
Re: lto bootstrap fails.
On Fri, Apr 10, 2015 at 11:18:39AM -0400, Trevor Saunders wrote: On Fri, Apr 10, 2015 at 03:59:19PM +0200, Toon Moene wrote: Like this: https://gcc.gnu.org/ml/gcc-testresults/2015-04/msg01086.html ODR rears its head again ... huh, why is c/c-lang.h getting included in files linked into cc1plus? that seems strange. readelf -wl cc1plus | grep c-lang.h doesn't show anything.

I tried to reproduce it and my bootstrap passes with the same options as Toon's. The following patch ought to be able to tell the particular translation unit causing the conflict.

Index: tree.c
===
--- tree.c	(revision 221977)
+++ tree.c	(working copy)
@@ -4679,6 +4679,8 @@ build_translation_unit_decl (tree name)
   tree tu = build_decl (UNKNOWN_LOCATION, TRANSLATION_UNIT_DECL,
 			name, NULL_TREE);
   TRANSLATION_UNIT_LANGUAGE (tu) = lang_hooks.name;
+  if (main_input_filename)
+    DECL_NAME (tu) = get_identifier (main_input_filename);
   vec_safe_push (all_translation_units, tu);
   return tu;
 }
Index: ipa-devirt.c
===
--- ipa-devirt.c	(revision 221977)
+++ ipa-devirt.c	(working copy)
@@ -979,6 +979,21 @@ warn_odr (tree t1, tree t2, tree st1, tr
     return;
   inform (DECL_SOURCE_LOCATION (decl2), reason);
 
+  tree name = TYPE_NAME (t1);
+  tree name1 = decl2;
+  /* See if we have info about the translation unit.  It may not be around
+     if types were already merged.  */
+  while (TREE_CODE (name) != TRANSLATION_UNIT_DECL)
+    name = TYPE_P (name) ? TYPE_CONTEXT (name) : DECL_CONTEXT (name);
+  while (TREE_CODE (name1) != TRANSLATION_UNIT_DECL)
+    name1 = TYPE_P (name1) ? TYPE_CONTEXT (name1) : DECL_CONTEXT (name1);
+  name = DECL_NAME (name);
+  name1 = DECL_NAME (name1);
+  if (name != name1 && name && name1)
+    inform (UNKNOWN_LOCATION, "Conflicting compilation units: %s and %s",
+	    IDENTIFIER_POINTER (name),
+	    IDENTIFIER_POINTER (name1));
+
   if (warned)
     *warned = true;
 }
Re: debug-early branch merged into mainline
On 06/06/2015 03:33 PM, Jan Hubicka wrote: Aldy, also at PPC64le LTO bootstrap (at gcc112) dies with: 0x104ae8f7 check_die ../../gcc/dwarf2out.c:5715

Hmmm... this is in the LTO/ltrans stage? If so, that's weird. The LTO path does not do the early DIE dance. Since check_die() is a new sanity check for DIEs, I wonder if this is a latent bug that was already there. How do I reproduce/do the LTO bootstrap thing?

I simply bootstrap with configure --enable-languages=all,obj-c++,go --disable-werror --with-build-config=bootstrap-lt It is during the ltrans stage. Honza

Aldy

0x104e4e1b dwarf2out_decl ../../gcc/dwarf2out.c:21886
0x104d8a87 dwarf2out_abstract_function ../../gcc/dwarf2out.c:18457
0x104dd3e7 gen_inlined_subroutine_die ../../gcc/dwarf2out.c:19891
0x104e11f7 gen_block_die ../../gcc/dwarf2out.c:20933
0x104e17b3 decls_for_scope ../../gcc/dwarf2out.c:21006
0x104da907 gen_subprogram_die ../../gcc/dwarf2out.c:19167
0x104e343b gen_decl_die ../../gcc/dwarf2out.c:21409
0x104e4dfb dwarf2out_decl ../../gcc/dwarf2out.c:21882
0x104e4e6f dwarf2out_function_decl ../../gcc/dwarf2out.c:21894
0x105ab99f rest_of_handle_final ../../gcc/final.c:4505
0x105abd27 execute ../../gcc/final.c:4547
Please submit a full bug report, with preprocessed source if appropriate.
Re: debug-early branch merged into mainline
On 06/06/2015 03:33 PM, Jan Hubicka wrote: Aldy, also at PPC64le LTO bootstrap (at gcc112) dies with: 0x104ae8f7 check_die ../../gcc/dwarf2out.c:5715 Hmmm... this is in the LTO/ltrans stage? If so, that's weird. The LTO path does not do the early DIE dance. Since check_die() is a new sanity check for DIEs, I wonder if this is a latent bug that was already there. How do I reproduce/do the LTO bootstrap thing? I simply bootstrap with configure --enable-languages=all,obj-c++,go --disable-werror --with-build-config=bootstrap-lt

Ah, that was not the full command line: configure --enable-languages=all,obj-c++,go --disable-werror --with-build-config=bootstrap-lto

I will try your check_die patch, but probably not until Monday afternoon - I have real-life deadlines till then. Honza
Re: debug-early branch merged into mainline
Aldy, also at PPC64le LTO bootstrap (at gcc112) dies with:

0x104ae8f7 check_die ../../gcc/dwarf2out.c:5715
0x104e4e1b dwarf2out_decl ../../gcc/dwarf2out.c:21886
0x104d8a87 dwarf2out_abstract_function ../../gcc/dwarf2out.c:18457
0x104dd3e7 gen_inlined_subroutine_die ../../gcc/dwarf2out.c:19891
0x104e11f7 gen_block_die ../../gcc/dwarf2out.c:20933
0x104e17b3 decls_for_scope ../../gcc/dwarf2out.c:21006
0x104da907 gen_subprogram_die ../../gcc/dwarf2out.c:19167
0x104e343b gen_decl_die ../../gcc/dwarf2out.c:21409
0x104e4dfb dwarf2out_decl ../../gcc/dwarf2out.c:21882
0x104e4e6f dwarf2out_function_decl ../../gcc/dwarf2out.c:21894
0x105ab99f rest_of_handle_final ../../gcc/final.c:4505
0x105abd27 execute ../../gcc/final.c:4547
Please submit a full bug report, with preprocessed source if appropriate.
GNU Tools Cauldron 2015 - Call for Participation
Hello, a reminder about this year's Cauldron (http://gcc.gnu.org/wiki/cauldron2015). The webpage was just updated with a list of accepted talks and BoFs as well as a brief review of the schedule to let you plan your travel. With 19 very relevant talks and 10 BoFs we will have a busy schedule. We are still able to accept last-minute submissions to be scheduled in parallel with other presentations. To register your abstract, send e-mail to tools-cauldron-ad...@googlegroups.com. Please do so early so we can adjust the schedule accordingly. If you submitted an abstract but it is not in the list, please let us know, too. If you intend to participate, but not necessarily present, please let us know as soon as possible. This will allow us to plan for space and food. Send a message to tools-cauldron-ad...@googlegroups.com stating your intent to participate. Jan
Re: getrlimit compatibility issues
> @@ -811,5 +806,9 @@ lto_symtab_prevailing_decl (tree decl)
>if (!ret)
> return decl;
> 
> + /* Do not replace a non-builtin with a builtin. */
> + if (is_builtin_fn (ret->decl))
> +return decl;
> +
Yep, this looks like a reasonable direction. It will break the one-declaration rule in a wilder sense than the current frontends do, because if a builtin wins as the prevailing declaration, we end up with no merging at all. I wonder if we don't want to always prevail to the first non-builtin variant? (And yep, I need to push out the syntactic aliases patch which will make duplicated decls well behaved; will try to do so soonish.) Honza
Re: getrlimit compatibility issues
> > Yep, this looks like a reasonable direction. It will break the one
> > declaration
> > rule in a wilder sense than the current frontends do, because if a
> > builtin
> > wins as the prevailing declaration, we end up with no merging at all.
> > I wonder if we don't want to always prevail to the first non-builtin variant?
> 
> I think we already try. But this can be tricky as seen by the following
> which is needed to finally fix LTO bootstrap on trunk ...
> 
> Basically there are some function decls that do not go through
> lto-symtab handling because they are not in any CUs cgraph but
> only exist because they are referred to from BLOCK abstract origins
> (and thus also LTRANS boundary streaming doesn't force them
> into the cgraph). For example during LTO bootstrap I see
> this from inlining of split functions where both split parts and
> the original function are optimized away before stream-out. Yep, I remember poking around this when I finally stopped my attempts to get abstract origins working with LTO :( > 
> We hit
> 
> static void
> gen_inlined_subroutine_die (tree stmt, dw_die_ref context_die)
> { 
> ... 
> /* Make sure any inlined functions are known to be inlineable. */
> gcc_checking_assert (DECL_ABSTRACT_P (decl)
>|| cgraph_function_possibly_inlined_p (decl));
> 
> a lot without the following fixes. I suppose we _can_ merge
> function decls which just differ in DECL_POSSIBLY_INLINED if
> we make sure to "merge" the flag during tree merging. But
> that's for a followup, below is for correctness. How is this going to work with early debug? If we merge one function to another, won't we move the references inside blocks to references to another block tree that is possibly incompatible with the one we expect? > 
> We were also missing to compare DECL_ABSTRACT_ORIGIN during
> tree merging (not sure if that caused any issue, I just ran
> into the clone abstract origin issue above and saw that). 
> > LTO bootstrapped and tested on x86_64-unknown-linux-gnu, I'll throw > it on regular bootstrap & test and then plan to commit it unless > you object that lto-symtab.c hunk... > > Richard. > > 2015-08-31 Richard Biener> > lto/ > * lto.c (compare_tree_sccs_1): Compare DECL_ABSTRACT_ORIGIN. > * lto-symtab.c (lto_symtab_merge): Merge DECL_POSSIBLY_INLINED flag. > (lto_symtab_prevailing_decl): Do not replace a decl that didn't > participate in merging with something else. > > Index: gcc/lto/lto.c > === > --- gcc/lto/lto.c (revision 227339) > +++ gcc/lto/lto.c (working copy) > @@ -1305,6 +1305,7 @@ compare_tree_sccs_1 (tree t1, tree t2, t >compare_tree_edges (DECL_SIZE (t1), DECL_SIZE (t2)); >compare_tree_edges (DECL_SIZE_UNIT (t1), DECL_SIZE_UNIT (t2)); >compare_tree_edges (DECL_ATTRIBUTES (t1), DECL_ATTRIBUTES (t2)); > + compare_tree_edges (DECL_ABSTRACT_ORIGIN (t1), DECL_ABSTRACT_ORIGIN > (t2)); >if ((code == VAR_DECL > || code == PARM_DECL) > && DECL_HAS_VALUE_EXPR_P (t1)) > Index: gcc/lto/lto-symtab.c > === > --- gcc/lto/lto-symtab.c (revision 227339) > +++ gcc/lto/lto-symtab.c (working copy) > @@ -312,6 +312,11 @@ lto_symtab_merge (symtab_node *prevailin > >if (TREE_CODE (decl) == FUNCTION_DECL) > { > + /* Merge decl state in both directions, we may still end up using > + the new decl. */ > + DECL_POSSIBLY_INLINED (prevailing_decl) |= DECL_POSSIBLY_INLINED > (decl); > + DECL_POSSIBLY_INLINED (decl) |= DECL_POSSIBLY_INLINED > (prevailing_decl); > + >if (warn_type_compatibility_p (TREE_TYPE (prevailing_decl), >TREE_TYPE (decl))) > return false; > @@ -798,6 +803,18 @@ lto_symtab_prevailing_decl (tree decl) >if (TREE_CODE (decl) == FUNCTION_DECL && DECL_ABSTRACT_P (decl)) > return decl; > > + /* When decl did not participate in symbol resolution leave it alone. > + This can happen when we streamed the decl as abstract origin > + from the block tree of inlining a partially inlined function. 
> + If all, the split function and the original function end up
> + optimized away early we do not put the abstract origin into the
> + ltrans boundary and we'll end up ICEing in
> + dwarf2out.c:gen_inlined_subroutine_die because we eventually
> + replace a decl with DECL_POSSIBLY_INLINED set with one without. */
> + if (TREE_CODE (decl) == FUNCTION_DECL
> + && ! cgraph_node::get (decl))
> +return decl; So here you care about not merging DECLs that are not part of symtab with decls that are part of symtab? I have a bit of a hard time understanding why that helps. The rest of the patch makes sense to me modulo the possible issues with moving the DECL_ABSTRACT_ORIGIN reference to an incompatible blob of DWARF info coming from another instance of the inlined function from another translation unit.
Re: getrlimit compatibility issues
> On Tue, 1 Sep 2015, Jan Hubicka wrote:
> 
> > > > Yep, this looks like a reasonable direction. It will break the one
> > > > declaration
> > > > rule in a wilder sense than the current frontends do, because if a
> > > > builtin
> > > > wins as the prevailing declaration, we end up with no merging at all.
> > > > I wonder if we don't want to always prevail to the first non-builtin
> > > > variant?
> > > 
> > > I think we already try. But this can be tricky as seen by the following
> > > which is needed to finally fix LTO bootstrap on trunk ...
> > > 
> > > Basically there are some function decls that do not go through
> > > lto-symtab handling because they are not in any CUs cgraph but
> > > only exist because they are referred to from BLOCK abstract origins
> > > (and thus also LTRANS boundary streaming doesn't force them
> > > into the cgraph). For example during LTO bootstrap I see
> > > this from inlining of split functions where both split parts and
> > > the original function are optimized away before stream-out.
> > 
> > Yep, I remember poking around this when I finally stopped my attempts
> > to get abstract origins working with LTO :(
> > 
> > > > We hit
> > > 
> > > static void
> > > gen_inlined_subroutine_die (tree stmt, dw_die_ref context_die)
> > > { 
> > > ... 
> > > /* Make sure any inlined functions are known to be inlineable. */
> > > gcc_checking_assert (DECL_ABSTRACT_P (decl)
> > >|| cgraph_function_possibly_inlined_p (decl));
> > > 
> > > a lot without the following fixes. I suppose we _can_ merge
> > > function decls which just differ in DECL_POSSIBLY_INLINED if
> > > we make sure to "merge" the flag during tree merging. But
> > > that's for a followup, below is for correctness.
> > 
> > How is this going to work with early debug? If we merge one function to
> > another, won't we move the references inside blocks to references to another
> > block tree that is possibly incompatible with the one we expect? 
> > Hmm, indeed, if we merge the abstract instance FUNCTION_DECL we end
> up adjusting the abstract origin DIE reference accordingly but as
> we don't merge BLOCKs the actual BLOCK abstract origins would refer
> to the "original" early debug abstract instance. Not sure if that
> will cause problems for the debug consumer - the abstract instances
> should be the same (literally, unless you play tricks with #ifdefs
> in different uses...).
> 
> > The rest of the patch makes sense to me modulo the possible issues with moving
> > the DECL_ABSTRACT_ORIGIN reference to an incompatible blob of DWARF info coming
> > from another instance of the inlined function from another translation unit.
> > 
> > I see you added a DECL_ABSTRACT_ORIGIN comparer, but that won't make block
> > numbers
> > to match.
> 
> Who cares about block numbers? Note the abstract origin comparer was
> for the clone abstract origins we have. Yep, no one in dwarf2out. My original attempt used it to refer to BLOCKs within a different function body (BLOCK_ABSTRACT_ORIGIN). We currently don't stream it, so we don't need to refer to it. In the longer run we will however need a way to make these references work... Merging them will be tricky as it would require comparing the functions' block trees. Especially after early inlining these tend to diverge, but perhaps the early debug is output before early inlining and thus this problem goes away and we only care about different ifdefs. Those are quite common, too, however, at least in Firefox :( Honza > 
> Richard.
Re: ipa vrp implementation in gcc
> On Mon, Jan 11, 2016 at 1:38 AM, Kugan > <kugan.vivekanandara...@linaro.org> wrote: > > Hi All, > > > > I am looking at implementing a ipa vrp pass. Jan Hubicka also talks > > about this in 2013 GNU Cauldron as one of the optimization he would like > > to see in gcc. So my question is, is any one implementing it. If not we > > would like to do that. > > > > I also looked at the ipa-cp implementation to see how this can be done. > > Going by this, one of the way to implement this is (skipping all the > > details): > > > > - Have an early tree-vrp so that we can have value ranges for parameters > > at call sites. > > I'd rather use the IPA analysis phase for this and use a VRP algorithm > that doesn't require ASSERT_EXPR insertion. Another potential use of value ranges is the profile estimation. http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf It seems to me that we may want to have something that can feed sane loop bounds for profile estimation as well, and we can easily store the known value ranges to SSA name annotations. So I think a separate local pass to compute value ranges (perhaps with less accuracy than full-blown VRP) is desirable. > > > - Create jump functions that captures the value ranges of call sites > > propagate the value ranges. In 2013 talk, Jan Hubicka talks about > > > > - Modifying ipa-prop.[h|c] to handles this but wouldn't it be easier to > > have its own and much more simpler implementation ? > > No idea. I think the ipa-prop.c probably won't need any significant changes. The code basically analyzes what values are passed through the function, and this works for constants as well as for intervals. In fact ipa-cp already uses the same ipa-prop analysis for 1) constant propagation 2) alignment propagation 3) propagation of known polymorphic call contexts. So replacing 1) by value range propagation should be easily doable. I would also like to replace alignment propagation by bitwise constant propagation (i.e. 
propagating what bits are known to be zero and what bits are known to be one). We already do have bitwise CCP, so we could deal with this basically in the same way as we deal with value ranges. ipa-prop could use a bit of cleaning up and modularizing that I hope will be done next stage1 :) > > > - Once we have the value ranges for parameter/return values, we could > > rely on tree-vrp to use this and do the optimizations > > Yep. IPA transform phase should annotate parameter default defs with > computed ranges. Yep, in addition we will end up with known value ranges stored in aggregates; for that we need a better separate representation. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930 > > > Does this make any sense? Any thoughts/suggestions to work on this is > > highly appreciated. > > IPA alignment propagation should already be somewhat similar as in doing > an intersection step during propagation. The merge operation is the same as VRP does, but yep, the IPA alignment code is a good one to follow for a start. Martin implemented it in a somewhat limited way so it never performs cloning, but we can deal with that incrementally. > > Richard. > > > Thanks, > > Kugan
Re: help with PR69133
> On Mon, Jan 18, 2016 at 10:28 AM, Prathamesh Kulkarni >wrote: > > On 17 January 2016 at 14:56, Prathamesh Kulkarni > > wrote: > >> Hi, > >> I was having a look at PR69133. > >> It appears that with -flto-partition=none, > >> cgraph_node::get_untransformed_body () > >> is called twice for node with asm_name _ZThn4_N11xercesc_3_11C5m_fn6ERKi. > >> c++filt says it is: non-virtual thunk to xercesc_3_1::C::m_fn6(int const&) > >> > >> get_untransformed_body() calls > >> lto_free_function_in_decl_state_for_node() which sets > >> node->lto_file_data to NULL. > >> Now the 2nd time get_untransformed_body is called for the same node, > >> node->lto_file_data is NULL and lto_get_decl_name_mapping() dereferences > >> lto_file_data which results in a segfault. > >> > >> I was wondering if gating on lto_file_data could be a reasonable solution ? > >> if (!lto_file_data) > >> return false; > >> I am not sure what value to return for this case (I chose false > >> because of early exit). > >> It prevents the ICE, and the patch passes bootstrap+test on > >> x86_64-unknown-linux-gnu. > >> With partitioning enabled, for the above test-case, > >> get_untransformed_body () is called only once per node. However I > >> don't understand why it gets called twice with -flto-partition=none. > > Kugan pointed out to me the test-case doesn't ICE by passing -fno-ipa-icf. > > It looks to me we are not supposed to call this twice on the same node > but nothing guards against that (and we don't release the untransformed > body either). -flto-partition=none is special here because it combines WPA You are supposed to read every function body just once from the LTO stream. Doing it multiple times is a waste. 
You can call get_body and get_untransformed_body as many times as you want, as there is a guard:

bool
cgraph_node::get_untransformed_body (void)
{
  lto_file_decl_data *file_data;
  const char *data, *name;
  size_t len;
  tree decl = this->decl;

  if (DECL_RESULT (decl))
    return false;

Once you read the body, DECL_RESULT is set. The only way you can get past this to ICE should be the case where you read the body and then throw it away.

> and late stage. IPA ICF is the only IPA pass needing untransformed bodies > during WPA.

That is not exactly true. We also read untransformed bodies when merging profiles. I will check the PR. ipa-icf is not supposed to get the body of thunks. Honza
Re: ipa vrp implementation in gcc
> On Mon, Jan 18, 2016 at 12:00 AM, Kugan >wrote: > > Hi, > > > >> Another potential use of value ranges is the profile estimation. > >> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf > >> It seems to me that we may want to have something that can feed sane loop > >> bounds for profile estimation as well and we can easily store the known > >> value ranges to SSA name annotations. > >> So I think separate local pass to compute value ranges (perhaps with less > >> accuracy than full blown VRP) is desirable. > > > > Thanks for the reference. I am looking at implementing a local pass for > > VRP. The value range computation in tree-vrp is based on the above > > reference and uses ASSERT_EXPR insertion (I understand that you posted > > the reference above for profile estimation). As Richard mentioned in his > > reply, the local pass should not rely on ASSERT_EXPR insertion. > > Therefore, do you have any specific algorithm in mind (i.e. Any > > published paper or reference from book)?. Of course we can tweak the > > algorithm from the reference above but would like to understand what > > your intentions are. > > I have (very incomplete) prototype patches to do a dominator-based > approach instead (what is referred to downthread as non-iterating approach). > That's cheaper and is what I'd like to provide as a "utility style" interface > to things like niter analysis which need range-info based on a specific > dominator (the loop header for example). In general, given that we have an existing VRP implementation, I would suggest first implementing the IPA propagation and profile estimation bits using the existing VRP pass and then trying to compare the simple dominator-based approach with the VRP we have and see what the compile time/code quality effects of both are. Based on that we can decide how complex a VRP we really want. 
It will probably also be more fun to implement it this way :) I plan to collect some data on early VRP and Firefox today or tomorrow. Honza > > Richard. > > >> I think the ipa-prop.c probably won't need any significant changes. The > >> code basically analyzes what values are passed through the function and > >> this works for constants as well as for intervals. In fact ipa-cp already > >> uses the same ipa-prop analysis for > >> 1) constant propagation > >> 2) alignment propagation > >> 3) propagation of known polymorphic call contexts. > >> > >> So replacing 1) by value range propagation should be easily doable. > >> I would also like to replace alignment propagation by bitwise constant > >> propagation (i.e. propagating what bits are known to be zero and what > >> bits are known to be one). We already do have bitwise CCP, so we could > >> deal with this basically in the same way as we deal with value ranges. > >> > >> ipa-prop could use a bit of cleaning up and modularizing that I hope will > >> be done next stage1 :) > > > > We (Myself and Prathamesh) are interested in working on LTO > > improvements. Let us have a look at this. > > > >>> > - Once we have the value ranges for parameter/return values, we could > rely on tree-vrp to use this and do the optimizations > >>> > >>> Yep. IPA transform phase should annotate parameter default defs with > >>> computed ranges. > >> > >> Yep, in addition we will end up with known value ranges stored in > >> aggregates > >> for that we need a better separate representation > >> > >> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930 > >>> > > > > Thanks, > > Kugan
Re: r235766 incomplete?
> On Mon, 2016-05-02 at 11:50 -0600, Martin Sebor wrote: > > Hi Jan, > > > > I just noticed the compilation errors in the attached file with > > the latest trunk. It seems as though your recent patch below may > > be incomplete: > > > >commit 46e5dccc6f188bd0fd5af4e9778f547ab63c9cae > >Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4> > >Date: Mon May 2 16:55:56 2016 + > > > > The following change causes compilation errors due to > > ipa_find_agg_cst_for_param taking just three arguments, while it > > is being called with four. (I haven't looked into the other error.) > > > > Regards > > Martin > > > > --- a/gcc/ipa-inline-analysis.c > > +++ b/gcc/ipa-inline-analysis.c > > @@ -850,7 +850,8 @@ evaluate_conditions_for_known_args (struct > > cgraph_node *node > > , > >if (known_aggs.exists ()) > > { > >agg = known_aggs[c->operand_num]; > > - val = ipa_find_agg_cst_for_param (agg, c->offset, c > > ->by_ref); > > + val = ipa_find_agg_cst_for_param (agg, > > known_vals[c->operand_num], > > + c->offset, c > > ->by_ref); > > I saw this too (with r235766). I believe it's fixed by r235770 and > r235771: > > 2016-05-02 Jan Hubicka <hubi...@ucw.cz> > > * cif-code.def (CIF_THUNK): Add. > * ipa-inline-analsysis.c (evaluate_conditions_for_known_args): Revert > accidental change. > > (albeit with a typo in that second filename) Uh, thanks. Will fix that with the next commit. I managed to accidentally bundle unrelated changes into the patch. My apologies for that. Will try to keep my commit tree clean. Honza
Re: How to check if two symbols are from same source unit during WPA ?
> On Thu, 14 Apr 2016, Ramana Radhakrishnan wrote: > > > > > > > > > What happens in practice? GCC doesn't put functions in random > > > partitions. > > > > > > > The data goes into a separate partition AFAIU - it means that all data > > accesses are as though they are extern references, which means there's > > not necessarily any CSE'ing ability that's available with section > > anchors. > > No, they are added to partitions referencing them. Actually they > end up in the first partition that references them. Yeah, this can easily be tuned to distribute variables last and put them into the partition referencing them most often. Balanced partitioning tries to preserve source order (unless FDO tells otherwise) and thus we could add anchors into the cost metrics it uses to split partitions. > > > >> If it's not desired by default could we gate it on an option ? > > >> AFAIU, the section anchors optimization is important for ARM and AArch64. > > > > > > For code size or for performance? I wonder why section anchors cannot > > > be "implemented" using some special relocations and thus as a linker > > > optimization. > > > > For performance (and probably code size too) as you can end up CSE'ing > > the anchor point. The difference in performance with -flto-partitions=1 > > is visible on quite a few of the spec2k benchmarks. I don't remember > > which ones immediately but yeah it makes a difference. > > Yeah, as said elsewhere, for things like spec2k we should have been > less aggressive with the goal to parallelize and build larger > partitions for small programs. Thus increase lto-min-partition > up to a point where all benchmarks end up in a single partition > (small benchmarks, so spec2k6 doesn't count at all). After all > our inlining limits only start with large-unit-insns as well, > which is 10 times lto-min-partition... Well, I don't think large-unit-insns is particularly related to lto-min-partition.
But lto-min-partition was not really tuned at all (I remember I dumped the value for some small spec2k benchmark and based it on that). We could do some experiments. But for larger programs we want a more stable solution. Read/write globals are not that common, but readonly globals are common in modern and large codebases, too. A possibility is to assign anchors at WPA time, disabling any chance for the variables to be optimized out later. Honza > > Richard.
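To make the discussion above concrete, here is a minimal sketch (not from the thread; the variable and function names are invented) of the access pattern section anchors help with:

```c
/* Several file-scope globals that are accessed together.  With
   -fsection-anchors (used on targets such as ARM, AArch64 and Power),
   the compiler can emit one anchor symbol covering the block of
   globals and address each variable as anchor+offset, so the base
   address is materialized once and CSE'd instead of being computed
   separately for every variable. */
int hits;
int misses;
int total;

void record (int hit)
{
  if (hit)
    hits++;
  else
    misses++;
  total++;   /* reuses the same anchor base as the updates above */
}
```

Whether the anchor base can actually be shared then depends on the variables ending up in the same LTO partition as their users, which is the tuning problem discussed above.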
Re: [RFC] noipa attribute (was Re: How to avoid constant propagation into functions?)
> Hi! > > So here is a proof of concept of an attribute that disables inlining, > cloning, ICF, IPA VRP, IPA bit CCP, IPA RA, pure/const/throw discovery. > Does it look reasonable? Anything still missing? I think you also want to disable the optimizations we do for local functions (stack alignment propagation, calling convention changes), IPA-SRA, IPA-PTA and possibly ipa-split. If you want it to be bi-directional, it is also necessary to prevent inlining into the function with the attribute and similarly to avoid other results of IPA analysis. Honza > No testsuite coverage yet, I bet we'd want to check for all those opts and > see that they aren't happening if the attribute is present. > > 2016-12-15 Jakub Jelinek> > * attribs.c (decl_attributes): Imply noinline, noclone, no_icf and > used attributes for noipa attribute. For naked attribute use > lookup_attribute first before lookup_attribute_spec. > * final.c (rest_of_handle_final): Disable IPA RA for functions with > noipa attribute. > * cgraph.c (cgraph_node::get_availability): Return AVAIL_INTERPOSABLE > for functions with noipa attribute. > * doc/extend.texi: Document noipa function attribute. > c-family/ > * c-attribs.c (c_common_attribute_table): Add noipa attribute. > (handle_noipa_attribute): New function. > > --- gcc/attribs.c.jj 2016-10-31 13:28:05.0 +0100 > +++ gcc/attribs.c 2016-12-15 10:50:54.809973594 +0100 > @@ -404,8 +404,8 @@ decl_attributes (tree *node, tree attrib > those targets that support it. 
*/ >if (TREE_CODE (*node) == FUNCTION_DECL >&& attributes > - && lookup_attribute_spec (get_identifier ("naked")) > - && lookup_attribute ("naked", attributes) != NULL) > + && lookup_attribute ("naked", attributes) != NULL > + && lookup_attribute_spec (get_identifier ("naked"))) > { >if (lookup_attribute ("noinline", attributes) == NULL) > attributes = tree_cons (get_identifier ("noinline"), NULL, attributes); > @@ -414,6 +414,26 @@ decl_attributes (tree *node, tree attrib > attributes = tree_cons (get_identifier ("noclone"), NULL, attributes); > } > > + /* A "noipa" function attribute implies "noinline", "noclone", "no_icf" and > + "used" for those targets that support it. */ > + if (TREE_CODE (*node) == FUNCTION_DECL > + && attributes > + && lookup_attribute ("noipa", attributes) != NULL > + && lookup_attribute_spec (get_identifier ("noipa"))) > +{ > + if (lookup_attribute ("noinline", attributes) == NULL) > + attributes = tree_cons (get_identifier ("noinline"), NULL, attributes); > + > + if (lookup_attribute ("noclone", attributes) == NULL) > + attributes = tree_cons (get_identifier ("noclone"), NULL, attributes); > + > + if (lookup_attribute ("no_icf", attributes) == NULL) > + attributes = tree_cons (get_identifier ("no_icf"), NULL, attributes); > + > + if (lookup_attribute ("used", attributes) == NULL) > + attributes = tree_cons (get_identifier ("used"), NULL, attributes); > +} > + >targetm.insert_attributes (*node, ); > >for (a = attributes; a; a = TREE_CHAIN (a)) > --- gcc/final.c.jj2016-11-25 09:49:47.0 +0100 > +++ gcc/final.c 2016-12-15 11:38:10.660080949 +0100 > @@ -4473,7 +4473,8 @@ rest_of_handle_final (void) >assemble_start_function (current_function_decl, fnname); >final_start_function (get_insns (), asm_out_file, optimize); >final (get_insns (), asm_out_file, optimize); > - if (flag_ipa_ra) > + if (flag_ipa_ra > + && !lookup_attribute ("noipa", DECL_ATTRIBUTES > (current_function_decl))) > collect_fn_hard_reg_usage (); >final_end_function (); > > --- 
gcc/cgraph.c.jj 2016-12-07 20:10:16.0 +0100 > +++ gcc/cgraph.c 2016-12-15 12:20:11.449481168 +0100 > @@ -2255,7 +2255,8 @@ cgraph_node::get_availability (symtab_no > avail = AVAIL_AVAILABLE; >else if (transparent_alias) > ultimate_alias_target (, ref); > - else if (lookup_attribute ("ifunc", DECL_ATTRIBUTES (decl))) > + else if (lookup_attribute ("ifunc", DECL_ATTRIBUTES (decl)) > +|| lookup_attribute ("noipa", DECL_ATTRIBUTES (decl))) > avail = AVAIL_INTERPOSABLE; >else if (!externally_visible) > avail = AVAIL_AVAILABLE; > --- gcc/doc/extend.texi.jj2016-12-15 11:26:07.0 +0100 > +++ gcc/doc/extend.texi 2016-12-15 12:19:32.738996533 +0100 > @@ -2955,6 +2955,15 @@ asm (""); > (@pxref{Extended Asm}) in the called function, to serve as a special > side-effect. > > +@item noipa > +@cindex @code{noipa} function attribute > +Disable interprocedural optimizations between the function with this > +attribute and its callers, as if the body of the function is not available > +when optimizing callers and the callers are unavailable when optimizing > +the body. This attribute implies @code{noinline}, @code{noclone}, >
Re: Worse code after bbro?
> On 01/04/2017 03:46 AM, Segher Boessenkool wrote: > >On Wed, Jan 04, 2017 at 10:05:49AM +0100, Richard Biener wrote: > >>>The code size is identical, but the trunk version executes one more > >>>instruction every time the loop runs (explicit jump to .L5 with trunk vs > >>>fallthrough with 4.8) - it's faster only if the loop never runs. This > >>>happens irrespective of the memory clobber inline assembler statement. > > > >With -Os you've asked for smaller code, not faster code. > > > >All of the block reordering is based on heuristics -- there is no polynomial > >time and space algorithm to do it optimally, let alone the linear time and > >space we need in GCC -- so there always will be cases we do not handle > >optimally. -Os does not get as much attention as -O2 etc., as well. > > > >OTOH this seems to be a pretty common case that we could handle. Please > >open a PR to keep track of this? > I superficially looked at this a little while ago and concluded that it's > something we ought to be able to handle. However, it wasn't critical enough > to me to get familiar enough with the bbro code to analyze it deeply -- thus I > put it into my gcc-8 queue. The heuristics should handle such a simple case just fine. I guess some bug crept in over the years. > > > > > >>I believe that doing BB reorder in CFG layout mode is fundamentally > >>flawed but I guess it's wired up so that out-of-CFG layout honors > >>EDGE_FALLTHRU. > > > >Why is this fundamentally flawed? The reordering is much easier this way. > Agreed (that we ought to be doing reordering in CFG layout mode). In fact cfglayout was invented to implement bb-reorder originally :) Honza > > Jeff
Re: Question about inlining and strict_aliasing
> Hello. > > I've been working on a patch that would cope with target and optimization > (read PerFunction) options > in a proper way. I came to the following test-case (slightly modified > ./gcc/testsuite/gcc.c-torture/execute/alias-1.c): > > int val; > > int *ptr = &val; > float *ptr2 = (float *) &val; > > static void > __attribute__((always_inline, optimize ("-fno-strict-aliasing"))) > typepun () > { > *ptr2=0; > } > > int main() > { > *ptr=1; > typepun (); > if (*ptr) > __builtin_abort (); > } > > $ gcc -O2 /tmp/always-inline.c -fdump-tree-all-details && ./a.out > Aborted (core dumped) > > The problem is that einline does: > > Inlining typepun into main (always_inline). >Estimating body: typepun/3 >Known to be false: not inlined >size:2 time:2 > Dropping flag_strict_aliasing on main:4 << here is strict_aliasing drop > Accounting size:2.00, time:2.00 on predicate:(true) > > However fre1 still does not check for flag_strict_aliasing. Is it a bug or not? > I did an experimental fix, but I guess there will be multiple places where > the > flag is checked. Aha, I think the problem is that once we modify the strict aliasing of cfun, we need to re-set the global optimization attributes, because those are still copies from the original declaration. > > Second question: > The current ipa-inline.c code contains quite sophisticated > check_{match,maybe_up,maybe_down} checks and > it's followed by: > > /* When user added an attribute to the callee honor it. */ > else if (lookup_attribute ("optimize", DECL_ATTRIBUTES (callee->decl)) > && opts_for_fn (caller->decl) != opts_for_fn (callee->decl)) > { > e->inline_failed = CIF_OPTIMIZATION_MISMATCH; > inlinable = false; > } > > I think it's very strict, as when one uses __attribute__((optimize)) with a > param that matches the > command line arguments, it will still return false. What if we remove the hunk? I am not quite sure. 
If you have function A with an optimization attribute and you inline it into function B with the same command-line parameters, it may enable inlining it into function C, where A would not be inlined because of the explicit optimization attribute. If we relax this check, we will need to copy the optimization attribute from A to B at the time we inline into B. > > Thanks, > Martin > diff --git a/gcc/tree-ssa-alias.c b/gcc/tree-ssa-alias.c > index 871fa121fd0..da38e4fab69 100644 > --- a/gcc/tree-ssa-alias.c > +++ b/gcc/tree-ssa-alias.c > @@ -986,7 +986,7 @@ ncr_compar (const void *field1_, const void *field2_) > static bool > nonoverlapping_component_refs_p (const_tree x, const_tree y) > { > - if (!flag_strict_aliasing > + if (!opts_for_fn (current_function_decl)->x_flag_strict_aliasing This should not be necessary. Einline ought to update the global opts. I suppose that when the decl being modified is cfun's, one needs to reset current_function_decl to NULL and then set it again so the machinery that changes the global flags notices the change. Honza >|| !x || !y >|| TREE_CODE (x) != COMPONENT_REF >|| TREE_CODE (y) != COMPONENT_REF) > @@ -1167,7 +1167,7 @@ indirect_ref_may_alias_decl_p (tree ref1 > ATTRIBUTE_UNUSED, tree base1, > return false; > >/* Disambiguations that rely on strict aliasing rules follow. */ > - if (!flag_strict_aliasing || !tbaa_p) > + if (!opts_for_fn (current_function_decl)->x_flag_strict_aliasing || > !tbaa_p) > return true; > >ptrtype1 = TREE_TYPE (TREE_OPERAND (base1, 1)); > @@ -1334,7 +1334,7 @@ indirect_refs_may_alias_p (tree ref1 ATTRIBUTE_UNUSED, > tree base1, > return false; > >/* Disambiguations that rely on strict aliasing rules follow. */ > - if (!flag_strict_aliasing || !tbaa_p) > + if (!opts_for_fn (current_function_decl)->x_flag_strict_aliasing || > !tbaa_p) > return true; > >ptrtype1 = TREE_TYPE (TREE_OPERAND (base1, 1)); > @@ -1508,7 +1508,7 @@ refs_may_alias_p_1 (ao_ref *ref1, ao_ref *ref2, bool > tbaa_p) > >/* First defer to TBAA if possible. 
*/ >if (tbaa_p > - && flag_strict_aliasing > + && opts_for_fn (current_function_decl)->x_flag_strict_aliasing >&& !alias_sets_conflict_p (ao_ref_alias_set (ref1), >ao_ref_alias_set (ref2))) > return false;
GNU Tools Cauldron 2017, Prague, September 8-10
Hi, we are very pleased to invite you all to the GNU Tools Cauldron on 8-10 September 2017. This year we will meet again in Prague, at Charles University. Details are here: https://gcc.gnu.org/wiki/cauldron2017 As usual, please register (capacity is limited), send abstracts and ask administrivia questions to tools-cauldron-ad...@googlegroups.com. I plan to contact the GCC, GDB, binutils, CGEN, DejaGnu, newlib and glibc mailing lists. Please feel free to share with any other groups that are appropriate. Looking forward to seeing you in Prague, Honza Hubicka
GNU Tools Cauldron 2017, September 8-10, 2017
Hello, I would like to remind you of the upcoming GNU Tools Cauldron meeting in Prague, September 8-10. https://gcc.gnu.org/wiki/cauldron2017 (in 5 weeks!) At present we have 83 registered participants and the room capacity is a bit over 100 seats. If you want to participate, please register soon. We also have 35 registered proposals for talks and BoFs, which will make for a busy 3 days. (Those which were submitted with a title and abstract are now available at https://gcc.gnu.org/wiki/cauldron2017#Abstracts) Because there will be two tracks, we are still open to last-minute talk and BoF proposals. If you have a topic that you would like to present, please submit an abstract describing what you plan to present. We are accepting three types of submissions: Prepared presentations: demos, project reports, etc. BoFs: coordination meetings with other developers. Tutorials for developers. No user tutorials, please. To register your abstract, send e-mail to tools-cauldron-ad...@googlegroups.com Your submission should contain the following information: Title: Authors: Abstract: If you intend to participate, but not necessarily present, please let us know as well. Send a message to tools-cauldron-ad...@googlegroups.com stating your intent to participate. Please indicate your affiliation, dietary requirements and t-shirt size. Looking forward to seeing you soon in Prague, Jan
Re: .bad_compare in bootstrap of head
Hi, I found the problem. It was pretty obvious - we compute the sum of times twice: once when computing statement sizes and a second time by summing the summaries. Because sreal arithmetic is not associative, the two orders lead to slightly different results. I have committed the following patch. Incrementally I will drop the code duplication. Bootstrapped/regtested x86_64-linux Honza --- trunk/gcc/ipa-inline-analysis.c 2017/05/07 19:41:09 247728 +++ trunk/gcc/ipa-inline-analysis.c 2017/05/07 22:21:05 247729 @@ -3119,12 +3120,13 @@ info->size = info->self_size; info->stack_frame_offset = 0; info->estimated_stack_size = info->estimated_self_stack_size; - if (flag_checking) -{ - inline_update_overall_summary (node); - gcc_assert (!(info->time - info->self_time).to_int () - && info->size == info->self_size); -} + + /* Code above should compute exactly the same result as + inline_update_overall_summary but because computation happens in + different order the roundoff errors result in slight changes. */ + inline_update_overall_summary (node); + gcc_assert (!(info->time - info->self_time).to_int () + && info->size == info->self_size); }
Re: .bad_compare in bootstrap of head
> > On 05/04/2017 08:31 AM, Jeff Law wrote: > > >On 05/04/2017 07:26 AM, Дмитрий Дьяченко wrote: > > >>Fedora 26 x86_64 > > >>r247595 > > >> > > >>~/src/gcc_current/configure --prefix=/usr/local/gcc_current > > >>--enable-static --enable-checking=no --enable-languages=c,c++,lto > > >>--enable-plugin --disable-multilib > > >And that may be the key, I've been able to get a similar failure. Could > > >easily be the checking flags. I'll do some bisecting here. > > >jeff > > > > It's Jan's patch from April 30. > > > > commit e062e35c7ff7743aa31868370bf5cb9f8da835dd > > Author: hubicka> > Date: Sun Apr 30 15:02:11 2017 + > > I wonder how that patch can cause mismatches. Does it reproduce on one of > the compile farm machines (my x86-64 bootstrap works fine, as does ia64 on terbium > after fixing the gcc 4.1 issue yesterday). > It would be great to have -fdump-ipa-inline-details dumps of the mismatching > run. Perhaps it will be easy to spot the problem. Aha, --enable-checking=no. I will check what effect it has. Honza
Re: .bad_compare in bootstrap of head
> On 05/04/2017 08:31 AM, Jeff Law wrote: > >On 05/04/2017 07:26 AM, Дмитрий Дьяченко wrote: > >>Fedora 26 x86_64 > >>r247595 > >> > >>~/src/gcc_current/configure --prefix=/usr/local/gcc_current > >>--enable-static --enable-checking=no --enable-languages=c,c++,lto > >>--enable-plugin --disable-multilib > >And that may be the key, I've been able to get a similar failure. Could > >easily be the checking flags. I'll do some bisecting here. > >jeff > > It's Jan's patch from April 30. > > commit e062e35c7ff7743aa31868370bf5cb9f8da835dd > Author: hubicka> Date: Sun Apr 30 15:02:11 2017 + I wonder how that patch can cause mismatches. Does it reproduce on one of the compile farm machines (my x86-64 bootstrap works fine, as does ia64 on terbium after fixing the gcc 4.1 issue yesterday). It would be great to have -fdump-ipa-inline-details dumps of the mismatching run. Perhaps it will be easy to spot the problem. Honza > > PR ipa/79224 > * ipa-inline-analysis.c (dump_predicate): Add optional parameter NL. > (account_size_time): Use two predicates - exec_pred and > nonconst_pred_ptr. > (evaluate_conditions_for_known_args): Compute both clause and > nonspec_clause. > (evaluate_properties_for_edge): Evaluate both clause and > nonspec_clause. > (inline_summary_t::duplicate): Update. > (estimate_function_body_sizes): Calculate exec and nonconst > predicates > separately. > (compute_inline_parameters): Likewise. > (estimate_edge_size_and_time): Update calculation of time. > (estimate_node_size_and_time): Compute both time and nonspecialized > time. > (estimate_ipcp_clone_size_and_time): Update. > (inline_merge_summary): Update. > (do_estimate_edge_time): Update. > (do_estimate_edge_size): Update. > (do_estimate_edge_hints): Update. > (inline_read_section, inline_write_summary): Stream both new > predicates. > * ipa-inline.c (compute_uninlined_call_time): Take > uninlined_call_time > as argument. > (compute_inlined_call_time): Cleanup. > (big_speedup_p): Update. > (edge_badness): Update. 
> * ipa-inline.h (INLINE_TIME_SCALE): Remove. > (size_time_entry): Replace predicate by exec_predicate and > nonconst_predicate. > (edge_growth_cache_entry): Cache both time and nonspecialized time. > (estimate_edge_time): Return also nonspec_time. > (reset_edge_growth_cache): Update. >
Re: .bad_compare in bootstrap of head
> > I wonder how that patch can cause mismatches. Does it reproduce on one of > > the compile farm machines (my x86-64 bootstrap works fine, as does ia64 on > > terbium > > after fixing the gcc 4.1 issue yesterday). > > It would be great to have -fdump-ipa-inline-details dumps of the mismatching > > run. Perhaps it will be easy to spot the problem. > > Aha, --enable-checking=no. I will check what effect it has. It does reproduce for me. Time estimates are very slightly different between checking and non-checking builds. I am investigating why. Honza > > Honza
Re: RFC: Improving GCC8 default option settings
> On Wed, Sep 13, 2017 at 3:21 AM, Michael Clark wrote: > > > >> On 13 Sep 2017, at 1:15 PM, Michael Clark wrote: > >> > >> - https://rv8.io/bench#optimisation > >> - https://rv8.io/bench#executable-file-sizes > >> > >> -O2 is 98% perf of -O3 on x86-64 > >> -Os is 81% perf of -O3 on x86-64 > >> > >> -O2 saves 5% space on -O3 on x86-64 > >> -Os saves 8% space on -Os on x86-64 > >> > >> 17% drop in performance for 3% saving in space is not a good trade for a > >> “general” size optimisation. It’s more like executable compression. > > > > Sorry, fixed typo: > > > > -O2 is 98% perf of -O3 on x86-64 > > -Os is 81% perf of -O3 on x86-64 > > > > -O2 saves 5% space on -O3 on x86-64 > > -Os saves 8% space on -O3 on x86-64 I am a bit surprised you see only an 8% code size difference between -Os and -O3. I look at these numbers occasionally and it is usually well over a two-digit number. http://hubicka.blogspot.cz/2014/04/linktime-optimization-in-gcc-2-firefox.html http://hubicka.blogspot.cz/2014/09/linktime-optimization-in-gcc-part-3.html has 42% code segment size reduction for Firefox and 19% for libreoffice Honza
Re: RFC: Improving GCC8 default option settings
> >I don't see static profile prediction as very useful here to find > >"really > >hot code" - neither in the current implementation nor in the future. The problem of > >-O2 is that we kind of know that only 10% of the code somewhere matters for > >performance but we have no way to reliably identify it. > > It's hard to do better than statically look at (ipa) loop depth. But > shouldn't that be good enough? Only if you assume that you have the whole program and understand indirect calls. There are some stats on this here: http://ieeexplore.ieee.org/document/717399/ It shows that propagating a static profile across the whole program (which is just a tiny bit more fancy than counting loop depth) sort of works statistically. I really do not have very high hopes of this reliably working in a production compiler. We already have PRs for single-function benchmarks where a deep loop nest is used in initialization or so and the actual hard-working part has a small loop nest and gets identified as cold. As soon as you start propagating in a whole-program context, such local mistakes will become more common. > > > > >It would make sense to have less aggressive vectorization at -O2 and > >more at > >-Ofast/-O3. > > We tried that but the runtime effects were not offsetting the compile time > cost. Yep, I remember that. Honza
Re: RFC: Improving GCC8 default option settings
> On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek wrote: > > On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote: > >> On its own -O3 doesn't add much (some loop opts and slightly more > >> aggressive inlining/unrolling), so whatever it does we > >> should consider doing at -O2 eventually. > > > > Well, -O3 adds vectorization, which we don't enable at -O2 by default. > > As said, -fprofile-use enables it so -O2 should eventually do the same > for "really hot code". I don't see static profile prediction as very useful here to find "really hot code" - neither in the current implementation nor in the future. The problem of -O2 is that we kind of know that only 10% of the code somewhere matters for performance but we have no way to reliably identify it. It would make sense to have less aggressive vectorization at -O2 and more at -Ofast/-O3. Adding -Os and -Oz would make sense to me - even with hot/cold info it is not desirable to optimize as aggressively for size as we do, because mistakes happen and one does not want to make code paths 1000 times slower to save one byte of the binary. We could handle this gracefully internally by having logic for "known to be cold" and "guessed to be cold". The new profile code can make a difference here. Honza > > Richard. > > > Jakub
Re: GNU Tools Cauldron 2017: OMP (Offloading and Multi Processing) BoF
> Hi! > > As Honza told me recently, it has been proposed by Martin -- I don't know > which one ;-) -- and certainly makes sense, to have another OMP > (Offloading and Multi Processing) BoF at the GNU Tools Cauldron 2017, > which is currently scheduled for Saturday, 11:00 to 11:45. That is, a > general OMP BoF session in addition to more specific individual > presentations. > > Honza volunteered me to host it ;-) -- to which I'm not immediately To be precise, I also volunteered Martin (Jambor) to propose it. In any case it seems it would be a useful thing to have on the programme. Let me know if I should keep it there ;) Honza > objecting, but my time to prepare anything at all is very limited > (non-existing?) right now, and I still have to first organize stuff/myself > for my other presentation slot... > > We could do it off the cuff, but it'd also be very useful to know > beforehand a handful of topics to talk about. So, who's interested in > contributing to this OMP BoF? I put some candidates in CC, but > everyone's welcome to contribute, of course! I suggest this should > really be BoF-like: more interaction and discussion instead of too much > presentation. > > > Regards > Thomas
Slides from GNU Tools Cauldron
Hello, all but one of the videos from this year's Cauldron have been edited and are now linked from https://gcc.gnu.org/wiki/cauldron2017 (the plugins BoF will appear by the end of the week). I would also like to update the page with links to the slides. If someone beats me to this and adds some or all of them as attachments to the page, I would be very happy :) Honza
Re: Vectorizer size preferences (-mprefer-* on x86)
> > Hi, > > I'm in the process of changing the vectorizer to consider all > vector sizes as advertised by targetm.autovectorize_vector_sizes > and to decide which one to use based on its cost model. > > I expect that to make sense for example when choosing between > AVX128 and AVX256, since the latter may have penalties for > cross-lane operations which the former lacks. > > Now while the option -mprefer-* and the documentation (to > some extent) suggest that the user communicates a preference, > the actual implementation in the x86 backend uses -mprefer-* > to disable the use of larger vector sizes. That also makes > targetm.preferred_simd_mode somewhat redundant. > > In the light of using the cost model to decide on the vector > width used, how should we go forward here? > > I would suggest to more clearly document that -mprefer-* > will disable the use of larger vector sizes and that smaller > vector sizes might be used if costs tell us to do so (or loops > do not roll enough). Implementation-wise the size chosen > by targetm.preferred_simd_mode could be used as a tie-breaker > in case equal costs are computed. I suppose one useful meaning of -mprefer-* would be to override the decision heuristics and only consider the mode specified (that is what we do at the moment anyway, right?). So one can use it in case your heuristic makes the wrong choice. But disabling only bigger vector sizes makes sense to me as well. Honza > > In the (distant?) future we might want to support mixed-size > vectorization which makes all this even more complicated. > Still having -mprefer-* disabling vector sizes makes sense > here? > > Thanks, > Richard.
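For context, this is the kind of loop where the vector-size decision discussed above matters; the function is an invented example, and the flag name is the x86 one mentioned in the thread:

```c
/* A simple dot product.  With AVX2 available the vectorizer may choose
   256-bit vectors; an option such as -mprefer-avx128 restricts it to
   128-bit ones, which can win when cross-lane operations make the
   wider vectors more expensive.  (A floating-point reduction like
   this is only vectorized under -ffast-math/-Ofast, since reassociating
   the sum changes rounding.)  The scalar semantics are the same
   whichever vector size is picked. */
float dot (const float *x, const float *y, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += x[i] * y[i];
  return sum;
}
```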
Re: bootstrap failure due to declaration mismatch in r260956
> On 05/30/2018 12:27 PM, Gerald Pfeifer wrote: > >On Wed, 30 May 2018, Martin Sebor wrote: > >>I think your r260956 is missing the following hunk: > > > >If this fixes the bootstrap for you (also ran into this myself > >just now), can you please go ahead and commit? > > > >We can always sort out things later, if there are details to be > >tweaked, but fixing a bootstrap failure with a simple one-liner > >like this, let's not get process-heavy and just do it. ;-) > > Jakub already committed the missing change in r260970 so bootstrap > should be working again. I apologize for that. I left the svn commit waiting for a commit log entry and did not notice that :( Thanks for fixing it! Honza > > Thanks > Martin >
Re: Problems in IPA passes
On 2017-10-28 09:28, Jeff Law wrote: Jan, What's the purpose behind calling vrp_meet and extract_range_from_unary_expr from within the IPA passes? AFAICT that is not safe to do. Various paths through those routines will access static objects within tree-vrp.c which may not be initialized when IPA runs (vrp_equiv_obstack, vr_value). While this seems to be working today, it's a failure waiting to happen. Is there any way you can avoid using those routines? I can't believe you really need all the complexity of those routines, particularly extract_range_from_unary_expr. Plus it's just downright fugly from a modularity standpoint. Jeff We now have three value range propagation passes: the original VRP, early VRP, which is faster, and IPA-CP, which does the IPA propagation. It seems sane to me for those three to share some infrastructure, but I did not look into these in enough detail to know if what we have right now is a sane thing to do :) The VRP passes were introduced by Kugan and this part was mostly reviewed by Richi, so I am adding them to CC. The refactoring happened in https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00891.html Honza
Re: Resolving LTO symbols for static library boundary
On 2018-02-05 18:44, Richard Biener wrote: On February 5, 2018 12:26:58 PM GMT+01:00, Allan Sandfeld Jensen wrote: Hello GCC In trying to make it possible to use LTO for distro builds of Qt, I have again hit the problem of static libraries. In Qt in general we rely for LTO on a library boundary, where LTO gets resolved when generating the library but no LTO symbols are exported in the shared library. This ensures the library has a well-defined binary-compatible interface and gets LTO optimizations within each library. For some private libraries we use static libraries however, because we don't need binary compatibility. But though we don't need BC between Qt versions, the libraries should still be linkable with different gcc versions (and with different compilers). However, when LTO is enabled the static libraries will contain definitions that depend on a single gcc version, making them unsuitable for distribution. One solution is to enable fat LTO object files for static libraries, but that is both a waste of space and compile time, and disables any LTO optimization within the library. Ideally I would like to have the static library do LTO optimizations internally just like a shared library, but then be exported as a static library. I suspect this is more of a gcc task than an ar/ld task, as it basically boils down to gcc doing for a static library what it does for a shared library, but maybe exporting the result as a single combined .o file, which can then be ar'ed into a compatible static library. Is this possible? Hmm. I think you could partially link the static archive contents into a single relocatable object. Or we could add a mode where you do a 1-to-1 LTO link of the objects and stop at the ltrans object files. You could stuff those into an archive again. I'm not sure how far Honza got partial LTO linking to work? Partial linking of LTO .o files into a single non-LTO .o file should work, and it will get you cross-module optimization. 
The problem is that without resolution info the compiler needs to assume that all symbols exported by the object files are possibly referenced by the later incremental link, and thus the code quality will definitely not be comparable with what you get for LTO on a final binary or DSO. Still, it should be better than a non-LTO build. I would be curious whether it is useful for you in practice.

Honza

Richard.

Best regards, Allan Jensen
Re: How big (and fast) is going to be GCC 8?
Hello, I have also re-done most of my firefox testing, similar to the ones I published at http://hubicka.blogspot.cz/2014/04/linktime-optimization-in-gcc-2-firefox.html (thanks to Martin Liska who got LTO builds to work again). I am attaching statistics on binary sizes. Interestingly, for firefox LTO is quite a good size optimization (16% on text), and similarly FDO reduces text size and combines well with LTO, which is a bit different from Martin's gcc stats. I have looked into this very briefly and one issue seems to be with the way we determine the hot/cold threshold.

binary size           text      relocations  data      EH        rest
gcc6 -O3              90448658  12887358     13720073  13035704  257839
gcc6 -O3 -flto        75810786  12145211     12390185   8422776  240002
gcc6 -O3 + FDO        67087824  13008294     13655305  13719944  259585
gcc6 -O3 -flto + FDO  60206898  12169803     12334113   9083088  240050
gcc7 -O3              93233440  12928831     13780313  13578224  257408
gcc7 -O3 -flto        76764274  12128031     12405369   8420448  240662
gcc7 -O3 + FDO        67500688  12994279     13650185  13661760  263400
gcc7 -O3 -flto + FDO  59776994  12151360     12325217   8971344  239501
gcc8 -O2              80311120  12939568     13763033  12948752  258711
gcc8 -O2 -flto        69156752  12109236     12475801   8501152  240163
gcc8 -O3              89913648  12924468     13790393  13374328  256867
gcc8 -O3 -flto        75971122  12138528     12426649   8593024  239861
gcc8 -O3 + FDO        67047632  12996890     13707017  13146232  263413
gcc8 -O3 -flto + FDO  58951410  12146008     12377161   8634152  241765

I also did some builds with clang. The observation is that clang's -O3 binary is smaller than ours, while our LTO/FDO builds are smaller than clang's (the LTO+FDO build quite substantially). Our EH is bigger than clang's, which is probably something to look into. One problem I am aware of is that our nothrow pass is not type sensitive and thus won't figure out if a program throws an exception of a specific type and catches it later.
clang6 -O3                 84754848  13032018  13597433  10791528  371429
clang6 -O3 -flto           90757024  12273574  12258521   6841424  350585
clang6 -O3 -flto=thin      92940576  12376724  12479233   7974856  353171
clang6 -O3 + FDO           81776880  13136428  13574489  11501344  385123
clang6 -O3 -flto=thin+FDO  88374432  12405075  12434297   9574416  356508
clang6 -O3 -flto + FDO     90637168  12288433  12244265   9023304  349078

I also did some benchmarking and found at least one issue: with -flto -O3 we hit --param inline-unit-growth a bit too early, so we do not get much benefit (while clang does, but it also does not reduce binary size). -O3 -flto + FDO or -O2 -flto seems to work well. Will summarize the results later. Firefox developer Tom Ritter has tested LTO with FDO and without here (it is a rather nice interface - I like that one can click on the graph and see the results in the context of other tests done recently). This is done with gcc6. Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?format=default=521435 non-FDO build: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central=try=12ce14a5bcac9975b41a1f901bfc3a8dcb2d791b=1=1=172800 FDO build: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central=try=7e5bd52e36fcc1703ced01fe87e831a716677295=1=1=172800 Honza
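Section-size tables like the ones above can be collected with stock binutils. This is a hedged sketch, not the script actually used in the thread; the section grouping (which sections count as "data" or "EH") is an assumption:

```shell
# Sketch: collect text / data / EH section sizes for an ELF binary,
# roughly matching the columns of the tables above.
bin=${1:-/bin/sh}   # any ELF binary; libxul.so for the Firefox numbers
size -A -d "$bin" | awk '
  $1 == ".text" { text = $2 }
  $1 == ".data" || $1 == ".data.rel.ro" || $1 == ".bss" { data += $2 }
  $1 == ".eh_frame" || $1 == ".eh_frame_hdr" || $1 == ".gcc_except_table" { eh += $2 }
  END { printf "text=%d data=%d EH=%d\n", text, data, eh }'
```

`size -A -d` prints one decimal size per section, which is easier to post-process than `readelf -S` output.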
Re: GSOC 2018 - Textual LTO dump tool project
> On Tue, Mar 6, 2018 at 4:02 PM, Jan Hubicka <hubi...@ucw.cz> wrote: > >> On Tue, Mar 6, 2018 at 2:30 PM, Hrishikesh Kulkarni > >> <hrishikeshpa...@gmail.com> wrote: > >> > Hi, > >> > > >> > Thank you Richard and Honza for the suggestions. If I understand > >> > correctly, > >> > the issue is that LTO file format keeps changing per compiler versions, > >> > so > >> > we need a more “stable” representation and the first step for that would > >> > be > >> > to “stabilize” representations for lto-cgraph and symbol table ? > >> > >> Yes. Note the issue is that the current format is a 1:1 representation of > >> the internal representation -- which means it is the internal > >> representation > >> that changes frequently across releases. I'm not sure how Honza wants > >> to deal with those changes in the context of a "stable" IL format. Given > >> we haven't been able to provide a stable API to plugins I think it's much > >> harder to provide a stable streaming format for all the IL details > >> > >> > Could you > >> > please elaborate on what initial steps need to be taken in this regard, > >> > and > >> > if it’s feasible within GSoC timeframe ? > >> > >> I don't think it is feasible in the GSoC timeframe (nor do I think it's > >> feasible > >> at all ...) > > > > I skipped this, with GSoC timeframe I fully agree. With feasibility at all > > not so > > much - LLVM documents its bitcode to reasonable extend > > https://llvm.org/docs/BitCodeFormat.html > > > > Reason why i mentioned it is that I would like to use this as an excuse to > > get > > things incrementally cleaned up and it would be nice to keep it in mind > > while > > working on this. > > Ok. 
> It's probably close enough to what I recommended doing with respect > to make the LTO bytecode "self-descriptive" -- thus start with making the > structure documented and parseable without assigning semantics to > every bit ;) I think that can be achieved top-down in a very incremental > way if you get the bottom implemented first (the data-streamer part).

OK :) I did not mean to document every bit either, at least not for the fancy parts. It would be nice to have cleaned up e.g. the section headers/footers so that they do not depend on endianness, and to slowly clean up similar nonsense at higher levels. So it may make sense to progress from both directions, lower-hanging fruit first. Honza
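An endianness-independent section header boils down to serializing each field byte-by-byte in a fixed order instead of dumping host-order structs. The following is only an illustration of that technique — it is not GCC's actual on-disk format, and all field names here are invented:

```c
#include <stdint.h>

/* Hypothetical section header; field names are invented for illustration. */
struct lto_section_header
{
  uint32_t magic;
  uint16_t major_version;
  uint16_t minor_version;
  uint32_t payload_size;
};

/* Explicit little-endian encoding, independent of host byte order.  */
static void put_u32 (uint8_t *p, uint32_t v)
{
  p[0] = v & 0xff; p[1] = (v >> 8) & 0xff;
  p[2] = (v >> 16) & 0xff; p[3] = (v >> 24) & 0xff;
}

static void put_u16 (uint8_t *p, uint16_t v)
{
  p[0] = v & 0xff; p[1] = (v >> 8) & 0xff;
}

static uint32_t get_u32 (const uint8_t *p)
{
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

static uint16_t get_u16 (const uint8_t *p)
{
  return (uint16_t) (p[0] | (p[1] << 8));
}

/* 12-byte fixed layout: magic, major, minor, payload_size.  */
void header_encode (const struct lto_section_header *h, uint8_t buf[12])
{
  put_u32 (buf + 0, h->magic);
  put_u16 (buf + 4, h->major_version);
  put_u16 (buf + 6, h->minor_version);
  put_u32 (buf + 8, h->payload_size);
}

void header_decode (struct lto_section_header *h, const uint8_t buf[12])
{
  h->magic = get_u32 (buf + 0);
  h->major_version = get_u16 (buf + 4);
  h->minor_version = get_u16 (buf + 6);
  h->payload_size = get_u32 (buf + 8);
}
```

A reader built this way decodes the same bytes on big- and little-endian hosts, which is the property the flattened host-order headers lack.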
Re: GSOC 2018 - Textual LTO dump tool project
> On Tue, Mar 6, 2018 at 2:30 PM, Hrishikesh Kulkarni > <hrishikeshpa...@gmail.com> wrote: > > Hi, > > > > Thank you Richard and Honza for the suggestions. If I understand correctly, > > the issue is that LTO file format keeps changing per compiler versions, so > > we need a more “stable” representation and the first step for that would be > > to “stabilize” representations for lto-cgraph and symbol table ? > > Yes. Note the issue is that the current format is a 1:1 representation of > the internal representation -- which means it is the internal representation > that changes frequently across releases. I'm not sure how Honza wants > to deal with those changes in the context of a "stable" IL format. Given > we haven't been able to provide a stable API to plugins I think it's much > harder to provide a stable streaming format for all the IL details

Well, because I think it would be good for us to formalize our IL more - document it properly, remove stuff which is not necessary, and get an API+file representation. Those things are connected to each other and will need work. If you look at how much things change, it is not very frequent that we change what is in our CFG (I changed profile this release), how gimple tuples are represented, or what gimple instructions we have. I think those parts are reasonably well defined. Even though I changed profile this release, it is a relatively localized type of change. I am still more commonly changing the symbol table as it needs to adapt to all LTO details, but I hope to be basically done. What is more in flux are trees, which we will hopefully deal with by defining gimple types, now that early debug is done. What we can realistically do now is to first aim to stream the better-defined parts in externally parseable sections which have documentation. So far the only externally parseable section is the plugin symbol table, but we should be able to do the same with reasonable effort for symbol tables, CFGs and gimple instruction streams.
In parallel we can incrementally deal with trees, mostly hopefully by getting rid of them (moving symbol names/etc. to the symbol table so it can live w/o declarations, having gimple types etc.)

> > > Could you > > please elaborate on what initial steps need to be taken in this regard, and > > if it’s feasible within GSoC timeframe ? > > I don't think it is feasible in the GSoC timeframe (nor do I think it's > feasible > at all ...) > > > Thanks! > > > > > > I am trying to break down the project into milestones for the proposal. So > > far, I have identified the following objectives: > > > > 1] Creating a separate driver, that can read LTO object files. Following > > Richard’s estimate, I’d leave around first half of the period for this task. > > > > Would that be OK ? > > Yes.

Yes, it looks good to me too.

> > Coming to 2nd half: > > > > 2] Dumping pass summaries. > > > > 3] Stabilizing lto-cgraph and symbol table. > > So I'd instead do > > 3] Enhance the user-interface of the driver > > like providing a way to list all function bodies, a way to dump > the IL of a single function body, a way to create a dot graph file > for the cgraph in the file, etc. > > Basically while there's a lot of dumping infrastructure in GCC > it may not always fit the needs of a LTO IL dumping tool 1:1 > and may need refactoring enhancement.

I would agree here - dumping pass summaries would be nice but we already have that, more or less. All IPA passes dump their summary at the beginning of their dump file and I find that relatively sufficient to deal with, mostly because summaries are quite simple. It is much harder to deal with the global stream of trees and the function bodies themselves. Honza

> > Richard.
> > > > > Thanks, > > > > Hrishikesh > > > > > > > > On Fri, Mar 2, 2018 at 6:31 PM, Jan Hubicka <hubi...@ucw.cz> wrote: > >> > >> Hello, > >> > On Fri, Mar 2, 2018 at 10:24 AM, Hrishikesh Kulkarni > >> > <hrishikeshpa...@gmail.com> wrote: > >> > > Hello everyone, > >> > > > >> > > > >> > > Thanks for your suggestions and engaging response. > >> > > > >> > > Based on the feedback I think that the scope of this project comprises > >> > > of > >> > > following three indicative actions: > >> > > > >> > > > >> > > 1. Creating separate driver i.e. separate dump tool that uses lto > >> > > object API > >> > > for reading the lto file. > >> > > >> > Yes. I expect this will take th
Re: How big (and fast) is going to be GCC 8?
> On Tue, Mar 6, 2018 at 11:12 AM, Martin Liška wrote: > > Hello. > > > > Many significant changes has landed in mainline and will be released as GCC > > 8.1. > > I decided to use various GCC configs we have and test how there > > configuration differ > > in size and also binary size. > > > > This is first part where I measured binary size, speed comparison will > > follow. > > Configuration names should be self-explaining, the 'system-*' is built done > > without bootstrap with my system compiler (GCC 7.3.0). All builds are done > > on my Intel Haswell machine. > > So from the numbers I see that bootstrap causes a 8% bigger binary compared > to non-bootstrap using GCC 7.3 at -O2 when including debug info and 1.2% > larger stripped. That means trunk generates larger code.

It is a bit odd indeed, because the size stats from the specs seem to imply otherwise. It would be nice to work that out. Also I am surprised that LTO increases text size even for the non-plugin build. It should not happen. These issues are generally hard to debug though. I will try to take a look. I will send similar stats for my firefox experiments. If you have scripts to collect them, they would be welcome. Thanks for looking into this! Honza

> > What is missing is a speed comparison of the various binaries -- you could > try measuring this by doing a make all-gcc for a non-bootstrap config > (so it uses -O2 -g and doesn't build target libs with the built compiler). > > Richard. > > > Feel free to reply if you need any explanation. > > Martin
Re: GSOC 2018 - Textual LTO dump tool project
> On Tue, Mar 6, 2018 at 2:30 PM, Hrishikesh Kulkarni > <hrishikeshpa...@gmail.com> wrote: > > Hi, > > > > Thank you Richard and Honza for the suggestions. If I understand correctly, > > the issue is that LTO file format keeps changing per compiler versions, so > > we need a more “stable” representation and the first step for that would be > > to “stabilize” representations for lto-cgraph and symbol table ? > > Yes. Note the issue is that the current format is a 1:1 representation of > the internal representation -- which means it is the internal representation > that changes frequently across releases. I'm not sure how Honza wants > to deal with those changes in the context of a "stable" IL format. Given > we haven't been able to provide a stable API to plugins I think it's much > harder to provide a stable streaming format for all the IL details > > > Could you > > please elaborate on what initial steps need to be taken in this regard, and > > if it’s feasible within GSoC timeframe ? > > I don't think it is feasible in the GSoC timeframe (nor do I think it's > feasible > at all ...)

I skipped this; with the GSoC timeframe I fully agree. With feasibility at all, not so much - LLVM documents its bitcode to a reasonable extent: https://llvm.org/docs/BitCodeFormat.html The reason why I mentioned it is that I would like to use this as an excuse to get things incrementally cleaned up, and it would be nice to keep it in mind while working on this. Honza

> > > Thanks! > > > > > > I am trying to break down the project into milestones for the proposal. So > > far, I have identified the following objectives: > > > > 1] Creating a separate driver, that can read LTO object files. Following > > Richard’s estimate, I’d leave around first half of the period for this task. > > > > Would that be OK ? > > Yes. > > > Coming to 2nd half: > > > > 2] Dumping pass summaries. > > > > 3] Stabilizing lto-cgraph and symbol table.
> > So I'd instead do > > 3] Enhance the user-interface of the driver > > like providing a way to list all function bodies, a way to dump > the IL of a single function body, a way to create a dot graph file > for the cgraph in the file, etc. > > Basically while there's a lot of dumping infrastructure in GCC > it may not always fit the needs of a LTO IL dumping tool 1:1 > and may need refactoring enhancement. > > Richard. > > > > > Thanks, > > > > Hrishikesh > > > > > > > > On Fri, Mar 2, 2018 at 6:31 PM, Jan Hubicka <hubi...@ucw.cz> wrote: > >> > >> Hello, > >> > On Fri, Mar 2, 2018 at 10:24 AM, Hrishikesh Kulkarni > >> > <hrishikeshpa...@gmail.com> wrote: > >> > > Hello everyone, > >> > > > >> > > > >> > > Thanks for your suggestions and engaging response. > >> > > > >> > > Based on the feedback I think that the scope of this project comprises > >> > > of > >> > > following three indicative actions: > >> > > > >> > > > >> > > 1. Creating separate driver i.e. separate dump tool that uses lto > >> > > object API > >> > > for reading the lto file. > >> > > >> > Yes. I expect this will take the whole first half of the project, > >> > after this you > >> > should be somewhat familiar with the infrastructure as well. With the > >> > existing dumping infrastructure it should be possible to dump the > >> > callgraph and individual function bodies. > >> > > >> > > > >> > > 2. Extending LTO dump infrastructure: > >> > > > >> > > GCC already seems to have dump infrastructure for pretty-printing tree > >> > > nodes, gimple statements etc. However I suppose we’d need to extend > >> > > that for > >> > > dumping pass summaries ? For instance, should we add a new hook say > >> > > “dump” > >> > > to ipa_opt_pass_d that’d dump the pass > >> > > summary ? > >> > > >> > That sounds like a good idea indeed. I'm not sure if this is the most > >> > interesting > >> > missing part - I guess we'll find out once a dump tool is available. 
> >> > >> Concering the LTO file format my longer term aim is to make the symbol > >> table sections (symtab used by lto-plugin as well as the callgraph > >> section) > >> and hopefully also the Gimple streams) documented and well behaving > >> without changing the format in every revision. > >> > >> On the other hand the summaries used by individual passes are intended to > >> be > >> pass specific and envolving as individula passes become stronger/new > >> passes > >> are added. > >> > >> It is quite a lot of work to stabilize gimple representation to this > >> extend, > >> For callgraph table this is however more realistic. That would mean > >> to > >> move some of existing random stuff streamed there into summaries and > >> additionaly > >> cleaning up/rewriting lto-cgraph so the on disk format actually makes > >> sense. > >> > >> I will be happy to help with any steps in this direction as well. > >> > >> Honza > > > >
Re: How can compiler speed-up postgresql database?
> On 03/21/2018 10:26 AM, Richard Biener wrote: > >On Tue, Mar 20, 2018 at 8:57 PM, Martin Liška wrote: > >>Hi. > >> > >>I did similar stats for postgresql server, more precisely for pgbench: > >>pgbench -s100 & 10 runs of pgbench -t1 -v > > > >Without looking at the benchmark probably only because it is flawed > >(aka not I/O or memory bandwidth limited). It might have some > >actual operations on data (regex code?) that we can speed up though. > > Well, it's not ideal as it tests quite simple DB with just couple of tables: > ``` > By default, pgbench tests a scenario that is loosely based on TPC-B, > involving five SELECT, UPDATE, and INSERT commands per transaction. > ``` > > Note that I had pg_data in /dev/shm and I verified that CPU utilization was > 100% on a single core. > That said, it should not be so misleading ;)

Well, it is usually easy to run perf and look at what the hot spots look like. I see similar speedups for common page loading in firefox and similar benchmarks, which is quite good. Not everything needs to be designed to be memory bound like spec. Honza

> > Martin > > > > >Richard. > > > >>Martin >
Re: GSOC 2018 - Textual LTO dump tool project
Hello, > On Fri, Mar 2, 2018 at 10:24 AM, Hrishikesh Kulkarni >wrote: > > Hello everyone, > > > > > > Thanks for your suggestions and engaging response. > > > > Based on the feedback I think that the scope of this project comprises of > > following three indicative actions: > > > > > > 1. Creating separate driver i.e. separate dump tool that uses lto object API > > for reading the lto file. > > Yes. I expect this will take the whole first half of the project, > after this you > should be somewhat familiar with the infrastructure as well. With the > existing dumping infrastructure it should be possible to dump the > callgraph and individual function bodies. > > > > > 2. Extending LTO dump infrastructure: > > > > GCC already seems to have dump infrastructure for pretty-printing tree > > nodes, gimple statements etc. However I suppose we’d need to extend that for > > dumping pass summaries ? For instance, should we add a new hook say “dump” > > to ipa_opt_pass_d that’d dump the pass > > summary ? > > That sounds like a good idea indeed. I'm not sure if this is the most > interesting > missing part - I guess we'll find out once a dump tool is available.

Concerning the LTO file format, my longer-term aim is to make the symbol table sections (the symtab used by lto-plugin as well as the callgraph section, and hopefully also the gimple streams) documented and well behaved, without changing the format in every revision. On the other hand, the summaries used by individual passes are intended to be pass specific and evolving as individual passes become stronger/new passes are added. It is quite a lot of work to stabilize the gimple representation to this extent. For the callgraph table this is however more realistic. That would mean moving some of the existing random stuff streamed there into summaries, and additionally cleaning up/rewriting lto-cgraph so the on-disk format actually makes sense. I will be happy to help with any steps in this direction as well. Honza
Enabling vectorization at -O2 for x86 generic, core and zen tuning
Hello, while running benchmarks for inliner tuning I also ran benchmarks comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin Liska's LNT setup (https://lnt.opensuse.org/). The results are summarized below, but you can also see the colorful tables produced by Martin's LNT magic: https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3_percentage_change=0.02=746f%2C55f=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10_percentage_change=0.02=746f%2C55f

Overall we got the following SPECrate improvements:

SPECfp2k6 kabylake generic +7.15%
SPECfp2k6 kabylake native +9.36%
SPECfp2k17 kabylake generic +5.36%
SPECfp2k17 kabylake native +6.03%
SPECint2k17 kabylake generic +4.13%
SPECfp2k6 zen generic +9.98%
SPECfp2k6 zen native +7.04%
SPECfp2k17 zen generic +6.11%
SPECfp2k17 zen native +5.46%
SPECint2k17 zen generic +3.61%
SPECint2k17 zen native +5.18%

The performance results seem surprisingly strongly in favor of vectorization. Martin's setup also checks code size, which goes up by as much as 26% on leslie3d, but since many of the benchmarks are small, this is not very representative of the overall code size/compile time costs of vectorization. I measured compile time/size on the larger programs I have available, with notable changes on DealII but otherwise sub-1% increases. I also benchmarked Firefox, but there are no significant differences because the build system already uses -O3 for the places where it matters (the graphics library etc.).

                  compile time  code segment size
Firefox mainline  in noise      0.8%
gcc from spec2k6  0.5%          0.6%
gdb               0.8%          0.3%
crafty            0%            0%
DealII            3.2%          4%

Note that I benchmarked -ftree-slp-vectorize separately before and the results were hit and miss, so perhaps enabling only -ftree-vectorize would give better compile time tradeoffs. I was worried about partial memory stalls, but I will benchmark it and also benchmark the difference between cost models.
There are some performance regressions, most notably in SPEC:
- exchange (all settings),
- gamess (all settings),
- calculix (Zen native only),
- bwaves (Zen native),
and induct2 on all settings and fft2 (Zen only) from Polyhedron. Botan seems very noisy, but it is rather special code. Exchange can be fixed by adding a heuristic that it is a bad idea to vectorize within a loop nest of depth 10 containing a recursive call. I believe gamess and calculix are understood, and I can look into the remaining cases. Overall I am surprised how many improvements vectorization at -O2 can bring - clearly the performance of modern, more parallel CPUs depends on it. In my experience from analyzing regressions of gcc -O2 compared to clang -O2 builds, vectorization is one of the most common reasons. Having gcc -O2 produce lower SPEC scores and comparably large binaries relative to clang -O2 does not feel OK, and I think the problem is not limited just to artificial benchmarks. Even though it is late in the release cycle, I wonder if we can do that for GCC 9? Performance of vectorization is very architecture specific; I would propose enabling vectorization for Zen, core-based chips and generic on x86-64. I can also run benchmarks on Bulldozer. I can then tune down the cheap cost model to avoid some of the more expensive transformations.
Honza

Kabylake SPEC2k6, generic tuning, improvements:
SPEC2006/FP/481.wrf -31.33%
SPEC2006/FP/436.cactusADM -28.17%
SPEC2006/FP/437.leslie3d -17.21%
SPEC2006/FP/434.zeusmp -12.90%
SPEC2006/FP/454.calculix -6.44%
SPEC2006/FP/433.milc -6.03%
SPEC2006/FP/459.GemsFDTD -4.65%
SPEC2006/FP/450.soplex -2.11%
SPEC2006/INT/403.gcc -6.54%
SPEC2006/INT/456.hmmer -5.45%
SPEC2006/INT/464.h264ref -2.23%
regressions:
SPEC2006/FP/416.gamess 8.51%
SPEC2006/FP/447.dealII 2.73%

Kabylake SPEC2k6, -march=native, improvements:
SPEC2006/FP/436.cactusADM -45.52%
SPEC2006/FP/481.wrf -34.13%
SPEC2006/FP/434.zeusmp -20.25%
SPEC2006/FP/437.leslie3d -19.44%
SPEC2006/FP/459.GemsFDTD -6.85%
SPEC2006/FP/433.milc -2.15%
SPEC2006/INT/456.hmmer -8.97%
SPEC2006/INT/403.gcc -7.07%
SPEC2006/INT/464.h264ref -3.00%
regressions:
SPEC2006/FP/416.gamess 7.97%
SPEC2006/INT/483.xalancbmk 3.55%
SPEC2006/INT/400.perlbench 2.61%

Kabylake SPEC2k17, generic tuning, improvements:
SPEC2017/INT/525.x264_r -33.24%
Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning
> > Note that I benchmarked -ftree-slp-vectorize separately before and > > results was hit/miss, so perhaps enabling only -ftree-vectorize would > > give better compile time tradeoffs. I was worried of partial memory > > stalls, but I will benchmark it and also benchmark difference between > > cost models. > > ; Alias to enable both -ftree-loop-vectorize and -ftree-slp-vectorize. > ftree-vectorize > Common Report Optimization > Enable vectorization on trees.

Thanks! I would probably have fallen into that trap and run the same set of benchmarks again. Honza

> > -- > Eric Botcazou
Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning
> On Mon, Jan 07, 2019 at 09:29:09AM +0100, Richard Biener wrote: > > On Sun, 6 Jan 2019, Jan Hubicka wrote: > > > Even though it is late in release cycle I wonder if we can do that for > > > GCC 9? Performance of vectorization is very architecture specific, I > > > would propose enabling vectorization for Zen, core based chips and > > > generic in x86-64. I can also run benchmarks on buldozer. I can then > > > tune down the cheap model to avoid some of more expensive > > > transformations. > > > > I'd rather not do this now, it's _way_ too late (also considering > > you are again doing inliner tuning so late). > > This probably should be more generic than just x86 really, we have similar > problems on Power (-O3 is almost always faster than -O2, which is bad). > Likely other archs have the same problems. > > But yes, too late for GCC 9.

Yep, I guessed so, but still wanted to ask :) I think this is similar to -fschedule-insns(2), where it is subtarget specific whether it is a win or not. So I think it is good to leave it up to the target to enable the pass - we probably have fewer targets that do want vectorizing than ones that don't. I would suggest enabling it on x86 early next stage1 and trying to do similar benchmarks on ppc and arm. We can then try to tune the code size/speed tradeoffs. Honza

> > > Segher
Re: Garbage collection bugs
> On Wed, Jan 9, 2019 at 12:48 PM Richard Biener > wrote: > > > > On Wed, Jan 9, 2019 at 10:46 AM Joern Wolfgang Rennecke > > wrote: > > > > > > We've been running builds/regression tests for GCC 8.2 configured with > > > --enable-checking=all, and have observed some failures related to > > > garbage collection. > > > > > > First problem: > > > > > > The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but > > > -std=c++11 and -std=c++14 appear to follow the same pattern) see gcc > > > garbage-collecting a live vector. A subsequent access to the vector > > > with vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is > > > 0xa5a5a5a5 . The vec data is also being freed / poisoned. The vector in > > > question is an auto-variable of cp_parser_parenthesized_expression_list, > > > which is declared as: vec *expression_list; > > > > > > According to doc/gty.texi: "you should reference all your data from > > > static or external @code{GTY}-ed variables, and it is advised to call > > > @code{ggc_collect} with a shallow call stack." > > > > > > In this case, cgraph_node::finalize_function calls the garbage collector, > > > as we are finishing a member function of a struct. gdb shows a backtrace > > > of 34 frames, which is not really much as far as C++ parsing goes. The > > > caller of finalize_function is expand_or_defer_fn, which uses the > > > expression "function_depth > 1" to compute the no_collect parameter to > > > finalize_function. > > > cp_parser_parenthesized_expression_list is in frame 21 of the backtrace > > > at this point. > > > So, if we consider this shallow, cp_parser_parenthesized_expression_list > > > either has to refrain from using a vector with garbage-collected > > > allocation, or it has to make the pointer reachable from a GC root - at > > > least if function_depth <= 1. > > > Is the attached patch the right approach?
> > > > > > When looking at regression test results for gcc version 9.0.0 20181028 > > > (experimental), the excess errors test for g++.dg/pr85039-2.C seems to > > > pass, yet I can see no definite reason in the source code why that is > > > so. I tried running the test by hand in order to check if maybe the > > > patch for PR c++/84455 plays a role, > > > but running the test by hand, it crashes again, and gdb shows the > > > telltale a5 pattern in a pointer register. > > > #0 vec::quick_push (obj=, > > > this=0x705ece60) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:974 > > > #1 vec_safe_push (obj=, > > > v=@0x7fffd038: 0x705ece60) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:766 > > > #2 cp_parser_parenthesized_expression_list ( > > > parser=parser@entry=0x77ff83f0, > > > is_attribute_list=is_attribute_list@entry=0, > > > cast_p=cast_p@entry=false, > > > allow_expansion_p=allow_expansion_p@entry=true, > > > non_constant_p=non_constant_p@entry=0x7fffd103, > > > close_paren_loc=close_paren_loc@entry=0x0, wrap_locations_p=false) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:7803 > > > #3 0x006e910d in cp_parser_initializer ( > > > parser=parser@entry=0x77ff83f0, > > > is_direct_init=is_direct_init@entry=0x7fffd102, > > > non_constant_p=non_constant_p@entry=0x7fffd103, > > > subexpression_p=subexpression_p@entry=false) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:22009 > > > #4 0x0070954e in cp_parser_init_declarator ( > > > parser=parser@entry=0x77ff83f0, > > > decl_specifiers=decl_specifiers@entry=0x7fffd1c0, > > > checks=checks@entry=0x0, > > > function_definition_allowed_p=function_definition_allowed_p@entry=true, > > > member_p=member_p@entry=false, declares_class_or_enum= > > out>, > > > function_definition_p=0x7fffd250, maybe_range_for_decl=0x0, > > > init_loc=0x7fffd1ac, 
auto_result=0x7fffd2e0) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:19827 > > > #5 0x00711c5d in cp_parser_simple_declaration > > > (parser=0x77ff83f0, > > > function_definition_allowed_p=, > > > maybe_range_for_decl=0x0) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:13179 > > > #6 0x00717bb5 in cp_parser_declaration (parser=0x77ff83f0) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:12876 > > > #7 0x0071837d in cp_parser_translation_unit > > > (parser=0x77ff83f0) > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:4631 > > > #8 c_parse_file () > > > at > > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:39108 > > > #9 0x00868db1 in
Re: Large Fortran program succesfully LTO'd.
> Honza, > > At the Cauldron meeting last week I mentioned that I wasn't able to compile > our "small" weather forecasting program with LTO. > > In the mean time I have read some bug reports with "multiple prevailing ..." > errors, which made me try linking with the 'gold' linker - that worked. > > So the only things building the software that I changed were adding -flto > and -fuse-ld=gold. > > Some statistics of the source code: > > 3902 files totaling 1081944 lines. > > The result works flawlessly. > > Over the weekend I will study the LTO diagnostics to see what decisions were > made with respect to inlining and other optimizations.

Great to hear, thanks! I wonder if it would be possible to have the -flto-report -fmem-report -Q output of the link. I am just trying to get more aggressive type merging to work, but it is C++ only (because I can make use of the ODR rule here), and I wonder how many types you get in other languages (the hope is that C++ is the type-heaviest of them all). I would be interested if performance looks good, too! Honza

> > Thanks ! > > -- > Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290 > Saturnushof 14, 3738 XG Maartensdijk, The Netherlands > At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ > Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: GCC 8 vs. GCC 9 speed and size comparison
> Hi, > > On Mon, 15 Apr 2019, Martin Liška wrote: > > > There's a similar comparison that I did for the official openSUSE gcc > > packages. gcc8 is built with PGO, while the gcc9 package is built in 2 > > different configurations: PGO, LTO, PGO+LTO (LTO used for FE in stage4, > > for generators in stage3 as well). > > > > Please take a look at attached statistics. > > It seems the C++ parser got quite a bit slower with gcc 9 :-( Most visible > in the compile time for tramp-3d (24%) and kdecore.cc (18% slower with > just PGO); it seems that the other .ii files are C-like enough to not > trigger this behaviour, so it's probably something to do with templates > and/or classes. Would it be possible to have -ftime-report output for your tramp3d build with the four compilers? It may be a consequence of training changes, i.e. template instantiation code is exercised by tramp3d a lot more than by GCC bootstrap. But it still seems a bit too much for simple PGO divergence. It may also be caused by changes in inliner decisions or some other pass being overly active. Honza > > > Ciao, > Michael.
Re: Expanding roundeven (Was: Re: About GSOC.)
> Hello. > I have already sent a patch for expanding roundeven for i386 with > relevant doubts. I also was regression testing with > make -k check > after successful bootstrap build with reverting my patches. Turns out > do-check fails without any patches applied, Is it ok to do anyways for > applied patch? Also, /contrib/compare_tests returns with "no common > files to compare" but I see all as the same files (actually, I am > moving *.sum files in manual directory and comparing with build of > applied patch.)? Is it due to failed make check? The usual procedure is to run make check on a clean tree and use contrib/test_summary to produce a summary, then do the same with the patched tree and compare the outputs (by diff). Honza > > Thanks, > -Tejas > > > On Wed, 19 Jun 2019 at 19:07, Tejas Joshi wrote: > > > > Hello. > > I have made following changes to inspect inlining of roundeven with > > the following test code: > > double > > plusone (double d) > > { > > return __builtin_roundeven (d) + 1; > > } > > > > Running the program using -O2 foo.c gave internal compiler error which > > I believe is because gcc_unreachable() at: > > > > if (TARGET_SSE4_1) > > emit_insn (gen_sse4_1_round2 > >(operands[0], operands[1], GEN_INT (ROUND_ > >| ROUND_NO_EXC))); > > I think the following condition matches the criterion? : > > > I think the code will be much clearer if it explicitly says > > > ROUND_ROUNDEVEN | ROUND_NO_EXC, > > > > else if (TARGET_64BIT || (mode != DFmode)) > > { > > if (ROUND_ == ROUND_FLOOR) > > ix86_expand_floorceil (operands[0], operands[1], true); > > else if (ROUND_ == ROUND_CEIL) > > ix86_expand_floorceil (operands[0], operands[1], false); > > else if (ROUND_ == ROUND_TRUNC) > > ix86_expand_trunc (operands[0], operands[1]); > > else > > gcc_unreachable (); > > } > > in: > > (define_expand "2" > > But with -mavx, generated the vroundsd insn. 
Does it mean ROUNDEVEN > > should have a condition in the else if, but comments above > > ix86_expand* functions in i386-expand.c says that those are for SSE2 > > sequences? > > > > Thanks, > > --Tejas > > > > > > On Mon, 17 Jun 2019 at 22:45, Joseph Myers wrote: > > > > > > On Mon, 17 Jun 2019, Tejas Joshi wrote: > > > > > > > > existing ROUND_NO_EXC definition in GCC. A new definition will need > > > > > adding alongside ROUND_FLOOR, ROUND_CEIL and ROUND_TRUNC to > > > > > correspond to > > > > > rounding to nearest with ties to even, evaluating to 0.) > > > > > > > > So (ROUND_ROUNDEVEN 0x0) be declared for rounding towards nearest > > > > even for rounding mode argument? But reference says that RC field > > > > should end up as 00B for rounding ties to even? I am also much > > > > > > I think the code will be much clearer if it explicitly says > > > ROUND_ROUNDEVEN | ROUND_NO_EXC, than if it hardcodes implicit knowledge > > > that 0 is the value used for rounding to nearest with ties to even. > > > > > > -- > > > Joseph S. Myers > > > jos...@codesourcery.com
Re: [RFC] zstd as a compression algorithm for LTO
> > At least allow it to be built as part of the normal build like GMP, > etc. are done. > And include it in downloading using contrib/download_prerequisites > like the libraries are done. An annoying detail is that zstd builds with cmake, not autotools. Honza > > Thanks, > Andrew Pinski > > > > > Richard. > > > > >jeff > >
Re: [RFC] zstd as a compression algorithm for LTO
> On 6/20/19 12:58 PM, Thomas Koenig wrote: > > Am 20.06.19 um 11:07 schrieb Martin Liška: > >> On the contrary, decompression > >> of zstd with zlib will end with: > >> lto1: internal compiler error: compressed stream: data error > > > > So generating object files on one system and trying to read them > > on another system which does not happen to have a particular > > library installed would lead to failure? If that's the case, > > I am not sure that this is a good way of handling things. > > Yes, but LTO bytecode is not supposed to be a distributable format. In the longer term it should be. We ought to make it host independent and stable at least within major releases. I guess it is still OK to make a zstd-enabled host build require a zstd-enabled gcc elsewhere. Just the error message should be more informative, which I think is not hard to do - both zstd and zlib streams have recognizable headers. Another option is to put this into some common place per file. Honza > > Martin
Re: [PATCH] Add .gnu.lto_.meta section.
> On 6/21/19 2:34 PM, Richard Biener wrote: > > On Fri, Jun 21, 2019 at 12:20 PM Martin Liška wrote: > >> > >> Hi. > >> > >> The patch is about a new ELF section that will contain information > >> about LTO version. And for the future, used compression will be stored > >> here. The patch removes streaming of the version into each section. > > > > I'd like each section to have a header containing a compression method > > (compressed or not, and now zlib vs. zstd). We currently have a mix > > and knowledge is hidden. > > That would be possible, good idea. > > > > > I also remember I had old patches to make the data streamer compress > > the stream on-the-fly, not requiring the full uncompressed stream in > > memory and then re-access it for compression (needing more temporary > > memory). But IIRC the speedup was marginal. It may be more interesting as a memory optimization now that we do parallel WPA streaming, BTW. > > > > My thought is also that we should have exactly _one_ ELF section > > for the LTO bytecode and our own container inside (much like > > simple-object does for non-ELF). So having another one is, well, ugly ;) > > Having N sections for all symbols (functions and variables) is handy because > we read some of them during WPA (by IPA ICF). So having all in a single section > would make decompression more complicated. There is nothing preventing us from implementing our own subsections in that section, where each is compressed independently and can be copied to the output file, and still save on the size of headers needed for that. For example, all sections are keyed by the lto section enum + possibly a decl, which can both be streamed as integers rather than quite lengthy strings (the decl-to-name mapping is streamed into the global decl stream and available at all times). I guess one can go for one global LTO major/minor in the section header and then section type/is compressed/size + optional decl ID for each subsection, reducing headers to about 1+8 (+ optional 8) bytes per header. 
This can all be placed at the beginning of the file, followed by the raw data of the individual sections. I bet this is a lot smaller than ELF overhead :) > > I'm going to prepare a patch that will pull out a LTO section header > from compressed stream. This looks like a good step (and please stream it in a host-independent way). I suppose all these issues can be done one-by-one. Honza > > Martin > > > > >> Disadvantage is a format change that will lead to following errors when > >> LTO bytecode is used from a different version: > >> > >> $ gcc x.o > >> lto1: fatal error: bytecode stream in file ‘x.o’ generated with GCC > >> compiler older than 10.0 > >> > >> $ gcc-9 main.o > >> lto1: fatal error: bytecode stream in file ‘main.o’ generated with LTO > >> version 850.0 instead of the expected 8.0 > >> > >> I've been testing the patch. > >> Thoughts? > >> Martin >
Re: Dropping support of repo files (tlink)
> > >> I should have been clearer about Darwin: > > >> > > >> collect2 is required because it wraps the calling of lto-wrapper and ld. > > >> > > >> FWIW Darwin also passes all the “-frepo” testcases, however, I’m not > > >> aware of anyone actually > > >> using case #2 from Jonathan’s post. > > >> > > >> So, AFAIK the tlink capability isn’t required for modern C++ on Darwin; > > >> but, maybe deprecation is a > > >> safer step. > > > > > > Thank you for the information. > > > > > > Yes, I would be fine to deprecate that for GCC 10.1 > > > > “thinking aloud” - would it work to deprecate (or even disallow immediately > > if no-one is using it) LTO + frepo? > > (so that non-LTO frepo continues as long as anyone needs it) > > Does that even work? ;) It would need an intermediate compile-step to > instantiate templates to > another LTO object. It is iterating the linker executions so it definitely makes no sense. One problem is that in collect2, at the time of deciding whether to open a temporary file for the linker output (which is the problem Martin tries to solve), we do not know yet whether we will do -flto (because that will be decided by the linker plugin later) or repo (because it works by looking for .repo files only after the link step has failed). I suppose we could block -frepo when -flto is on the command line, which is the most common case for LTO builds, arriving at the slightly odd situation that at link time -flto will affect nothing but the colors of diagnostics (I would really like to have this solved for gcc 10 in some way). Honza > > Richard. > > > > > Iain > > > > > > > >
Re: [PATCH] Deprecate -frepo option.
> > > On 27 Jun 2019, at 19:21, Jan Hubicka wrote: > > > >> > >> It's useful on targets without COMDAT support. Are there any such > >> that we care about at this point? > >> > >> If the problem is the combination with LTO, why not just prohibit that? > > > > The problem is that at the collect2 time we want to decide whether to > > hold stderr/stdout of the linker. The issue is that we do not know yet > > if we are going to LTO (because that is decided by linker plugin) or > > whether we do repo files (because we look for those only after linker > > failure). > > you could pre-scan for LTO, as is done for the collect2+non-plugin linker. > (similar to the comment below, that would take some time, but probably > small c.f the actual link). The collect2+non-plugin path does not handle static archives, so it would need more work, while I think we should kill that code (and make Darwin use lld). Honza > Iain > > > We can look for repo files first but that could take some time > > especially for large programs (probably not too bad compared to actual > > linking later) or we may require -frepo to be passed to collect2. > > > > Honza > >> > >> Jason >
Re: [PATCH] Deprecate -frepo option.
> > It's useful on targets without COMDAT support. Are there any such > that we care about at this point? > > If the problem is the combination with LTO, why not just prohibit that? The problem is that at the collect2 time we want to decide whether to hold stderr/stdout of the linker. The issue is that we do not know yet if we are going to LTO (because that is decided by linker plugin) or whether we do repo files (because we look for those only after linker failure). We can look for repo files first but that could take some time especially for large programs (probably not too bad compared to actual linking later) or we may require -frepo to be passed to collect2. Honza > > Jason
Re: Questions about IPA/clones and new LTO pass
> On Mon, 2019-12-09 at 17:59 -0500, Erick Ochoa wrote: > > Hello, > > > > this is an update on the LTO pass we've been working on. The > > optimization is called ipa-initcall-cp because it propagates constant > > values written to variables with static lifetimes (such as ones > > initialized in initialization functions). > [ ... ] > Just a note, I suspect most of the development community is currently > focused on stage3 bugfixing rather than new code development. So > replies may be limited. > > Jan Hubicka is probably the best person to answer your questions, but > others may be able to do so as well. Indeed, I am a bit swamped at the moment, but I will try to look at this on Thursday. Sorry for being slow - feel free to ping me about any IPA/LTO-related issues if I respond slowly or not at all. Honza > > jeff >
Re: Fw: GSoC topic: Implement hot cold splitting at GIMPLE IR level
Aditya, the hot/cold partitioning is currently organized as follows: 1) we have static branch prediction code in predict.c and profile feedback which we store into cfg and callgraph. 2) we have predicates optimize_*_for_speed_p/size_p where * can be function, basic block, cfg edge, callgraph edge, loop or loop nest 3) all optimization passes trading code size for speed should use the corresponding predicates to decide whether to do the transform. 4) ipa-split pass is mostly there to enable partial inlining 5) hot/cold partitioning in bb-reorder offlines cold blocks 6) we have the ipa-icf pass to identify identical functions and some code in FRE to identify identical code within a single function. 7) we do shrink wrapping to mitigate register pressure problems caused in cold regions of code I think this is a bit stronger than what llvm does currently, which relies on outlining SESE regions earlier rather than going through the pain of implementing support for partitioning. Building clang10 with GCC and FDO leads to 37MB .text section Building clang10 with GCC and LTO+FDO leads to 34MB .text section Building clang10 with clang10 and FDO leads to 53MB .text section Building clang10 with clang10 and thinlto+FDO leads to 67MB .text section GCC built clang is about 2-3% faster building Firefox. There are many things which I think could/should improve in our framework. 1) our size optimization is very aggressive (llvm's -Oz) and enabling it based on heuristics may lead to huge regressions. We probably want the optimize_for_size_p predicate to have two levels and use a less aggressive mode in places where we are not sure the code is really very cold. 2) ipa-split is very simplistic and only splits when no value computed in the header of the function is used in the tail. We should support adding extra parameters for values computed and do more general SESE outlining. Note that we do SESE outlining for OpenMP, but this code is not interfaced generically enough to be easily used by ipa-split. 
The original implementation of ipa-split was a kind of "first cut" trying to clean up interfaces to the rest of the compiler and implement fancier features later. This never happened, so there is certainly space for improvements here. We also do all splitting before actual IPA optimization, while it may be more reasonable to identify potential split points and let IPA optimization decide on transforms (currently we rely on the inliner to inline back useless splits). 3) function partitioning is enabled only for x86. I never had time to get it working on other targets and no-one picked up this task. 4) ipa-icf and in-function code merging are currently very conservative when comparing metadata like type-based aliasing info (I plan to work on this next stage1). 5) we have only very limited logic to detect cold regions without profile feedback and thus the amount of offlined code is very small (this is also because of 1). We basically know that code leading to abort/exception handling etc. is cold and consider everything else hot. 6) we lack a code placement pass (though Martin has a WIP implementation of it) 7) we do no partitioning of the data segment, which may also be interesting to do. 8) most non-x86 backends do not implement the hot/cold code models and instruction choice very well. 9) shrink-wrapping and register allocation are not always able to move spilling to cold code paths, but this is generally a very hard problem to track. So there is a lot of room for improvement (and I am sure more can be found) and I would be happy to help you with them. Honza > > Hi Martin, > Thank you for explaining the status quo. After reading the code of > bb-reorder.c, > it looks pretty good and seems it doesn't need any significant improvements. > In that case, the only value GIMPLE level hot/cold splitting could bring is > to enable aggressive code-size optimization > by merging of similar/identical functions: after outlining cold regions, they > may be suitable candidates for function merging. 
> ipa-split might be enabling some of that, having a region based function > splitting could improve ipa-split. > > -Aditya > > > -- > From: Martin Liška > Sent: Tuesday, March 3, 2020 2:47 AM > To: Aditya K ; gcc@gcc.gnu.org > Cc: Jan Hubicka > Subject: Re: GSoC topic: Implement hot cold splitting at GIMPLE IR level > > Hello. > Thank you for idea. I would like to provide some comments about what GCC can > currently > do and I'm curious we need something extra on top of what we do. > Right now, GCC can do hot/cold partitioning based on functions and basic > blocks. With > a PGO profile, the optimization is quite aggressive and can save quite some > code > being placed into a cold partitioning and being optimized for size. Without a > profile, > we do a static
Re: Comparing types at LTO time
> There doesn't seem to be a way to compare types at LTO time. The functions > same_type_p and comptypes are front end only if I'm not totally confused > (which is quite possible) and type_hash_eq doesn't seem to apply for > structure types. Please, any advice would be welcome. At LTO time it is a bit hard to say what really is the "same type". We have multiple notions of equivalence: 1) TYPE_MAIN_VARIANT (t1) == TYPE_MAIN_VARIANT (t2) means that both types were equal in tree merging at stream-in; this means that their IL representation is identical. This will lead to "false" for types that are the same but did not get merged for various reasons. One such valid reason, for example, is the visibility of associated virtual tables. 2) types_same_for_odr returns true if types are considered the same by the C++ one-definition rule. This is reliable but works only for C++ types with names (structures and unions). 3) same_type_for_tbaa returns true if types are equivalent for type-based alias analysis. It returns true in many more cases than 1 but is too coarse if you want to do datastructure changes. So in general this is quite a hard problem (and in fact I started to play with ODR types originally to understand it better). I would suggest starting with 1 if you want to rewrite types and eventually adding a new comparison once the pass does something useful. Richard may have some extra insights. Honza > > Thanks, > > Gary Oblock >
Re: [EXT] Re: Option processing question
> On Mon, Jan 13, 2020 at 9:28 PM Gary Oblock wrote: > > > > If I have an optimization set up to run in LTO (call it my_opt) and > > the flags -flto-partition=one, -flto options and -fmy-opt are set > > then the optimization might or might not be run depending of whether > > it can all fit in one partition. > > > > What I'm thinking is as long as it's a fatal error detectable anywhere > > upstream in the compilation to not specify -fmy-opt without > > -flto-partition=one > > then all will be well. So how do I detect it at all and where would I put > > the checking? > > I guess you could always check the symbols used_from_other_partition > and/or in_other_partition flags? That is, it might be perfectly possible > for some corner-cases to do the transform independently with more than one > partition if the partitioning happens to isolate things perfectly? Well, if you do the optimization at ltrans time you basically see symbols in other partitions as external, so there is not that much practical difference in the implementation of the analysis, except that it will catch fewer transforms. But the idea for keeping -flto-partition=none around was to make it possible to write a prototype IPA pass that does not need to play with partitioning and incrementally add WPA bits. It seems that this strategy may work here - having a prototype that works only w/o WHOPR but produces valid code & does something useful seems like a good first step (which is hard enough on its own). So I would suggest looking at -flto-partition=none first and keeping in mind that eventually we will want a solution that plays well with partitioning. Honza > > Richard. 
> > > Gary > > ____________ > > From: Richard Biener > > Sent: Monday, January 13, 2020 2:30 AM > > To: Gary Oblock ; Jan Hubicka > > Cc: gcc@gcc.gnu.org > > Subject: [EXT] Re: Option processing question > > > > External Email > > > > -- > > On Sat, Jan 11, 2020 at 4:47 AM Gary Oblock wrote: > > > > > > I'm writing an LTO optimization that requires "-flto-partition=one" How > > > can I make > > > sure that this is the case? I've spent hours greping the code and the > > > Internals Doc is > > > worth less than nothing for something like this. > > > > That's of course because you shouldn't be doing this ;) > > > > > If you have an answer or even > > > I good idea of where to look please let me know. Note, for an normal > > > "-fbalh" there's > > > a flag_blah that I could look at but for this there seems to be zip, nil, > > > diddly squat, etc. > > > > At LTRANS time you indeed don't know. > > > > But there's -flto-partition=none (none, not one), that you can detect > > somehow > > (I think in_lto_p && !flag_ltrans && !flag_wpa). > > > > Richard. > > > > > Many thanks, > > > > > > Gary
Re: Question about undefined functions' parameters during LTO
> Hello, Hello, > > I am trying to find out the arguments of functions which are undefined > during LTO. > > Basically: > > gcc_assert(in_lto_p && !cnode->definition) > // Do we have arguments? > gcc_assert(DECL_ARGUMENTS(cnode->decl)) // fails > // No, we don't. > > As I understand it, functions which are not defined are ones which have have > been declared external. LTO works in a way that function bodies are loaded from the stream on demand (using node->get_body or node->get_untransformed_body calls). Declarations of function parameters are part of the function body, and that is why they are missing. You can get all bodies if you experiment with your pass using the calls above. If you want a real LTO pass working on the whole program, eventually you will need to avoid reading all bodies (or the WPA stage will be slow and memory hungry) and store the info you need into summaries. ipa-prop already stores the number and types of arguments, which may be all you need. > > I believe that, when building an application with -flto, the only functions > which are not visible during LTO **and** are declared external are functions > defined in libraries which have not been compiled with -flto. An example of > this is glibc. > > Indeed, I have implemented an analysis pass in gcc which prints out > undefined functions, and it prints out the following: > > undefined function __gcov_merge_add > undefined function fopen > undefined function printf > undefined function __builtin_putchar > undefined function calloc > undefined function __gcov_merge_topn > undefined function strtol > undefined function free > ... and more > > Now, I am not interested in the bodies of these. I am only interested in > determining the type of the arguments passed to these functions. 
However, > when I call the following function: See ipa_get_type Honza > > ``` > void > print_parameters_undefined_functions(const cgraph_node *cnode) > { > gcc_assert(cnode); > gcc_assert(in_lto_p); > gcc_assert(!cnode->definition); > > tree function = cnode->decl; > gcc_assert(function); > enum tree_code code = TREE_CODE (function); > bool is_function_decl = FUNCTION_DECL == code; > gcc_assert (is_function_decl); > > log("about to print decl_arguments(%s)\n", cnode->name()); > for (tree parm = DECL_ARGUMENTS (function); parm; parm = DECL_CHAIN(parm)) > { > log("hello world\n"); > } > ``` > > I never see "hello world" but I do see "about to print...". > Does anyone have any idea on how to obtain the arguments to undefined > functions? > > The only way I see to do this, is to walk through the gimple instructions, > find GIMPLE_CALL statements and look at the argument list at that moment. > But I was wondering if there's a more efficient way to do it. > > Thanks!
Re: Thought on inlining indirect function calls
> > I think it is because during early opts we do not know if f is not > > modified by atoi call and at IPA time (where we already know that from > > resolution info) we do not have jump functions to track global variables. > > I confirm that if I remove the calls to atoi, the function is correctly > inlined. > Note however that, if I declare f as static, such as it can theoretically not > be > impacted by atoi, the compiler still doesn't inline the function. The problem here is that the compiler thinks that atoi may call back into some static function in the unit that may change the value of the variable. There is no analysis implemented of how such calls back into the unit can affect the value. We have the leaf attribute with which you can annotate atoi (but glibc does not do that), which should make gcc work out the propagation. Honza > > Yet, if STEP 1 was an issue, and that your answers has provided some > clarification, > what is about the STEP 2? As far as I know, it is not supported, it is? > > Frédéric Recoules > > > - Mail original - > De: "Jan Hubicka" > À: "Richard Biener" > Cc: "gcc" , "FRÉDÉRIC RECOULES" > , mjam...@suse.cz > Envoyé: Samedi 14 Mars 2020 14:05:07 > Objet: Re: Thought on inlining indirect function calls > > > >I was pretty disappointed to see that even if the compiler knows we are > > >calling f_add, it doesn't inline the call (it ends up with "call > > >f_add"). > > > > It's probably because we know it's only called once and thus not > > performance relevant. Try put it into a loop. > > I think it is because during early opts we do not know if f is not > modified by atoi call and at IPA time (where we already know that from > resolution info) we do not have jump functions to track global variables. > > Honza > > > > Richard. > > > > >I can but only suppose it is because its address is taken and from a > > >blind black box user perspective, it doesn't sound too difficult to > > >completely inline it. 
> > > > > >STEP 2: statically known as being among a pool of less than > > > (arbitrarily fixed = 2) N functions > > > > > >#include > > >#include > > >#include > > > > > >int main (int argc, char *argv[]) > > >{ > > > int x, y, z; > > > enum f_e e; > > > if (argc < 4) return -1; > > > if (strcmp(argv[1], "add") == 0) > > > e = ADD; > > > else if (strcmp(argv[1], "sub") == 0) > > > e = SUB; > > > else return -1; > > > f_init(e); > > > x = atoi(argv[2]); > > > y = atoi(argv[3]); > > > z = f(x, y); > > > printf("%d\n", z); > > > return 0; > > >} > > > > > >Here the compiler can't know at compile time the function that will be > > >called but I suppose that it knows that it will be either f_add or > > >f_sub. > > >A simple work around would be for the compiler to test at the call site > > >the value of f and inline the call thereafter: > > > > > > if (f == _add) > > > z = f_add(x, y); > > > else if (f == _sub) > > > z = f_sub(x, y); > > > else __builtin_unreachable(); /* or z = f(x, y) to be conservative */ > > > > > >Once again, this transformation don't sound too complicated to > > >implement. > > >Still, easy to say-so without diving into the compiler's code. > > > > > > > > >I hope it will assist you in your reflections, > > >Have a nice day, > > >Frédéric Recoules > > >
Re: Thought on inlining indirect function calls
> >I was pretty disappointed to see that even if the compiler knows we are > >calling f_add, it doesn't inline the call (it ends up with "call > >f_add"). > > It's probably because we know it's only called once and thus not performance > relevant. Try put it into a loop. I think it is because during early opts we do not know if f is not modified by atoi call and at IPA time (where we already know that from resolution info) we do not have jump functions to track global variables. Honza > > Richard. > > >I can but only suppose it is because its address is taken and from a > >blind black box user perspective, it doesn't sound too difficult to > >completely inline it. > > > >STEP 2: statically known as being among a pool of less than > >(arbitrarily fixed = 2) N functions > > > >#include > >#include > >#include > > > >int main (int argc, char *argv[]) > >{ > >int x, y, z; > >enum f_e e; > >if (argc < 4) return -1; > >if (strcmp(argv[1], "add") == 0) > > e = ADD; > >else if (strcmp(argv[1], "sub") == 0) > > e = SUB; > >else return -1; > >f_init(e); > >x = atoi(argv[2]); > >y = atoi(argv[3]); > >z = f(x, y); > >printf("%d\n", z); > >return 0; > >} > > > >Here the compiler can't know at compile time the function that will be > >called but I suppose that it knows that it will be either f_add or > >f_sub. > >A simple work around would be for the compiler to test at the call site > >the value of f and inline the call thereafter: > > > >if (f == _add) > >z = f_add(x, y); > >else if (f == _sub) > >z = f_sub(x, y); > > else __builtin_unreachable(); /* or z = f(x, y) to be conservative */ > > > >Once again, this transformation don't sound too complicated to > >implement. > >Still, easy to say-so without diving into the compiler's code. > > > > > >I hope it will assist you in your reflections, > >Have a nice day, > >Frédéric Recoules >
Re: LTO slows down calculix by more than 10% on aarch64
> Thanks for the suggestions. > Is it possible to modify assembly files emitted after ltrans phase ? > IIUC, the linker invokes lto1 twice, for wpa and ltrans,and then links > the obtained object files which doesn't make it possible to hand edit > assembly files post ltrans ? > In particular, I wanted to modify calculix.ltrans16.ltrans.s, which > contains e_c3d to avoid the extra branch. > (If that doesn't work out, I can proceed with manually inlining in the > source and then modifying generated assembly). It is not intended to work that way, but for smaller benchmarks you can just keep the .s files, modify them, and then compile again with gfortran *.s or so. Honza > > Thanks, > Prathamesh > > > > Richard. > > > > > Thanks, > > > Prathamesh > > > > > > > > > which corresponds to the following loop in line 1014. > > > > > do n1=1,3 > > > > > s(iii1,jjj1)=s(iii1,jjj1) > > > > > & +anisox(m1,k1,n1,l1) > > > > > & *w(k1,l1)*vo(i1,m1)*vo(j1,n1) > > > > > & *weight > > > > > > > > > > I am not sure why would hoisting have any direct effect on this loop > > > > > except perhaps that hoisting allocated more reigsters, and led to > > > > > increased register pressure. Perhaps that's why it's using highered > > > > > number regs for code-gen in inlined version ? However disabling > > > > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills > > > > > (by grepping for str to sp), so > > > > > hoisting is also helping here ? I am not sure how to proceed further, > > > > > and would be grateful for suggestions. > > > > > > > > > > Thanks, > > > > > Prathamesh
Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report
> On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi > wrote: > > > > Hi, > > > > This is the final report of the "Automatic Parallel Compilation > > Viability" project. Please notice that this report is pretty > > similar to the delivered from the 2nd evaluation, as this phase > > consisted of mostly rebasing and bug fixing. > > > > Please, reply this message for any question or suggestion. > > Thank you for your great work Giuliano! Indeed, it is quite amazing work :) > > It's odd that LTO emulated parallelism is winning here, > I'd have expected it to be slower. One factor might > be different partitioning choices and the other might > be that the I/O required is faster than the GC induced > COW overhead after forking. Note you can optimize > one COW operation by re-using the main process for > compiling the last partition. I suppose you tested > this on a system with a fast SSD so I/O overhead is > small? At the time I implemented fork-based parallelism for WPA (which I think we could recover by generalizing Giuliano's patches a bit), I had the same outcome: forked ltranses were simply running slower than those after streaming. This was however tested on Firefox, in my estimate sometime around 2013. I never tried it on units comparable to insn-emit (which would be different at that time anyway). I was mostly aiming to get it fully transparent with streaming but never quite finished it since, at that time, I thought the time was better spent on optimizing LTO data layout. I suppose we want to keep both mechanisms in both WPA and normal compilation and make the compiler choose the fitting one. Honza > > Thanks again, > Richard. > > > Thank you, > > Giuliano. > > > > --- 8< --- > > > > # Automatic Parallel Compilation Viability: Final Report > > > > ## Complete Tasks > > > > For the third evaluation, we expected to deliver the product as a > > series of patches for trunk. The patch series were in fact delivered > > [1], but several items must be fixed before merge. 
> > > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x. > > Bootstrap is working, and therefore this can be used in an experimental > > state. > > > > ## How to use > > > > 1. Clone the autopar_devel branch: > > ``` > > git clone --single-branch --branch devel/autopar_devel \ > > git://gcc.gnu.org/git/gcc.git gcc_autopar_devel > > ``` > > 2. Follow the standard compilation options provided in the Compiling > > GCC page, and install it on some directory. For instance: > > > > ``` > > cd gcc_autopar_devel > > mkdir build && cd build > > ../configure --disable-bootstrap --enable-languages=c,c++ > > make -j 8 > > make DESTDIR=/tmp/gcc11_autopar install > > ``` > > > > 3. If you want to test whether your version is working, just launch > > Gcc with `-fparallel-jobs=2` when compiling a file with -c. > > > > 5. If you want to compile a project with this version it uses GNU > > Makefiles, you must modify the compilation rule command and prepend a > > `+` token to it. For example, in Git's Makefile, Change: > > ``` > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > $(EXTRA_CPPFLAGS) $< > > ``` > > to: > > ``` > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > $(EXTRA_CPPFLAGS) $< > > ``` > > as well as point the CC variable to the installed gcc, and > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable. > > > > # How the parallelism works in this project > > > > In LTO, the Whole Program Analysis decides how to partition the > > callgraph for running the LTRANS stage in parallel. This project > > works very similar to this, however with some changes. > > > > The first was to modify the LTO structure so that it accepts > > the compilation without IR streaming to files. This avoid an IO > > overhead when compiling in parallel. 
> > > > The second was to use a custom partitioner to find which nodes > > should be in the same partition. This was mainly done to bring COMDAT > > together, as well as symbols that are part of other symbols, and even > > private symbols so that we do not output hidden global symbols. > > > > However, experiment showed that bringing private symbols together did > > not yield a interesting speedup on some large files, and therefore > > we implemented two modes of partitioning: > > > > 1. Partition without static promotion. This is the safer method to use, > > as we do not modify symbols in the Compilation Unit. This may lead to > > speedups in files that have multiple entries points with low > > connectivity between then (such as insn-emit.c), however this will not > > provide speedups when this hypothesis is not true (gimple-match.c is an > > example of this). This is the default mode. > > > > 2. Partition with static promotion to global. This is a more aggressive > > method, as we can decide to promote some functions to global to increase
Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report
> > I guess before investigating the current state in detail > > it might be worth exploring Honzas wish of sharing > > the actual partitioning code between LTO and -fparallel-jobs. > > > > Note that larger objects take a bigger hit from the GC COW > > issue so at some point that becomes dominant because the > > first GC walk for each partition is the same as doing a GC > > walk for the whole object. Eventually it makes sense to > > turn off GC completely for smaller partitions. > > Just a side note, I added a ggc_collect () before start forking > and it did not improve things. We were discussing it on IRC. Your problem is similar to what precompiled headers are hitting, too. They essentially make a memory dump of GGC memory that is mmapped back at the time the compiler recognizes the corresponding #include. There is a mechanism in the garbage collector to push and pop a context, which was partly removed in the meantime. Pushing a context makes all data in the garbage collector permanent until you pop it again, and the marking is not supposed to dirty the pages. It is still done for PCH; see ggc_pch_read and the line setting context_depth to 1. In theory, setting context_depth to 1 before forking and fixing the places where we dirty the pages should help both the fork-based parallelism and the PCHs. As richi mentioned, the GGC seems to set in_use_p anyway, which may trigger unwanted COW. I would also trim the memory pool and malloc before forking; that should be cheap and work. It is already done by WPA before fork as well. Honza > > Thank you, > Giuliano. > > > > > Richard. > > > > > Honza > > > > > > > > Thanks again, > > > > Richard. > > > > > > > > > Thank you, > > > > > Giuliano. > > > > > > > > > > --- 8< --- > > > > > > > > > > # Automatic Parallel Compilation Viability: Final Report > > > > > > > > > > ## Complete Tasks > > > > > > > > > > For the third evaluation, we expected to deliver the product as a > > > > > series of patches for trunk. 
The patch series were in fact delivered > > > > > [1], but several items must be fixed before merge. > > > > > > > > > > > > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x. > > > > > Bootstrap is working, and therefore this can be used in an > > > > > experimental > > > > > state. > > > > > > > > > > ## How to use > > > > > > > > > > 1. Clone the autopar_devel branch: > > > > > ``` > > > > > git clone --single-branch --branch devel/autopar_devel \ > > > > > git://gcc.gnu.org/git/gcc.git gcc_autopar_devel > > > > > ``` > > > > > 2. Follow the standard compilation options provided in the Compiling > > > > > GCC page, and install it on some directory. For instance: > > > > > > > > > > ``` > > > > > cd gcc_autopar_devel > > > > > mkdir build && cd build > > > > > ../configure --disable-bootstrap --enable-languages=c,c++ > > > > > make -j 8 > > > > > make DESTDIR=/tmp/gcc11_autopar install > > > > > ``` > > > > > > > > > > 3. If you want to test whether your version is working, just launch > > > > > Gcc with `-fparallel-jobs=2` when compiling a file with -c. > > > > > > > > > > 5. If you want to compile a project with this version it uses GNU > > > > > Makefiles, you must modify the compilation rule command and prepend a > > > > > `+` token to it. For example, in Git's Makefile, Change: > > > > > ``` > > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > > > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > > > > $(EXTRA_CPPFLAGS) $< > > > > > ``` > > > > > to: > > > > > ``` > > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > > > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > > > > $(EXTRA_CPPFLAGS) $< > > > > > ``` > > > > > as well as point the CC variable to the installed gcc, and > > > > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable. 
> > > > > > > > > > # How the parallelism works in this project > > > > > > > > > > In LTO, the Whole Program Analysis decides how to partition the > > > > > callgraph for running the LTRANS stage in parallel. This project > > > > > works very similar to this, however with some changes. > > > > > > > > > > The first was to modify the LTO structure so that it accepts > > > > > the compilation without IR streaming to files. This avoid an IO > > > > > overhead when compiling in parallel. > > > > > > > > > > The second was to use a custom partitioner to find which nodes > > > > > should be in the same partition. This was mainly done to bring COMDAT > > > > > together, as well as symbols that are part of other symbols, and even > > > > > private symbols so that we do not output hidden global symbols. > > > > > > > > > > However, experiment showed that bringing private symbols together did > > > > > not yield a interesting speedup on some large files, and therefore > > > > > we implemented two modes of partitioning: > > > > > > > > > > 1. Partition without static promotion. This is the safer method to > > > > > use, > > > > > as we do not
Re: Where did my function go?
> On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor wrote: > > > > Hi, > > > > On Tue, Oct 20 2020, Richard Biener wrote: > > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock > > > wrote: > > >> > > >> Richard, > > >> > > >> I guess that will work for me. However, since it > > >> was decided to remove an identical function, > > >> why weren't the calls to it adjusted to reflect it? > > >> If the call wasn't transformed that means it will > > >> be mapped at some later time. Is that mapping > > >> available to look at? Because using that would > > >> also be a potential solution (assuming call > > >> graph information exists for the deleted function.) > > > > > > I'm not sure how the transitional cgraph looks like > > > during WPA analysis (which is what we're talking about?), > > > but definitely the IL is unmodified in that state. > > > > > > Maybe Martin has an idea. > > > > > > > Exactly, the cgraph_edges is where the correct call information is > > stored until the inlining transformation phase calls > > cgraph_edge::redirect_call_stmt_to_callee is called on it - inlining is > > a special pass in this regard that performs this IPA-infrastructure > > function in addition to actual inlining. > > > > In cgraph means the callee itself but also information in > > e->callee->clone.param_adjustments which might be interesting for any > > struct-reorg-like optimizations (...and in future possibly in other > > transformation summaries). > > > > The late IPA passes are in very unfortunate spot here since they run > > before the real-IPA transformation phases but after unreachable node > > removals and after clone materializations and so can see some but not > > all of the changes performed by real IPA passes. The reason for that is > > good cache locality when late IPA passes are either not run at all or > > only look at small portion of the compilation unit. In such case IPA > > transformations of a function are followed by all the late passes > > working on the same function. 
> > > > Late IPA passes are unfortunately second class citizens and I would > > strongly recommend not to use them since they do not fit into our > > otherwise robust IPA framework very well. We could probably provide a > > mechanism that would allow late IPA passes to run all normal IPA > > transformations on a function so they could clearly see what they are > > looking at, but extensive use would slow compilation down so its use > > would be frowned upon at the very least. > > So IPA PTA does get_body () on the nodes it wants to analyze and I > thought that triggers any pending IPA transforms? Yes, it does (and get_untransformed_body does not) Honza > > Richard. > > > Martin > >
Re: Where did my function go?
> > On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor wrote: > > > > > > Hi, > > > > > > On Tue, Oct 20 2020, Richard Biener wrote: > > > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock > > > > wrote: > > > >> > > > >> Richard, > > > >> > > > >> I guess that will work for me. However, since it > > > >> was decided to remove an identical function, > > > >> why weren't the calls to it adjusted to reflect it? > > > >> If the call wasn't transformed that means it will > > > >> be mapped at some later time. Is that mapping > > > >> available to look at? Because using that would > > > >> also be a potential solution (assuming call > > > >> graph information exists for the deleted function.) > > > > > > > > I'm not sure how the transitional cgraph looks like > > > > during WPA analysis (which is what we're talking about?), > > > > but definitely the IL is unmodified in that state. > > > > > > > > Maybe Martin has an idea. > > > > > > > > > > Exactly, the cgraph_edges is where the correct call information is > > > stored until the inlining transformation phase calls > > > cgraph_edge::redirect_call_stmt_to_callee is called on it - inlining is > > > a special pass in this regard that performs this IPA-infrastructure > > > function in addition to actual inlining. > > > > > > In cgraph means the callee itself but also information in > > > e->callee->clone.param_adjustments which might be interesting for any > > > struct-reorg-like optimizations (...and in future possibly in other > > > transformation summaries). > > > > > > The late IPA passes are in very unfortunate spot here since they run > > > before the real-IPA transformation phases but after unreachable node > > > removals and after clone materializations and so can see some but not > > > all of the changes performed by real IPA passes. The reason for that is > > > good cache locality when late IPA passes are either not run at all or > > > only look at small portion of the compilation unit. 
In such case IPA > > > transformations of a function are followed by all the late passes > > > working on the same function. > > > > > > Late IPA passes are unfortunately second class citizens and I would > > > strongly recommend not to use them since they do not fit into our > > > otherwise robust IPA framework very well. We could probably provide a > > > mechanism that would allow late IPA passes to run all normal IPA > > > transformations on a function so they could clearly see what they are > > > looking at, but extensive use would slow compilation down so its use > > > would be frowned upon at the very least. > > > > So IPA PTA does get_body () on the nodes it wants to analyze and I > > thought that triggers any pending IPA transforms? > > Yes, it does (and get_untransformed_body does not) And to correct Martin's explanation a bit: the late IPA passes are intended to work, though I was mostly planning them for prototyping true IPA passes and also possibly for implementing passes that inspect only a few functions. IPA transforms happen when get_body is called. With LTO this also triggers reading the body from disk. So if you want to see all bodies and work on them, you can simply call get_body on everything, but it will result in increased memory use, since everything will be loaded from disk and expanded (by inlining) at once instead of on a per-function basis. get_body is simply meant to arrange the body on demand. The pass manager uses it before late passes are executed, and ipa-pta uses it before it builds constraints (which is not good, for the reasons described above). Clone materialization is also triggered by get_body. The clone materialization pass mostly exists to remove unreachable function bodies. I plan to get rid of it, since now that we are better at doing IPA transforms it brings in a lot of bodies already. For cc1plus it is well over 1GB of memory. Honza > > Honza > > > > Richard. > > > > > Martin > > >
Re: Git rejecting branch merge
> On Tue, Sep 29, 2020 at 9:17 AM Jan Hubicka wrote: > > > > Hello, > > I am trying to update me/honza-gcc-benchmark-branch to current trunk > > which I do by deleting it, recreating locally and pushing out. > > > > The problem is that the push fails with: > > > > remote: *** The following commit was rejected by your > > hooks.commit-extra-checker script (status: 1) > > remote: *** commit: 03e87724864a17e22c9b692cc0caa014e9dba6b1 > > remote: *** The first line of a commit message should be a short > > description of the change, not a single word. > > remote: error: hook declined to update > > > > The broken commit is: > > > > $ git show 03e87724864a17e22c9b692cc0caa014e9dba6b1 > > > > commit 03e87724864a17e22c9b692cc0caa014e9dba6b1 > > Author: Georg-Johann Lay > > Date: Tue Jan 14 11:09:38 2020 +0100 > > It's odd that your branch contains a change as old as this? I use it for benchmarking on lnt and it has indeed been used for some time. > > Maybe you can rebase onto sth more recent. That is what I am trying to do. But my understanding is that in the meantime the hook checking for more than one word in the description was implemented, and thus I can't. Perhaps I can just remove the remote branch and create it again? Honza > > > > > > Typo. > > libgcc/ > > * config/avr/lib1funcs.S (skip): Simplify. > > > > > > I wonder, is there a way to get this done? > > > > Thanks, > > Honza
Re: Git rejecting branch merge
> On Tue, 29 Sep 2020, Joel Brobecker wrote: > > > > That's correct. The commit-extra-checker is called using the same list > > > > of "added commits" as the other checks implemented in the hooks, and > > > > that list excludes all commits accessible from existing references > > > > in the repository. > > > > > > Since 03e87724864a17e22c9b692cc0caa014e9dba6b1 has been in the repository > > > (on master) since before GCC 10 branched, something must be going wrong > > > for a push to be rejected based on a check of that commit. > > > > OK. Can you create a tarball of the GCC's bare repository as it is now, > > and give me access to both this tarball and the branch that the user > > was trying to update, I can try to spend a bit of time this weekend > > trying to reproduce and then investigating it. And just for the avoidance > > of doubt, if I could get the git command that was used to attempt > > the push, this would avoid wasting time investigating the wrong thing. > > It's /git/gcc.git on sourceware (I think you have shell access, or can git > clone --mirror to get a full bare copy). > refs/users/hubicka/heads/honza-gcc-benchmark-branch currently points to > c478047c0fd71e8bd8e069c729b57a89b75ee004, "Add changelog". I don't know > exactly what Honza is pushing there (whether it's a merge or a rebase), > but whatever he's pushing, the hook should not be checking commits that > are already in the repository. I did the following (and perhaps you can do it for me): git branch -D me/honza-gcc-benchmark-branch git checkout master git co -b me/honza-gcc-benchmark-branch git push -f Which I hope should replace the existing (old) branch with a copy of current trunk, on which I could do more tuning tests. Honza > > -- > Joseph S. Myers > jos...@codesourcery.com
Re: Git rejecting branch merge
> On Tue, Sep 29, 2020 at 07:16:45PM +0200, Jan Hubicka wrote: > > > On Tue, 29 Sep 2020, Joel Brobecker wrote: > > > > > > > > > That's correct. The commit-extra-checker is called using the same > > > > > > list > > > > > > of "added commits" as the other checks implemented in the hooks, and > > > > > > that list excludes all commits accessible from existing references > > > > > > in the repository. > > > > > > > > > > Since 03e87724864a17e22c9b692cc0caa014e9dba6b1 has been in the > > > > > repository > > > > > (on master) since before GCC 10 branched, something must be going > > > > > wrong > > > > > for a push to be rejected based on a check of that commit. > > > > > > > > OK. Can you create a tarball of the GCC's bare repository as it is now, > > > > and give me access to both this tarball and the branch that the user > > > > was trying to update, I can try to spend a bit of time this weekend > > > > trying to reproduce and then investigating it. And just for the > > > > avoidance > > > > of doubt, if I could get the git command that was used to attempt > > > > the push, this would avoid wasting time investigating the wrong thing. > > > > > > It's /git/gcc.git on sourceware (I think you have shell access, or can > > > git > > > clone --mirror to get a full bare copy). > > > refs/users/hubicka/heads/honza-gcc-benchmark-branch currently points to > > > c478047c0fd71e8bd8e069c729b57a89b75ee004, "Add changelog". I don't know > > > exactly what Honza is pushing there (whether it's a merge or a rebase), > > > but whatever he's pushing, the hook should not be checking commits that > > > are already in the repository. > > > > I did the following (and perhaps you can do it for me) > > > > git branch -D me/honza-gcc-benchmark-branch > > git checkout master > > git co -b me/honza-gcc-benchmark-branch > > git push -f > > > > Which I hope should replace the existing (old) branch by a copy of > > current trunk on which I could do more of tuning tests. 
> > So, IIUC, you're trying to replace the old > refs/users/hubicka/heads/honza-gcc-benchmark-branch > with whatever is on master at this moment, is that correct > (that would wipe whatever changes you've made in your old > branch)? > > For instance, currently master is pointing to commit > adcf8a11c772e7a0c64d4ae3eb19a520566f32b9. Yes, I want to replace it with current master. Thanks, Honza > > -- > Joel
Re: Git rejecting branch merge
> > So, IIUC, you're trying to replace the old > > refs/users/hubicka/heads/honza-gcc-benchmark-branch > > with whatever is on master at this moment, is that correct > > (that would wipe whatever changes you've made in your old > > branch)? > > > > For instance, currently master is pointing to commit > > adcf8a11c772e7a0c64d4ae3eb19a520566f32b9. > > Yes, I want to replace it with current master. Hello, I wonder if I can get the branch moved, so I can do the benchmarking :) Any suggestions how to do that? Thanks, Honza > > Thanks, > Honza > > > > -- > > Joel
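The branch replacement discussed in this thread can also be expressed as a single forced push, with no local delete/recreate dance. A minimal sketch, using a throwaway local bare repository to stand in for the real gcc.git server (the paths and commit message here are purely for the demo):

```shell
# Stand up a scratch "server" and a working clone.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q --bare server.git
git init -q work && cd work
git config user.email demo@example.com
git config user.name "Demo User"
echo hello > file.txt && git add file.txt
git commit -qm "Initial commit for the demo"
git branch -M master
git remote add origin ../server.git
git push -q origin master

# Point the user branch at current master in one forced update.
# The leading "+" in the refspec permits a non-fast-forward move,
# so whatever the branch pointed to before is simply replaced.
git push -q origin +master:refs/users/hubicka/heads/honza-gcc-benchmark-branch
git ls-remote origin refs/users/hubicka/heads/honza-gcc-benchmark-branch
```

On the real repository the client-side equivalent would be `git push origin +master:refs/users/hubicka/heads/honza-gcc-benchmark-branch` from an up-to-date checkout; whether the server-side hooks accept such a forced update of a user branch is exactly the question of this thread, so treat this as a sketch of the client side only.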
Git rejecting branch merge
Hello, I am trying to update me/honza-gcc-benchmark-branch to current trunk, which I do by deleting it, recreating it locally and pushing out. The problem is that the push fails with: remote: *** The following commit was rejected by your hooks.commit-extra-checker script (status: 1) remote: *** commit: 03e87724864a17e22c9b692cc0caa014e9dba6b1 remote: *** The first line of a commit message should be a short description of the change, not a single word. remote: error: hook declined to update The broken commit is: $ git show 03e87724864a17e22c9b692cc0caa014e9dba6b1 commit 03e87724864a17e22c9b692cc0caa014e9dba6b1 Author: Georg-Johann Lay Date: Tue Jan 14 11:09:38 2020 +0100 Typo. libgcc/ * config/avr/lib1funcs.S (skip): Simplify. I wonder, is there a way to get this done? Thanks, Honza
Re: GCC usage script
> Hey. > > Some of you may know Honza's images about CPU utilization and memory usage > when using LTO [1]. Last week I played with Chromium and Firefox and I wanted > to visualize their utilization. That's why I came up with a new script [2]. > > You can easily wrap a command and the script generates graphs for you:. > The script uses modern psutil library and plots through a great matplotlib: > $ usage-wrapper.py -v -t postgresql 'make -j16' > ... > $ eog usage.svg > > There's a gallery of the collected reports for Firefox and Chromium: > https://gist.github.com/marxin/223890df4d8d8e490b6b2918b77dacad Thanks, it looks nice. I am not sure what the blue line means - if it is the overall usage from vmstat, then I guess you do not calculate the COW of the stream-out processes correctly, and that is why you get such a high peak on them. > > and yes, we have a LTO regression since GCC 9 that I'm going to take closer > to. Where do you see the regression? (Just trying to make sense of the numbers.) For Firefox I see you get around a 16GB streaming memory peak and 32GB for ltranses, which seems kind of acceptable for such large parallelism. Do you have a time report for Chromium WPA? It seems to be stuck on something for quite a while. I would say that a simple time report plus, ideally, a perf profile should make it possible to track this down. Honza > > Martin > > [1] > https://4.bp.blogspot.com/-fvrkYSMBqug/XMsTqg4HEkI/Gl8/8sp1GWv6Oe8tfBfL6aO8Nbq5j3hExurpwCEwYBhgL/s1600/llvm.png > [2] https://github.com/marxin/script-misc/blob/master/usage-wrapper.py
Re: gcc does not reduce the function call to the result if called function is not static when using -O2, only with -O3, clang and msvc do the optimization also with -O2
> gcc does not reduce to call result if called function is not static in > -O2 (will do with -O3) > clang and msvc does it also in -O2 regardless of the function being > static or not > > can someone explain to me why the -O2 optimizer is not able(allowed) to > reduce this small sample the same way as clang/msvc? GCC is optimizing for size here, so it is more careful with inlining. This is because it knows that the function main is called once. If you rename main to something else, or add a loop around the call, then it will inline for speed and do the same work as clang. Clang (and probably MSVC) does not implement the heuristics that certain functions (static constructors, destructors, main and noreturns) are called once, so they probably both optimize for speed. Even when optimizing for size it would be a good idea to inline here; however, the inliner heuristics predict that it is not. This is because at inlining time the compiler does not see that the callee will optimize to a constant. The reason is that you store the temporary values to an array, and those are not tracked. If you used scalar variables, it would be able to constant fold everything early. Handling this would require either recovering early ipa-sra or adding return functions for values passed by reference. Honza
Re: Detect EAF flags in ipa-modref
Hi, this is the updated patch for autodetection of EAF flags. Still the goal is to avoid fancy stuff and get the basic logic in place (so no dataflow, no IPA propagation, no attempts to handle trickier cases). There is one new failure ./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/sso/t2.c -Wno-scalar-storage-order -O1 -fno-inline output pattern test ./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/sso/t2.c -Wno-scalar-storage-order -O2 output pattern test ./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/sso/t2.c -Wno-scalar-storage-order -Os output pattern test Which I believe is a bug exposed by detecting the dump function to be EAF_DIRECT and NOESCAPE (which it is); packing/unpacking the bitfields leads to a one-bit difference. Still no idea why. The patch seems to be quite effective on cc1plus, turning: Alias oracle query stats: refs_may_alias_p: 65808750 disambiguations, 75664890 queries ref_maybe_used_by_call_p: 153485 disambiguations, 66711204 queries call_may_clobber_ref_p: 22816 disambiguations, 28889 queries nonoverlapping_component_refs_p: 0 disambiguations, 36846 queries nonoverlapping_refs_since_match_p: 27271 disambiguations, 58917 must overlaps, 86958 queries aliasing_component_refs_p: 65808 disambiguations, 2067256 queries TBAA oracle: 25929211 disambiguations 60395141 queries 12391384 are in alias set 0 10783783 queries asked about the same object 126 queries asked about the same alias set 0 access volatile 9598698 are dependent in the DAG 1691939 are aritificially in conflict with void * Modref stats: modref use: 14284 disambiguations, 53336 queries modref clobber: 1660281 disambiguations, 2130440 queries 4311165 tbaa queries (2.023603 per modref query) 685304 base compares (0.321673 per modref query) PTA query stats: pt_solution_includes: 959190 disambiguations, 13169678 queries pt_solutions_intersect: 1050969 disambiguations, 13246686 queries Alias oracle query stats: refs_may_alias_p: 66914578 disambiguations, 76692648 queries ref_maybe_used_by_call_p: 244077 disambiguations, 67732086 
queries call_may_clobber_ref_p: 111475 disambiguations, 116613 queries nonoverlapping_component_refs_p: 0 disambiguations, 37091 queries nonoverlapping_refs_since_match_p: 27267 disambiguations, 59019 must overlaps, 87056 queries aliasing_component_refs_p: 65870 disambiguations, 2063394 queries TBAA oracle: 26024415 disambiguations 60579490 queries 12450910 are in alias set 0 10806673 queries asked about the same object 125 queries asked about the same alias set 0 access volatile 9605837 are dependent in the DAG 1691530 are aritificially in conflict with void * Modref stats: modref use: 14272 disambiguations, 277680 queries modref clobber: 1669753 disambiguations, 7849135 queries 4330162 tbaa queries (0.551674 per modref query) 699241 base compares (0.089085 per modref query) PTA query stats: pt_solution_includes: 1833920 disambiguations, 13846032 queries pt_solutions_intersect: 1093785 disambiguations, 13309954 queries So almost twice as many pt_solution_includes disambiguations. I will re-run the stats overnight to be sure that it is not an independent change (but both builds were from almost the same checkout). Bootstrapped/regtested x86_64-linux, OK? (I will analyze the t2.c failure more) Honza gcc/ChangeLog: 2020-11-10 Jan Hubicka * gimple.c: Include ipa-modref-tree.h and ipa-modref.h. (gimple_call_arg_flags): Use modref to determine flags. * ipa-modref.c: Include gimple-ssa.h, tree-phinodes.h, tree-ssa-operands.h, stringpool.h and tree-ssanames.h. (analyze_ssa_name_flags): Declare. (modref_summary::useful_p): Summary is also useful if arg flags are known. (dump_eaf_flags): New function. (modref_summary::dump): Use it. (get_modref_function_summary): Be ready for current_function_decl being NULL. (memory_access_to): New function. (deref_flags): New function. (call_lhs_flags): New function. (analyze_parms): New function. (analyze_function): Use it. * ipa-modref.h (struct modref_summary): Add arg_flags. 
gcc/testsuite/ChangeLog: 2020-11-10 Jan Hubicka * gcc.dg/torture/pta-ptrarith-1.c: Escape parameters. diff --git a/gcc/gimple.c b/gcc/gimple.c index 1afed88e1f1..da90716aa23 100644 --- a/gcc/gimple.c +++ b/gcc/gimple.c @@ -46,6 +46,8 @@ along with GCC; see the file COPYING3. If not see #include "asan.h" #include "langhooks.h" #include "attr-fnspec.h" +#include "ipa-modref-tree.h" +#include "ipa-modref.h" /* All the tuples have their operand vector (if present) at the very bottom @@ -1532,24 +1534,45 @@ int gimple_call_arg_flags (const gcall *stmt, unsigned arg)
Definition of EAF_NOESCAPE and fnspec strings
Hi, I implemented simple propagation of EAF arguments for ipa-modref (that is not hard to do). My main aim was to detect cases where the THIS parameter does not escape but is used to read/write the pointed-to memory. This is meant to avoid the poor code produced when we e.g. offline destructors on a cold path. Unfortunately, detecting EAF_NOESCAPE for such parameters breaks. This is already visible on memcpy if we let it be handled via its fnspec attribute. If I disable the special handling in structalias: diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c index a4832b75436..2b614913b57 100644 --- a/gcc/tree-ssa-structalias.c +++ b/gcc/tree-ssa-structalias.c @@ -4389,7 +4389,6 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t) case BUILT_IN_STRCPY: case BUILT_IN_STRNCPY: case BUILT_IN_BCOPY: - case BUILT_IN_MEMCPY: case BUILT_IN_MEMMOVE: case BUILT_IN_MEMPCPY: case BUILT_IN_STPCPY: In the following testcase we miss the fact that memcpy may merge the PTA set of i4p into that of i3p. extern void exit (int); extern void abort (void); int size = 8; int main (void) { int i3 = -1, i4 = 55; int *i3p = &i3; int *i4p = &i4; memcpy (&i4p, &i3p, size); if (i3p != i4p) abort (); exit (0); } This seems to be documented to some degree: /* Add *tem = nonlocal, do not add *tem = callused as EAF_NOESCAPE parameters do not escape to other parameters and all other uses appear in NONLOCAL as well. */ But it also means that some of our FNSPECs are wrong now. I wonder if we can split this property into two different flags, like EAF_NOESCAPE and EAF_INDIRECT_NOESCAPE? Auto-detecting EAF_INDIRECT_NOESCAPE seems a bit harder, at least assuming that any read can actually load a few bits of a pointer possibly written there, because if a simple member function accesses values pointed to by THIS and returns them, we lose track of whether the returned value is an escape point, I would say. 
The difference in constraints are: --- good/a.c.035t.ealias2020-11-08 13:34:04.397499515 +0100 +++ a.c.035t.ealias 2020-11-08 13:39:15.483438469 +0100 @@ -106,7 +106,15 @@ size = NONLOCAL size.0_1 = size _2 = size.0_1 -i3p = i4p +callarg(18) = +callarg(18) = callarg(18) + UNKNOWN +CALLUSED(16) = callarg(18) +CALLCLOBBERED(17) = callarg(18) +*callarg(18) = NONLOCAL +callarg(19) = +callarg(19) = callarg(19) + UNKNOWN +CALLUSED(16) = callarg(19) +ESCAPED = _2 i3p.1_3 = i3p i4p.2_4 = i4p ESCAPED = ... Call clobber information -ESCAPED, points-to NULL, points-to vars: { } +ESCAPED, points-to non-local, points-to NULL, points-to vars: { } Flow-insensitive points-to information -i3p.1_3, points-to NULL, points-to vars: { D.1947 D.1948 } +i3p.1_3, points-to non-local, points-to escaped, points-to NULL, points-to vars: { D.1947 } i4p.2_4, points-to NULL, points-to vars: { D.1948 } main () Honza
Re: Detect EAF flags in ipa-modref
>
> Alias oracle query stats:
>   refs_may_alias_p: 65808750 disambiguations, 75664890 queries
>   ref_maybe_used_by_call_p: 153485 disambiguations, 66711204 queries
>   call_may_clobber_ref_p: 22816 disambiguations, 28889 queries
>   nonoverlapping_component_refs_p: 0 disambiguations, 36846 queries
>   nonoverlapping_refs_since_match_p: 27271 disambiguations, 58917 must overlaps, 86958 queries
>   aliasing_component_refs_p: 65808 disambiguations, 2067256 queries
>   TBAA oracle: 25929211 disambiguations 60395141 queries
>     12391384 are in alias set 0
>     10783783 queries asked about the same object
>     126 queries asked about the same alias set
>     0 access volatile
>     9598698 are dependent in the DAG
>     1691939 are artificially in conflict with void *
>
> Modref stats:
>   modref use: 14284 disambiguations, 53336 queries
>   modref clobber: 1660281 disambiguations, 2130440 queries
>   4311165 tbaa queries (2.023603 per modref query)
>   685304 base compares (0.321673 per modref query)
>
> PTA query stats:
>   pt_solution_includes: 959190 disambiguations, 13169678 queries
>   pt_solutions_intersect: 1050969 disambiguations, 13246686 queries

I re-ran mainline without the EAF flag propagation and the quoted results are correct, so the difference is not due to an independent change pushed between my tests.

Honza
Re: Detect EAF flags in ipa-modref
> Bootstrapped/regtested x86_64-linux, OK? > (I will analyze more the t2.c failure) I have found independent reproducer that is now in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97775 Honza
Re: Detect EAF flags in ipa-modref
> > Bootstrapped/regtested x86_64-linux, OK? > > (I will analyze more the t2.c failure) > > I have found independent reproducer that is now in > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97775 ... and with Jakub's fix the testcase works now. Honza > > Honza
Re: Detect EAF flags in ipa-modref
Hi,
here is the updated patch.

Honza

Bootstrapped/regtested x86_64-linux, OK (after the fnspec fixes)?

2020-11-10  Jan Hubicka

	* gimple.c: Include ipa-modref-tree.h and ipa-modref.h.
	(gimple_call_arg_flags): Use modref to determine flags.
	* ipa-modref.c: Include gimple-ssa.h, tree-phinodes.h,
	tree-ssa-operands.h, stringpool.h and tree-ssanames.h.
	(analyze_ssa_name_flags): Declare.
	(modref_summary::useful_p): Summary is also useful if arg
	flags are known.
	(dump_eaf_flags): New function.
	(modref_summary::dump): Use it.
	(get_modref_function_summary): Be ready for current_function_decl
	being NULL.
	(memory_access_to): New function.
	(deref_flags): New function.
	(call_lhs_flags): New function.
	(analyze_parms): New function.
	(analyze_function): Use it.
	* ipa-modref.h (struct modref_summary): Add arg_flags.
	* doc/invoke.texi (ipa-modref-max-depth): Document.
	* params.opt (ipa-modref-max-depth): New param.

gcc/testsuite/ChangeLog:

2020-11-10  Jan Hubicka

	* gcc.dg/torture/pta-ptrarith-1.c: Escape parameters.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d2a188d7c75..0bd76d2841e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12953,6 +12953,10 @@
 memory locations using the mod/ref information.  This parameter ought to be
 bigger than @option{--param ipa-modref-max-bases} and @option{--param
 ipa-modref-max-refs}.
 
+@item ipa-modref-max-depth
+Specifies the maximum depth of DFS walk used by modref escape analysis.
+Setting it to 0 disables the analysis completely.
+
 @item profile-func-internal-id
 A parameter to control whether to use function internal id in profile
 database lookup. If the value is 0, the compiler uses an id that

diff --git a/gcc/gimple.c b/gcc/gimple.c
index 1afed88e1f1..da90716aa23 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -46,6 +46,8 @@ along with GCC; see the file COPYING3.
If not see #include "asan.h" #include "langhooks.h" #include "attr-fnspec.h" +#include "ipa-modref-tree.h" +#include "ipa-modref.h" /* All the tuples have their operand vector (if present) at the very bottom @@ -1532,24 +1534,45 @@ int gimple_call_arg_flags (const gcall *stmt, unsigned arg) { attr_fnspec fnspec = gimple_call_fnspec (stmt); - - if (!fnspec.known_p ()) -return 0; - int flags = 0; - if (!fnspec.arg_specified_p (arg)) -; - else if (!fnspec.arg_used_p (arg)) -flags = EAF_UNUSED; - else + if (fnspec.known_p ()) { - if (fnspec.arg_direct_p (arg)) - flags |= EAF_DIRECT; - if (fnspec.arg_noescape_p (arg)) - flags |= EAF_NOESCAPE; - if (fnspec.arg_readonly_p (arg)) - flags |= EAF_NOCLOBBER; + if (!fnspec.arg_specified_p (arg)) + ; + else if (!fnspec.arg_used_p (arg)) + flags = EAF_UNUSED; + else + { + if (fnspec.arg_direct_p (arg)) + flags |= EAF_DIRECT; + if (fnspec.arg_noescape_p (arg)) + flags |= EAF_NOESCAPE; + if (fnspec.arg_readonly_p (arg)) + flags |= EAF_NOCLOBBER; + } +} + tree callee = gimple_call_fndecl (stmt); + if (callee) +{ + cgraph_node *node = cgraph_node::get (callee); + modref_summary *summary = node ? get_modref_function_summary (node) + : NULL; + + if (summary && summary->arg_flags.length () > arg) + { + int modref_flags = summary->arg_flags[arg]; + + /* We have possibly optimized out load. Be conservative here. */ + if (!node->binds_to_current_def_p ()) + { + if ((modref_flags & EAF_UNUSED) && !(flags & EAF_UNUSED)) + modref_flags &= ~EAF_UNUSED; + if ((modref_flags & EAF_DIRECT) && !(flags & EAF_DIRECT)) + modref_flags &= ~EAF_DIRECT; + } + flags |= modref_flags; + } } return flags; } diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c index 3f46bebed3c..30e76580fb0 100644 --- a/gcc/ipa-modref.c +++ b/gcc/ipa-modref.c @@ -61,6 +61,15 @@ along with GCC; see the file COPYING3. 
If not see #include "ipa-fnsummary.h" #include "attr-fnspec.h" #include "symtab-clones.h" +#include "gimple-ssa.h" +#include "tree-phinodes.h" +#include "tree-ssa-operands.h" +#include "ssa-iterators.h" +#include "stringpool.h" +#include "tree-ssanames.h" + +static int analyze_ssa_name_flags (tree name, + vec _flags, int depth); /* We record fnspec specifiers for call edges since they depends on actual gimple statements. */ @@ -186,6 +195,8 @@ modref_summary::useful_p (int ecf_flags) { if (ecf_flags & (ECF_CONST | ECF_NOVOPS)) return
Re: Detect EAF flags in ipa-modref
> > + tree callee = gimple_call_fndecl (stmt); > > + if (callee) > > +{ > > + cgraph_node *node = cgraph_node::get (callee); > > + modref_summary *summary = node ? get_modref_function_summary (node) > > + : NULL; > > + > > + if (summary && summary->arg_flags.length () > arg) > > So could we make modref "transform" push this as fnspec attribute or > would that not really be an optimization? It was my original plan to synthetize fnspecs, but I think it is not very good idea: we have the summary readily available and we can represent information that fnspecs can't (do not have artificial limits on number of parameters or counts) I would preffer fnspecs to be used only for in-compiler declarations. > > + > > +/* Analyze EAF flags for SSA name NAME. > > + KNOWN_FLAGS is a cache for flags we already determined. > > + DEPTH is a recursion depth used to make debug output prettier. */ > > + > > +static int > > +analyze_ssa_name_flags (tree name, vec *known_flags, int depth) > > C++ has references which makes the access to known_flags nicer ;) Yay, will chang that :) > > > +{ > > + imm_use_iterator ui; > > + gimple *use_stmt; > > + int flags = EAF_DIRECT | EAF_NOCLOBBER | EAF_NOESCAPE | EAF_UNUSED; > > + > > + /* See if value is already computed. */ > > + if ((*known_flags)[SSA_NAME_VERSION (name)]) > > +{ > > + /* Punt on cycles for now, so we do not need dataflow. */ > > + if ((*known_flags)[SSA_NAME_VERSION (name)] == 1) > > + { > > + if (dump_file) > > + fprintf (dump_file, > > +"%*sGiving up on a cycle in SSA graph\n", depth * 4, ""); > > + return 0; > > + } > > + return (*known_flags)[SSA_NAME_VERSION (name)] - 2; > > +} > > + /* Recursion guard. */ > > + (*known_flags)[SSA_NAME_VERSION (name)] = 1; > > This also guards against multiple evaluations of the same stmts > but only in some cases? Consider > > _1 = ..; > _2 = _1 + _3; > _4 = _1 + _5; > _6 = _2 + _4; > > where we visit _2 = and _4 = from _1 but from both are going > to visit _6. 
Here we first push _6, then we go for _2, then for _1. We evaluate _1, evaluate _2, go for _4 and evaluate _4, and finally evaluate _6. It is a DFS, and you need a backward edge in the DFS (coming from a PHI) to hit the cycle guard. Cycles seem to matter somewhat for GCC: we do have a lot of functions that walk linked lists that we could otherwise track.

>
> Maybe I'm blind but you're not limiting depth? Guess that asks
> for problems, esp. as you are recursing rather than using a
> worklist or so?
>
> I see you try to "optimize" the walk by only visiting def->use
> links from parameters but then a RPO walk over all stmts would
> be simpler iteration-wise ...

We usually evaluate just a small part of bigger functions (since we lose track quite easily after hitting the first memory store). My plan was to change this to actual dataflow once we have it well defined (that is, after discussing the EAF flags with you, and after adding the logic to track call sites for the true IPA pass, which mildly complicates things: for every SSA name I track the callsite/argument pair it is passed to, either directly or indirectly. This is then translated into a call summary and used by the IPA pass to compute the final flags). I guess I can add --param ipa-modref-walk-depth for now and handle dataflow incrementally? In particular I am not sure if I should just write the iterated RPO myself or use tree-ssa-propagate.h (the second may be overkill).

>
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file,
> > +	       "%*sAnalyzing flags of ssa name: ", depth * 4, "");
> > +      print_generic_expr (dump_file, name);
> > +      fprintf (dump_file, "\n");
> > +    }
> > +
> > +  FOR_EACH_IMM_USE_STMT (use_stmt, ui, name)
> > +    {
> > +      if (flags == 0)
> > +	{
> > +	  BREAK_FROM_IMM_USE_STMT (ui);
> > +	}
> > +      if (is_gimple_debug (use_stmt))
> > +	continue;
> > +      if (dump_file)
> > +	{
> > +	  fprintf (dump_file, "%*s  Analyzing stmt:", depth * 4, "");
> > +	  print_gimple_stmt (dump_file, use_stmt, 0);
> > +	}
> > +
> > +      /* Gimple return may load the return value.
*/ > > + if (gimple_code (use_stmt) == GIMPLE_RETURN) > > if (greturn *ret = dyn_cast (use_stmt)) > > makes the as_a below not needed, similar for the other cases (it > also makes accesses cheaper, avoiding some checking code). Looks cleaner indeed. > > > + { > > + if (memory_access_to (gimple_return_retval > > + (as_a (use_stmt)), name)) > > + flags &= ~EAF_UNUSED; > > + } > > + /* Account for LHS store, arg loads and flags from callee function. > > */ > > + else if (is_gimple_call (use_stmt)) > > + { > > + tree callee = gimple_call_fndecl (use_stmt); > > + > > + /* Recursion would require bit of propagation; give up for now. */ > > + if (callee && recursive_call_p (current_function_decl, callee)) > > + flags = 0; > > + else > > +
Re: Detect EAF flags in ipa-modref
> > Richi, I think we can add "safe" parameter to gimple_call_arg_flags and
> > bypass this logic when we use it for warnings only (having body that
> > does not use the value is quite strong hint that it is unused by the
> > function).
>
> Eh, please not.

OK, I do not care that much, as long as we do not have false positives everywhere :) Handling EAF_UNUSED and CONST functions is necessary so we do not get false positives caused by us optimizing code out. In this case of not trusting the EAF_UNUSED flag we will not optimize, so I do not really care.

Martin, we collected very many warnings when building with configure --with-build-config=bootstrap-lto.mk. This patch fixes some of them, but there are many others; can you take a look? For the testcase in the PR I think it is enough to use default_is_empty_type to disable the warning, but it is not clear to me why the code uses default_is_empty_record in the first place.

> > I played with a bit more testcases and found that we also want to disable
> > the warning for const functions and sometimes the EAF_UNUSED flag is dropped
> > because of a clobber that is not necessary to do. If a function only clobbers
> > the target it can be considered unused past inlining.
> >
> > I am testing this improved patch and plan to commit if there are no
> > complaints, but still we need to handle binds_to_current_def.
> >
> > In the other direction, Martin, I think we may also warn for args
> > that are !EAF_UNUSED and !EAF_NOCLOBBER. This will catch some cases
> > where the user did not add the "const" specifier to the declaration but
> > the parameter is detected to be readonly.
> >
> > I also noticed that we do not detect EAF_UNUSED for fully unused
> > parameters (with no SSA names associated with them). I will fix that
> > incrementally.
>
> Make sure to not apply it based on that reason to aggregates ;)

Sure, we already have detection of unused params in ipa-prop, so I guess we want is_gimple_reg (parm) && !default_def to imply EAF_UNUSED.

Honza
Re: Detect EAF flags in ipa-modref
> On Nov 15 2020, Jan Hubicka wrote:
>
> >> See PR97840.
> > Thanks,
> > this is a false positive where we fail to discover that the pointed-to type
> > is empty on non-x86_64 targets. This is triggered by better alias
> > analysis caused by the non-escape discovery.
> >
> > While this is not a full fix (I hope someone with more experience on
> > C++ types and warnings will step up) I think it may be useful to avoid
> > warning on unused parameters.
> >
> > Bootstrapped/regtested x86_64-linux, OK?
>
> That doesn't fix anything.

It does silence the warning if you remove "inline" from the function declaration (as creduce did while minimizing the testcase; the minimized testcase was undefined, which is why I did not include it in the end). I have now implemented one by hand.

The reason is that gimple_call_arg_flags clears EAF_UNUSED on symbols that are !binds_to_current_def_p, because we are worried that the symbol will be interposed by a different version of the body, with the same semantics, that will actually read the arg. This is a bit of a paranoid check, since we optimize things like *a == *a early, but with clang *a will trap if a == NULL.

Richi, I think we can add a "safe" parameter to gimple_call_arg_flags and bypass this logic when we use it for warnings only (having a body that does not use the value is quite a strong hint that it is unused by the function).

I played with a bit more testcases and found that we also want to disable the warning for const functions, and that sometimes the EAF_UNUSED flag is dropped because of a clobber that is not necessary to do. If a function only clobbers the target, it can be considered unused past inlining.

I am testing this improved patch and plan to commit it if there are no complaints, but we still need to handle binds_to_current_def.

In the other direction, Martin, I think we may also warn for args that are !EAF_UNUSED and !EAF_NOCLOBBER. This will catch some cases where the user did not add the "const" specifier to the declaration but the parameter is detected to be readonly.
I also noticed that we do not detect EAF_UNUSED for fully unused parameters (with no SSA names associated with them). I will fix that incrementally. Honza PR middle-end/97840 * ipa-modref.c (analyze_ssa_name_flags): Skip clobbers if inlining is done. * tree-ssa-uninit.c (maybe_warn_pass_by_reference): Make stmt gcall; skip const calls and unused arguments. (warn_uninitialized_vars): Update prototype. gcc/testsuite/ChangeLog: 2020-11-16 Jan Hubicka * g++.dg/warn/unit-2.C: New test. diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c index 4a43c50aa66..08fcc36a321 100644 --- a/gcc/ipa-modref.c +++ b/gcc/ipa-modref.c @@ -1333,7 +1331,14 @@ analyze_ssa_name_flags (tree name, vec _flags, int depth) /* Handle *name = exp. */ else if (assign && memory_access_to (gimple_assign_lhs (assign), name)) - flags &= ~(EAF_UNUSED | EAF_NOCLOBBER); + { + /* In general we can not ignore clobbers because they are +barriers for code motion, however after inlining it is safe to +do because local optimization passes do not consider clobbers +from other functions. Similar logic is in ipa-pure-const.c. */ + if (!cfun->after_inlining && !gimple_clobber_p (assign)) + flags &= ~(EAF_UNUSED | EAF_NOCLOBBER); + } /* ASM statements etc. 
*/
      else if (!assign)
	{

diff --git a/gcc/testsuite/g++.dg/warn/unit-2.C b/gcc/testsuite/g++.dg/warn/unit-2.C
new file mode 100644
index 000..30f3ceae191
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/unit-2.C
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -Wmaybe-uninitialized" } */
+struct a {int a;};
+__attribute__ ((noinline))
+void
+nowarn (const struct a *ptr)
+{
+  if (ptr)
+    asm volatile ("");
+}
+void
+test()
+{
+  struct a ptr;
+  nowarn (&ptr);
+}
+__attribute__ ((noinline))
+int
+nowarn2 (const struct a *ptr, const struct a ptr2)
+{
+  return ptr != 0 || ptr2.a;
+}
+int mem;
+int
+test2()
+{
+  struct a ptr,ptr2={0};
+  return nowarn2 (&ptr, ptr2);
+}

diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c
index f23514395e0..c94831bfb75 100644
--- a/gcc/tree-ssa-uninit.c
+++ b/gcc/tree-ssa-uninit.c
@@ -443,7 +443,7 @@ maybe_warn_operand (ao_ref &ref, gimple *stmt, tree lhs, tree rhs,
    access implying read access to those objects.  */
 
 static void
-maybe_warn_pass_by_reference (gimple *stmt, wlimits &wlims)
+maybe_warn_pass_by_reference (gcall *stmt, wlimits &wlims)
 {
   if (!wlims.wmaybe_uninit)
     return;
@@ -457,6 +457,10 @@
   if (!fntype)
     return;
 
+  /* Const functions do not read their arguments.  */
+  if (gimple_call_flags (stmt) & ECF_CONST)
+    return;
+
   const built_in_
Re: Detect EAF flags in ipa-modref
> On 11/16/20 1:44 PM, Jan Hubicka wrote:
> > Martin, we collected very many warnings when building with
> > configure --with-build-config=bootstrap-lto.mk
> > This patch fixes some of them, but there are many others, can you take a
> > look?
>
> Hello.
>
> I guess you mean Martin Jambor, or me? Please CC :)

Martin Sebor, added to CC (since he did most of the work on tree-ssa-uninit recently). I have now filed some PRs on the issues.

Honza

>
> Martin
Re: Definition of EAF_NOESCAPE and fnspec strings
> > But it also means that some of our FNSPECs are wrong now. I wonder if we
> > can split this property to two different flags like EAF_NOESCAPE and
> > EAF_INDIRECT_NOESCAPE?
>
> Note that EAF_NOESCAPE allows "escaping" to the return value (see
> handle_rhs_call). I guess for simplicity we could allow escaping

I see, missed that!

> of not escaped but USED params to other written to params. I think
> we also don't handle "escaping" of a parameter indirectly to itself, thus
>
> int i;
> int *p = &i;
> int **q = &p;
> foo (&q);
> if (*q != &i)
>   abort ();
>
> and
>
> foo (int ***p)
> {
>   *p = **p;
> }
>
> or so with foo's param EAF_NOESCAPE (but not EAF_DIRECT).
>
> Splitting up EAF_NOESCAPE makes it quite difficult to understand.
> Arguably explicit handling of memcpy and friends _does_ pay off
> for points-to analysis since I'd say modelling all possible and useful
> things in fnspec would make it a monster ... you'd basically want
> to have a way to specify additional constraints in the fnspec itself,
> like *1 = *2, but then also distinguish points-to effects from
> may-alias ones.

Yep, I am not arguing for eliminating the special case of memcpy (because we have the additional info that it only copies pointers from *src to *dest).

However, I find the current definition of EAF_NOESCAPE a bit hard to handle in modref, since it is naturally quite reliable to track all uses of the SSA names that correspond to parameters, but it is harder to track where values read from pointed-to memory can eventually go. What I do is walk all uses of an SSA_NAME that corresponds to a parameter and try to understand each of them.
If it is one of:

 1) a memory load via dereference of name
 2) a memory store where name is the base of the LHS
 3) a memory store where name is the RHS
 4) name is passed as a value to a function
 5) a dereferenced value from name is passed to a function
 6) used as a value in a normal gimple expression
 7) used in a return (as the RHS or as the base of a memory dereference address)
 8) it is used only in a reference but not as a base (as an array index or so)

then I merge in (AND) flags I determine as follows:

For 1) I clear EAF_UNUSED and recurse to the LHS name. Based on its flags I decide on:
 - EAF_DIRECT (if the LHS has EAF_UNUSED),
 - EAF_NOCLOBBER (if the LHS has EAF_NOCLOBBER),
 - EAF_NOESCAPE (if the LHS has EAF_NOESCAPE).

For 2) I clear the EAF_NOCLOBBER and EAF_UNUSED flags.

For 3) I give up (clear all flags), since I would need to track where the memory is going.

For 4) I determine the flags of the function parameter.

For 5) I need to do the same handling as in 1), where the flag of the "loaded value" is the flag of the function's parameter.

For 6) I determine the flags of the LHS and merge them in.

For 7) I clear NOESCAPE if the RHS is name itself, and UNUSED + NOESCAPE if the RHS is a dereference from name.

For 8) I do nothing. Here the names are non-pointers that I track because of an earlier dereference.

So I think 7) can be relaxed. The main problem is, however, that we often see 1) and then 3) or 7) on the LHS, which makes us punt very often.

The fact that the pointer itself does not escape, even though the pointed-to memory can, still seems very useful, since one does not need to add *ptr to points-to sets. But I will try relaxing 7).

If we allow values escaping to other parameters and to the parameter itself, I think I can relax 3) if the base of the store is a default def of a PARM_DECL.

> I wonder if we should teach the GIMPLE FE to parse 'fn spec'
> so we can write unit tests for the attribute ... or maybe simply
> add this to the __GIMPLE spec string.

May be nice, and we should also describe carefully that NOESCAPE and NOCLOBBER also refer to indirect references. The current description "Nonzero if the argument does not escape."
reads to me that it is about ptr itself, not about *ptr and also it does not speak of the escaping to return value etc. Honza
Detect EAF flags in ipa-modref
> > > > Yep, i am not arguing for eliminating special case of memcpy (because we > > have the additional info that it only copies pointers from *src to > > *dest). > > > > However I find current definition of EAF_NOESCAPE bit hard to handle in > > modref, since naturally it is quite reliable to track all uses of ssa > > name that correspond to parameters, but it is harder to track where > > values read from pointed-to memory can eventually go. > > Yeah - also the fnspec of memcpy _is_ wrong at the moment ... Yep. > > > > For 6) I determine flags of LHS and merge them in > > guess you could track a SSA lattice "based on parameter N" which > would need to be a mask (thus only track up to 32 parameters?) Yes, I can do that if we relax NOESCAPE this way. > > > For 7) I clear NOESCAPE if rhs is name itself > > and UNUSED + NOESCAPE if rhs is derefernece from name. > > > > For 8) I do nothing. Here the names are non-pointers that I track > > because of earlier dereference. > > > > > > > > So I think 7) can be relaxed. Main problem is hoever that we often see 1) > > and then 3) or 7) on LHS that makes us punt very often. > > > > The fact that pointer directly does not escape but pointed to memory can > > seems still very useful since one does not need to add *ptr to points-to > > sets. But I will try relaxing 7). > > > > If we allow values escaping to other parameters and itself, I think I > > can relax 3) if base of the store is default def of PARM_DECL. > > I think the important part is to get things correct. Maybe it's worth Indeed :) > to add write/read flags where the argument _does_ escape in case the > function itself is otherwise pure/const. For PTA that doesn't make > a difference (and fnspec was all about PTA ...) but for alias-analysis > it does. I detect them independently (UNUSED/NOCLOBBER flags which is not perfect since we do not have OUTPUT flag like in fnspecs), but currently they are unused since we do not track "pure/const except for known exceptions". 
This is not hard to add. > > > > > > > I wonder if we should teach the GIMPLE FE to parse 'fn spec' > > > so we can write unit tests for the attribute ... or maybe simply > > > add this to the __GIMPLE spec string. > > > > May be nice and also describe carefully that NOESCAPE and NOCLOBBER also > > reffers to indirect references. Current description > > "Nonzero if the argument does not escape." > > reads to me that it is about ptr itself, not about *ptr and also it does > > not speak of the escaping to return value etc. > > Well, if 'ptr' escapes then obvoiously all memory reachable from 'ptr' > escapes - escaping is always transitive. Yes, but if values pointed to by ptr escapes, ptr itself does not need to escape. This is easy to detect (and is common case of THIS pointer) but we have no way to express it via EAF flags. > > And as escaping is in the context of the caller sth escaping to the > caller itself (via return) can hardly be considered escaping (again > this was designed for PTA ...). > > I guess it makes sense to be able to separate escaping from the rest. I think current definition (escaping via return is not an escape) is also OK for modref propagation. We may have EAF_NORETURN that says that value never escapes to return value that would be also easy to detect. This is kind of minimal patch for the EAF flags discovery. It works only in local ipa-modref and gives up on cyclic SSA graphs. Adding propagation is easy and proper IPA mode needs collecting call sites+arg pairs that affect the answer. It passes testuite except for sso/t2.c testcase where it affects bit of first dumped structure R1. We correctly determine that it is noescape/noclobber from the dump function and it seems that this triggers kind of strange SRA. I will look into it. I am running full bootstrap/regtest. On tramp3d the effect is not great, but it does something. 
PTA query stats:
  pt_solution_includes: 397269 disambiguations, 606922 queries
  pt_solutions_intersect: 138302 disambiguations, 416884 queries

to

PTA query stats:
  pt_solution_includes: 401540 disambiguations, 609093 queries
  pt_solutions_intersect: 138616 disambiguations, 417174 queries

2020-11-09  Jan Hubicka

	* builtins.c (builtin_fnspec): Fix fnspecs of memcpy and friends.
	* gimple.c: Include ipa-modref-tree.h and ipa-modref.h.
	(gimple_call_arg_flags): Use modref to determine flags.
	* ipa-modref.c: Include gimple-ssa.h, tree-phinodes.h,
	tree-ssa-operands.h, stringpool.h and tree-ssanames.h.
	(modref_summary::useful_p): Summary is also useful if EAF flags
	are known
Re: Detect EAF flags in ipa-modref
> This breaks bootstrap with go.
>
> ../../gcc/go/gofrontend/go-diagnostics.cc: In function 'std::string
> expand_message(const char*, va_list)':
> ../../gcc/go/gofrontend/go-diagnostics.cc:110:61: error: '' may be
> used uninitialized [-Werror=maybe-uninitialized]
>   110 |    "memory allocation failed in vasprintf");
>       |                                           ^

Well, this may simply be a false positive (except that the warning is clearly mislocated and cryptic). I was bootstrapping with go just a few minutes ago, so I wonder what configure flags you use?

Honza

>
> Andreas.
>
> --
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
> "And now for something completely different."
Re: Detect EAF flags in ipa-modref
> On Fri, Nov 13, 2020 at 12:07 AM Richard Biener wrote:
> >
> > On Tue, 10 Nov 2020, Jan Hubicka wrote:
> >
> > > Hi,
> > > here is updated patch.
> > >
> > > Honza
> > >
> > > Bootstrapped/regtested x86_64-linux, OK (after the fnspec fixes)?
> >
> > OK.
> >
> > Thanks,
> > Richard.
>
> This caused:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97836

I have pushed the following fix as obvious. The problem is caused by the fact that I made ipa-modref set the EAF_UNUSED flag for parameters that are used as scalars but not used to read memory, since that suits the use of the flag in tree-ssa-alias and mostly tree-ssa-structalias, with the exception of the rule about escaping to the return value. On Monday I will discuss with Richard whether we want to adjust EAF_UNUSED or invent EAF_NOREAD, but this should prevent the wrong-code issues.

Honza

2020-11-15  Jan Hubicka

	PR ipa/97836
	* ipa-modref.c (analyze_ssa_name_flags): Make return clear the
	EAF_UNUSED flag.

gcc/testsuite/ChangeLog:

2020-11-15  Jan Hubicka

	* gcc.c-torture/execute/pr97836.c: New test.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 5273c200f00..4a43c50aa66 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1224,10 +1224,12 @@ analyze_ssa_name_flags (tree name, vec _flags, int depth)
	  print_gimple_stmt (dump_file, use_stmt, 0);
	}
 
-      /* Gimple return may load the return value.  */
+      /* Gimple return may load the return value.
+	 Returning name counts as a use by tree-ssa-structalias.c.  */
       if (greturn *ret = dyn_cast <greturn *> (use_stmt))
	{
-	  if (memory_access_to (gimple_return_retval (ret), name))
+	  if (memory_access_to (gimple_return_retval (ret), name)
+	      || name == gimple_return_retval (ret))
	    flags &= ~EAF_UNUSED;
	}
       /* Account for LHS store, arg loads and flags from callee function.
*/

diff --git a/gcc/testsuite/gcc.c-torture/execute/pr97836.c b/gcc/testsuite/gcc.c-torture/execute/pr97836.c
new file mode 100644
index 000..4585e1fc69d
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr97836.c
@@ -0,0 +1,17 @@
+int a;
+
+int b(int c) { return 0; }
+
+static int *d(int *e) {
+  if (a) {
+    a = a && b(*e);
+  }
+  return e;
+}
+
+int main() {
+  int f;
+  if (d(&f) != &f)
+    __builtin_abort();
+  return 0;
+}
Re: Detect EAF flags in ipa-modref
> Hi Jan, > > >> This breaks bootstrap with go. > >> > >> ../../gcc/go/gofrontend/go-diagnostics.cc: In function 'std::string > >> expand_message(const char*, va_list)': > >> ../../gcc/go/gofrontend/go-diagnostics.cc:110:61: error: '' may > >> be used uninitialized [-Werror=maybe-uninitialized] > >> 110 | "memory allocation failed in vasprintf"); > >> | ^ > > > > Well, this may be simply a false positive (except that the warning is > > clearly mislocalted and cryptic) I was bootstrapping with go just few > > minutes ago, so I wonder what configure flags you use? > > I'm seeing the same on both i386-pc-solaris2.11 and > sparc-sun-solaris2.11, so I don't think there are special configure > flags involved. When I configure with ../configure --enable-languages=go my build eventually finishes fine on curren trunk: /aux/hubicka/trunk-git/build-go/./gcc/gccgo -B/aux/hubicka/trunk-git/build-go/./gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ -B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem /usr/local/x86_64-pc-linux-gnu/include -isystem /usr/local/x86_64-pc-linux-gnu/sys-include -g -O2 -I ../x86_64-pc-linux-gnu/libgo -static-libstdc++ -static-libgcc -L ../x86_64-pc-linux-gnu/libgo -L ../x86_64-pc-linux-gnu/libgo/.libs -o cgo ../../gotools/../libgo/go/cmd/cgo/ast.go ../../gotools/../libgo/go/cmd/cgo/doc.go ../../gotools/../libgo/go/cmd/cgo/gcc.go ../../gotools/../libgo/go/cmd/cgo/godefs.go ../../gotools/../libgo/go/cmd/cgo/main.go ../../gotools/../libgo/go/cmd/cgo/out.go ../../gotools/../libgo/go/cmd/cgo/util.go zdefaultcc.go ../x86_64-pc-linux-gnu/libgo/libgotool.a make[2]: Leaving directory '/aux/hubicka/trunk-git/build-go/gotools' make[1]: Leaving directory '/aux/hubicka/trunk-git/build-go' On Debian SLES machines. Having preprocessed go-diagnostics.cc and .uninit1 dump would probably help. 
These are often false positives, but I saw similar warnings caused by wrong EAF_UNUSED flag discovery which led to init code being optimized out before inlining, so I would like to look into the dumps and figure out if that is uninit acting funny or there is some real problem. Honza > > Rainer > > -- > - > Rainer Orth, Center for Biotechnology, Bielefeld University
Re: Detect EAF flags in ipa-modref
> Hi Jan,
>
> >> I'm seeing the same on both i386-pc-solaris2.11 and
> >> sparc-sun-solaris2.11, so I don't think there are special configure
> >> flags involved.
> >
> > When I configure with ../configure --enable-languages=go my build
> > eventually finishes fine on current trunk:
> [...]
> > On Debian SLES machines. Having preprocessed go-diagnostics.cc and
> > .uninit1 dump would probably help.
>
> apart from go/diagnostics.cc, I see the warning several times in
> libstdc++, but only for the 32-bit multilib.
>
> Try building either for i686-pc-linux-gnu or a bi-arch
> x86_64-pc-linux-gnu build.

My build is bi-arch. I will check the libstdc++ warnings, but we had some such warnings there previously too (plus we have tons of uninitialized warnings with LTO bootstrap). This really may depend on the glibc headers and on how the string functions get expanded. This is why I think having the preprocessed diagnostics.cc should help.

Honza

>
> Rainer
>
> --
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University
Re: Detect EAF flags in ipa-modref
> See PR97840. Thanks, this is a false positive where we fail to discover that the pointed-to type is empty on non-x86_64 targets. It is triggered by better alias analysis enabled by the non-escape discovery. While this is not a full fix (I hope someone with more experience with C++ types and warnings will step up) I think it is useful to avoid warning on unused parameters. Bootstrapped/regtested x86_64-linux, OK?

	PR middle-end/97840
	* tree-ssa-uninit.c (maybe_warn_pass_by_reference): Update prototype;
	silence warning on EAF_UNUSED parameters.
	(warn_uninitialized_vars): Update.

diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c
index f23514395e0..1e074793b02 100644
--- a/gcc/tree-ssa-uninit.c
+++ b/gcc/tree-ssa-uninit.c
@@ -443,7 +443,7 @@ maybe_warn_operand (ao_ref &ref, gimple *stmt, tree lhs, tree rhs,
    access implying read access to those objects.  */
 
 static void
-maybe_warn_pass_by_reference (gimple *stmt, wlimits &wlims)
+maybe_warn_pass_by_reference (gcall *stmt, wlimits &wlims)
 {
   if (!wlims.wmaybe_uninit)
     return;
@@ -501,6 +501,10 @@ maybe_warn_pass_by_reference (gimple *stmt, wlimits &wlims)
	  && !TYPE_READONLY (TREE_TYPE (argtype)))
	continue;
 
+      /* Ignore args we are not going to read from.  */
+      if (gimple_call_arg_flags (stmt, argno - 1) & EAF_UNUSED)
+	continue;
+
       if (save_always_executed && access->mode == access_read_only)
	/* Attribute read_only arguments imply read access.  */
	wlims.always_executed = true;
@@ -639,8 +643,8 @@ warn_uninitialized_vars (bool wmaybe_uninit)
       if (gimple_vdef (stmt))
	wlims.vdef_cnt++;
 
-      if (is_gimple_call (stmt))
-	maybe_warn_pass_by_reference (stmt, wlims);
+      if (gcall *call = dyn_cast <gcall *> (stmt))
+	maybe_warn_pass_by_reference (call, wlims);
       else if (gimple_assign_load_p (stmt)
	       && gimple_has_location (stmt))
	{
Re: Where did my function go?
> On Wed, Oct 21, 2020 at 5:21 AM Gary Oblock wrote: > > > > >IPA transforms happens when get_body is called. With LTO this also > > >trigger reading the body from disk. So if you want to see all bodies > > >and work on them, you can simply call get_body on everything but it will > > >result in increased memory use since everything will be loaded form disk > > >and expanded (by inlining) at once instead of doing it on per-function > > >basis. > > Jan, > > > > Doing > > > > FOR_EACH_FUNCTION_WITH_GIMPLE_BODY ( node) node->get_body (); > > > > instead of > > > > FOR_EACH_FUNCTION_WITH_GIMPLE_BODY ( node) node->get_untransformed_body (); > > > > instantaneously breaks everything... > > I think during WPA you cannot do ->get_body (), only > ->get_untransformed_body (). But > we don't know yet where in the IPA process you're experiencing the issue. Originally get_body was designed to work in WPA as well: the info about what transforms are to be applied is kept in a vector with per-function granularity. But there may be some issues, as this path is untested and e.g. ipa-sra/ipa-prop do quite difficult transformations these days. What happens? Honza > > Richard. > > > Am I missing something? > > > > Gary > > > > From: Jan Hubicka > > Sent: Tuesday, October 20, 2020 4:34 AM > > To: Richard Biener > > Cc: GCC Development ; Gary Oblock > > > > Subject: Re: Where did my function go? > > > > [EXTERNAL EMAIL NOTICE: This email originated from an external sender. > > Please be mindful of safe email handling and proprietary information > > protection practices.] > > > > > > > > On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor wrote: > > > > > > > > > > Hi, > > > > > > > > > > On Tue, Oct 20 2020, Richard Biener wrote: > > > > > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock > > > > > > wrote: > > > > > >> > > > > > >> Richard, > > > > > >> > > > > > >> I guess that will work for me. 
However, since it > > > > > >> was decided to remove an identical function, > > > > > >> why weren't the calls to it adjusted to reflect it? > > > > > >> If the call wasn't transformed that means it will > > > > > >> be mapped at some later time. Is that mapping > > > > > >> available to look at? Because using that would > > > > > >> also be a potential solution (assuming call > > > > > >> graph information exists for the deleted function.) > > > > > > > > > > > > I'm not sure how the transitional cgraph looks like > > > > > > during WPA analysis (which is what we're talking about?), > > > > > > but definitely the IL is unmodified in that state. > > > > > > > > > > > > Maybe Martin has an idea. > > > > > > > > > > > > > > > > Exactly, the cgraph_edges is where the correct call information is > > > > > stored until the inlining transformation phase calls > > > > > cgraph_edge::redirect_call_stmt_to_callee is called on it - inlining > > > > > is > > > > > a special pass in this regard that performs this IPA-infrastructure > > > > > function in addition to actual inlining. > > > > > > > > > > In cgraph means the callee itself but also information in > > > > > e->callee->clone.param_adjustments which might be interesting for any > > > > > struct-reorg-like optimizations (...and in future possibly in other > > > > > transformation summaries). > > > > > > > > > > The late IPA passes are in very unfortunate spot here since they run > > > > > before the real-IPA transformation phases but after unreachable node > > > > > removals and after clone materializations and so can see some but not > > > > > all of the changes performed by real IPA passes. The reason for that > > > > > is > > > > > good cache locality when late IPA passes are either not run at all or > > > > > only look at small portion of the compilation unit. In such case IPA > > > > > transformations of a function are followed by all the late passes > > > > > working on the same function. > >
Re: State of AutoFDO in GCC
> > > With my tests, AutoFDO could achieve almost half of the effect of > > instrumentation FDO on real applications such as MySQL 8.0.20 . > > Likely this could be improved with some of the missing changes. Apparently > discriminator support is worth quite a bit especially on dense C++ code > bases. Without that, gcc autofdo only works on line numbers, which can be > very limiting if single lines have a lot of basic blocks. > > Sadly discriminator support is currently only on the old Google branch and > not in gcc mainline > > Longer term it would probably be best to replace all this with some custom > specialized binary annotation instead of stretching DWARF beyond its limits. I think it makes sense to add AutoFDO to the list of things that would be nice to fix in GCC 12. I guess first we need to solve the issues with the tool producing autofdo gcda files, and once that works, set up some benchmarking so we know how things compare to FDO and they get tested. If you point me to the discriminator patches I can try to figure out how hard it would be to mainline them. I am not too sure about custom annotations though - storing info about regions of the code segment is always a pain and it is quite nice that dwarf provides support for that. But I guess we could go step by step. I first need a working setup ;) Honza > > -Andi >
Re: [PATCH] Try LTO partial linking. (Was: Speed of compiling gimple-match.c)
> On Thu, May 20, 2021 at 3:16 PM Richard Biener > wrote: > > > > On Thu, May 20, 2021 at 3:06 PM Martin Liška wrote: > > > > > > On 5/20/21 2:54 PM, Richard Biener wrote: > > > > So why did you go from applying this per-file to multiple files? > > > > > > When I did per-file for {gimple,generic}-match.c I hit the following > > > issue with lto.priv symbols: > > > > > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: > > > error: libbackend.a(generic-match.o): multiple definition of > > > 'wi::to_wide(tree_node const*) [clone .part.0] [clone .lto_priv.0]' > > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: > > > libbackend.a(gimple-match.o): previous definition here > > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: > > > error: libbackend.a(generic-match.o): multiple definition of > > > 'TYPE_VECTOR_SUBPARTS(tree_node const*) [clone .part.0] [clone > > > .lto_priv.0]' > > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: > > > libbackend.a(gimple-match.o): previous definition here > > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: > > > error: libbackend.a(generic-match.o): multiple definition of > > > 'vec::operator[](unsigned int) [clone > > > .part.0] [clone .lto_priv.0]' > > > > > > Any idea what was I doing wrong? > > > > Nothing in particular I think - you're just hitting the issue that LTO > > produces new symbols and that those > > can obviously clash. Giuliano hit the very same issue. When not > > doing partial links those internal > > symbols pose no problem, but with -r -flinker-output=nolto-rel and > > re-linking the produced objects > > they obviously do. 
ELF has no solution for this though, but I think > > we could strip those from the > > partially linked object - if WPA would give us a list of objects the > > link step could postprocess > > the object with objcopy or maybe a custom linker script could do the > > trick as well. > > Oh, and the "best" solution would be to avoid involving the linker > when doing -r -flinker-output=nolto-rel but instead have the assembler > produce the single object from the multiple LTRANS assembly snippets > which could then use local labels instead of symbols for these. A quick solution would be to also make the partitioner use the local symbol names (the ones that mix in the source file name and random seeds) when doing incremental linking, to avoid the clashes. Honza > > > So your workaround is to only ever have a single LTO produced object > > file participating in the > > final links ;) > > > > Richard. > > > > > > > > Martin
Re: State of AutoFDO in GCC
> On Fri, Apr 23, 2021 at 12:18 AM Martin Liška wrote: > > > On 4/23/21 9:00 AM, Richard Biener via Gcc wrote: > > > On Fri, Apr 23, 2021 at 7:28 AM Xinliang David Li via Gcc > > > wrote: > > >> > > >> Hi, the create_gcov tool was probably removed with the assumption that > > it > > >> was only used with Google GCC branch, but it is actually used with GCC > > >> trunk as well. > > >> > > >> Given that, the tool will be restored in the github repo. It seems to > > build > > >> and work fine with the regression test. > > >> > > >> The tool may ust work as it is right now, but there is no guarantee it > > >> won't break in the future unless someone in the GCC community tries to > > >> maintain it. > > > > Hi. > > > > The current situation is that AutoFDO doesn't work with pretty simple > > test-cases > > we have in testsuite: > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71672 > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81379 > > > > These are ~5 years old and nothing has happened. The basic functionality is to identify hot parts of programs and give a basic idea of branch probabilities. The bugs above are about other optimizations (indirect inlining, hot-cold partitioning and peeling). The first one ought to be working, but indirect call profiling still accounts for a relatively minor portion of FDO benefits, so auto-FDO should be usable even without these working if the tooling was fixed. Expecting hot-cold partitioning and peeling to work reliably for testcases designed for normal FDO is IMO a bit unrealistic. Perf is not limited to x86, so we should get basic usability for all major architectures. David, what CPUs are supported with auto-FDO on the LLVM side? > > This is a major feature in Clang. Deprecating it in GCC will be unfortunate. I also like the overall idea of auto-FDO; I just never had much time to get it into shape. 
Every stage1 I am trying to pick something that was broken for a while (like ICF last stage1) and fix it, so one option would be to focus on AutoFDO this stage1 even though I have quite a few other things on my TODO list. I never looked into the tools translating perf data to the gcc compatible format and how easy it would be to keep this alive. David, how does the perf->LLVM path work these days? Honza > > David > > > > > Thoughts? > > Martin > > > > > > > > Having the tool third-party makes keeping the whole chain working more > > > difficult. > > > > > > Richard. > > > > > >> Thanks, > > >> > > >> David > > >> > > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka wrote: > > >> > > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote: > > >>>>> GCC documentation for AutoFDO points to create_gcov tool that > > converts > > >>> perf.data file into gcov format that can be consumed by gcc with > > >>> -fauto-profile ( > > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, > > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial). > > >>>>> > > >>>>> I noticed that the source code for create_gcov has been deleted from > > >>> https://github.com/google/autofdo on April 7. I asked about that > > change > > >>> in that repo and got the following reply: > > >>>>> > > >>>>> https://github.com/google/autofdo/pull/107#issuecomment-819108738 > > >>>>> > > >>>>> "Actually we didn't use create_gcov and havn't updated create_gcov > > for > > >>> years, and we also didn't have enough tests to guarantee it works (It > > was > > >>> gcc-4.8 when we used and verified create_gcov). If you need it, it is > > >>> welcomed to update create_gcov and add it to the respository." > > >>>>> > > >>>>> Does this mean that AutoFDO is currently dead in gcc? > > >>>> > > >>>> Hello. > > >>>> > > >>>> Yes. I know that even basic test cases have been broken for years in > > the > > >>> GCC. > > >>>> It's new to me that create_gcov was removed. 
> > >>>> > > >>>> I tend to send patch to GCC that will remove AutoFDO from GCC. > > >>>> I known Bin spent some time working on AutoFDO, has he came up to > > >>> something? > > >>> > > >>> The GCC side of auto-FDO is not that hard. We have most of > > >>> infrastructure in place, but stopping point for me was always > > difficulty > > >>> to get gcov-tool working. If some maintainer steps up, I think I can > > >>> fix GCC side. > > >>> > > >>> I am bit unsure how important feature it is - we have FDO that works > > >>> quite well for most users but I know there are some users of the LLVM > > >>> implementation and there is potential to tie this with other hardware > > >>> events to asist i.e. if conversion (where one wants to know how well > > CPU > > >>> predicts the jump rather than just the jump probability) which I always > > >>> found potentially interesting. > > >>> > > >>> Honza > > >>>> > > >>>> Martin > > >>>> > > >>>>> > > >>>>> Thanks, > > >>>>> > > >>>>> Eugene > > >>>>> > > >>>> > > >>> > > > >
Re: State of AutoFDO in GCC
> > It uses create_llvm_prof tool which is in the same git repo. The data > parsing part is shared with create_gcov, but the writer is obviously > different for the two tools. OK, and what are the main differences between the llvm and gcc formats? Honza > > David > > > > Honza > > > > > > David > > > > > > > > > > > Thoughts? > > > > Martin > > > > > > > > > > > > > > Having the tool third-party makes keeping the whole chain working > > more > > > > > difficult. > > > > > > > > > > Richard. > > > > > > > > > >> Thanks, > > > > >> > > > > >> David > > > > >> > > > > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka wrote: > > > > >> > > > > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote: > > > > >>>>> GCC documentation for AutoFDO points to create_gcov tool that > > > > converts > > > > >>> perf.data file into gcov format that can be consumed by gcc with > > > > >>> -fauto-profile ( > > > > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, > > > > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial). > > > > >>>>> > > > > >>>>> I noticed that the source code for create_gcov has been deleted > > from > > > > >>> https://github.com/google/autofdo on April 7. I asked about that > > > > change > > > > >>> in that repo and got the following reply: > > > > >>>>> > > > > >>>>> > > https://github.com/google/autofdo/pull/107#issuecomment-819108738 > > > > >>>>> > > > > >>>>> "Actually we didn't use create_gcov and havn't updated > > create_gcov > > > > for > > > > >>> years, and we also didn't have enough tests to guarantee it works > > (It > > > > was > > > > >>> gcc-4.8 when we used and verified create_gcov). If you need it, it > > is > > > > >>> welcomed to update create_gcov and add it to the respository." > > > > >>>>> > > > > >>>>> Does this mean that AutoFDO is currently dead in gcc? > > > > >>>> > > > > >>>> Hello. > > > > >>>> > > > > >>>> Yes. I know that even basic test cases have been broken for years > > in > > > > the > > > > >>> GCC. 
> > > > >>>> It's new to me that create_gcov was removed. > > > > >>>> > > > > >>>> I tend to send patch to GCC that will remove AutoFDO from GCC. > > > > >>>> I known Bin spent some time working on AutoFDO, has he came up to > > > > >>> something? > > > > >>> > > > > >>> The GCC side of auto-FDO is not that hard. We have most of > > > > >>> infrastructure in place, but stopping point for me was always > > > > difficulty > > > > >>> to get gcov-tool working. If some maintainer steps up, I think I > > can > > > > >>> fix GCC side. > > > > >>> > > > > >>> I am bit unsure how important feature it is - we have FDO that > > works > > > > >>> quite well for most users but I know there are some users of the > > LLVM > > > > >>> implementation and there is potential to tie this with other > > hardware > > > > >>> events to asist i.e. if conversion (where one wants to know how > > well > > > > CPU > > > > >>> predicts the jump rather than just the jump probability) which I > > always > > > > >>> found potentially interesting. > > > > >>> > > > > >>> Honza > > > > >>>> > > > > >>>> Martin > > > > >>>> > > > > >>>>> > > > > >>>>> Thanks, > > > > >>>>> > > > > >>>>> Eugene > > > > >>>>> > > > > >>>> > > > > >>> > > > > > > > > > >
Re: State of AutoFDO in GCC
> Hi. > > The current situation is that AutoFDO doesn't work with pretty simple > test-cases > we have in testsuite: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71672 > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81379 > > These are ~5 years old and nothing has happened. > > I'm pretty sure the current autofdo can't emit a .gcda file format that > I've changed in the recent years. I believe that the gcda files consumed by auto-FDO are not really sharing the normal gcov file format, just the low level i/o. See read_profile in auto-profile.c. Honza > > > > > I think if we want to keep the feature it makes sense to provide create_gcov > > functionality either directly from perf (input data producer) or from gcc > > (data consumer). Of course I have no idea about its complexity, license > > or implementation language ... > > For me, it's just an i386 feature (maybe aarch64 has perf counters too?), > supported > only by vendor (Intel) and I'm not planning working on that. > I don't like having a feature that is obviously broken and potential GCC > users get > bad experience every time they try to use it. > > Can we at least deprecate the feature for GCC 11? If these is enough interest, > we can fix it, if not, I would remove it in GCC 13 timeframe. > > Thoughts? > Martin > > > > > Having the tool third-party makes keeping the whole chain working more > > difficult. > > > > Richard. > > > >> Thanks, > >> > >> David > >> > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka wrote: > >> > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote: > >>>>> GCC documentation for AutoFDO points to create_gcov tool that converts > >>> perf.data file into gcov format that can be consumed by gcc with > >>> -fauto-profile (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial). > >>>>> > >>>>> I noticed that the source code for create_gcov has been deleted from > >>> https://github.com/google/autofdo on April 7. 
I asked about that change > >>> in that repo and got the following reply: > >>>>> > >>>>> https://github.com/google/autofdo/pull/107#issuecomment-819108738 > >>>>> > >>>>> "Actually we didn't use create_gcov and havn't updated create_gcov for > >>> years, and we also didn't have enough tests to guarantee it works (It was > >>> gcc-4.8 when we used and verified create_gcov). If you need it, it is > >>> welcomed to update create_gcov and add it to the respository." > >>>>> > >>>>> Does this mean that AutoFDO is currently dead in gcc? > >>>> > >>>> Hello. > >>>> > >>>> Yes. I know that even basic test cases have been broken for years in the > >>> GCC. > >>>> It's new to me that create_gcov was removed. > >>>> > >>>> I tend to send patch to GCC that will remove AutoFDO from GCC. > >>>> I known Bin spent some time working on AutoFDO, has he came up to > >>> something? > >>> > >>> The GCC side of auto-FDO is not that hard. We have most of > >>> infrastructure in place, but stopping point for me was always difficulty > >>> to get gcov-tool working. If some maintainer steps up, I think I can > >>> fix GCC side. > >>> > >>> I am bit unsure how important feature it is - we have FDO that works > >>> quite well for most users but I know there are some users of the LLVM > >>> implementation and there is potential to tie this with other hardware > >>> events to asist i.e. if conversion (where one wants to know how well CPU > >>> predicts the jump rather than just the jump probability) which I always > >>> found potentially interesting. > >>> > >>> Honza > >>>> > >>>> Martin > >>>> > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Eugene > >>>>> > >>>> > >>> >
Re: State of AutoFDO in GCC
> On Fri, Apr 23, 2021 at 10:27 AM Xinliang David Li > wrote: > > > > > > > On Fri, Apr 23, 2021 at 10:16 AM Jan Hubicka wrote: > > > >> > > >> > It uses create_llvm_prof tool which is in the same git repo. The data > >> > parsing part is shared with create_gcov, but the writer is obviously > >> > different for the two tools. > >> > >> OK, and what are the main differences between the llvm and gcc formats? > >> > >> GCC expects GCOV format, I think while LLVM uses a newly designed binary > > format. > > > > > Sorry, I missed your next reply. I forgot about the details of GCC's format. Thanks for the explanation. How hard would it be to make GCC consume this format? Is it reasonably stable and documented somewhere? Does LLVM's auto-FDO support non-Intel CPUs these days? Honza > > David > > > > > >> Honza > >> > > >> > David > >> > > >> > > >> > > Honza > >> > > > > >> > > > David > >> > > > > >> > > > > > >> > > > > Thoughts? > >> > > > > Martin > >> > > > > > >> > > > > > > >> > > > > > Having the tool third-party makes keeping the whole chain > >> working > >> > > more > >> > > > > > difficult. > >> > > > > > > >> > > > > > Richard. > >> > > > > > > >> > > > > >> Thanks, > >> > > > > >> > >> > > > > >> David > >> > > > > >> > >> > > > > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka > >> wrote: > >> > > > > >> > >> > > > > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote: > >> > > > > >>>>> GCC documentation for AutoFDO points to create_gcov tool > >> that > >> > > > > converts > >> > > > > >>> perf.data file into gcov format that can be consumed by gcc > >> with > >> > > > > >>> -fauto-profile ( > >> > > > > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, > >> > > > > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial). > >> > > > > >>>>> > >> > > > > >>>>> I noticed that the source code for create_gcov has been > >> deleted > >> > > from > >> > > > > >>> https://github.com/google/autofdo on April 7. 
I asked about > >> that > >> > > > > change > >> > > > > >>> in that repo and got the following reply: > >> > > > > >>>>> > >> > > > > >>>>> > >> > > https://github.com/google/autofdo/pull/107#issuecomment-819108738 > >> > > > > >>>>> > >> > > > > >>>>> "Actually we didn't use create_gcov and havn't updated > >> > > create_gcov > >> > > > > for > >> > > > > >>> years, and we also didn't have enough tests to guarantee it > >> works > >> > > (It > >> > > > > was > >> > > > > >>> gcc-4.8 when we used and verified create_gcov). If you need > >> it, it > >> > > is > >> > > > > >>> welcomed to update create_gcov and add it to the respository." > >> > > > > >>>>> > >> > > > > >>>>> Does this mean that AutoFDO is currently dead in gcc? > >> > > > > >>>> > >> > > > > >>>> Hello. > >> > > > > >>>> > >> > > > > >>>> Yes. I know that even basic test cases have been broken for > >> years > >> > > in > >> > > > > the > >> > > > > >>> GCC. > >> > > > > >>>> It's new to me that create_gcov was removed. > >> > > > > >>>> > >> > > > > >>>> I tend to send patch to GCC that will remove AutoFDO from > >> GCC. > >> > > > > >>>> I known Bin spent some time working on AutoFDO, has he came > >> up to > >> > > > > >>> something? > >> > > > > >>> > >> > > > > >>> The GCC side of auto-FDO is not that hard. We have most of > >> > > > > >>> infrastructure in place, but stopping point for me was always > >> > > > > difficulty > >> > > > > >>> to get gcov-tool working. If some maintainer steps up, I > >> think I > >> > > can > >> > > > > >>> fix GCC side. > >> > > > > >>> > >> > > > > >>> I am bit unsure how important feature it is - we have FDO that > >> > > works > >> > > > > >>> quite well for most users but I know there are some users of > >> the > >> > > LLVM > >> > > > > >>> implementation and there is potential to tie this with other > >> > > hardware > >> > > > > >>> events to asist i.e. 
if conversion (where one wants to know > >> how > >> > > well > >> > > > > CPU > >> > > > > >>> predicts the jump rather than just the jump probability) > >> which I > >> > > always > >> > > > > >>> found potentially interesting. > >> > > > > >>> > >> > > > > >>> Honza > >> > > > > >>>> > >> > > > > >>>> Martin > >> > > > > >>>> > >> > > > > >>>>> > >> > > > > >>>>> Thanks, > >> > > > > >>>>> > >> > > > > >>>>> Eugene > >> > > > > >>>>> > >> > > > > >>>> > >> > > > > >>> > >> > > > > > >> > > > > > >> > > > >> > >