Re: AutoFDO profile toolchain is open-sourced

2015-05-09 Thread Jan Hubicka
  Yes, it will. But it's not well tuned at all. I will start tuning it
  if I have free cycles. It would be great if opensource community can
  also contribute to this tuning effort.
 
 If you could outline portions of code which needs tuning, rewriting, that 
 will help get started in this effort.

Optimization passes in GCC are generally designed to work with any kind of edge
profile they get.  There are only a few cases where they care about what kind
of profile is around.

At the moment we consider two types of profiles - static (guessed) and FDO.  For
the static one we shut down the use of profile info for some heuristics - for
example, we do not expect loop trip counts to be reliable in such profiles,
because they are not.  You can look for code checking profile_status_for_fn.

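As an illustration of the kind of check involved (a sketch only, not a pointer
to any particular existing heuristic; profile_status_for_fn and PROFILE_READ
are real GCC symbols, the helper around them is made up):

/* Sketch: trust profile-derived loop trip counts only when the profile
   was actually read from feedback rather than statically guessed.  */
static bool
trust_profile_trip_counts_p (struct function *fn)
{
  return profile_status_for_fn (fn) == PROFILE_READ;
}
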
Auto-FDO does not have a special value for profile_status_for_fn and goes
through the same code paths as FDO.  Dehao has some patches for Auto-FDO tuning,
but my impression is that he mostly got around problems by just making the
optimizer a bit more robust against nonsensical profiles, which is always good,
since even FDO profiles can be wrong.  BTW, Dehao, do you think you can submit
these changes for this stage1?

I suppose in this case we have yet another kind of profile that is less reliable
than FDO, and we need to start by simply benchmarking, looking for cases where
this profile makes code worse, and handling them one by one :)

Honza


Passing unfinished types to gen_type_die_with_usage

2015-05-14 Thread Jan Hubicka
Hello,
as I mentioned yesterday on IRC adding a check that only complete types
have TYPE_BINFO defined triggers on type:

 record_type 0x76d24d20 __is_integer type_5 type_6 VOID
align 8 symtab 0 alias set -1 canonical type 0x76d24d20
fields const_decl 0x74a89930 __value
type enumeral_type 0x76d260a8 ._8 type integer_type 
0x76d261f8 unsigned int
unsigned SI
size integer_cst 0x76d0e0a8 constant 32
unit size integer_cst 0x76d0e0c0 constant 4
align 32 symtab 0 alias set -1 canonical type 0x76d260a8 
precision 32 min integer_cst 0x76d0e0d8 0 max integer_cst 0x76d0e090 
4294967295
values tree_list 0x76d2e078
purpose identifier_node 0x76ea9c08 __value
bindings (nil)
local bindings 0x76d2a640 value const_decl 
0x74a89930 __value context record_type 0x76d24d20 __is_integer
chain type_decl 0x74aa4da8 ._8
readonly constant VOID file 
/aux/hubicka/trunk-7/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/cpp_type_traits.h
 line 127 col 14
align 1 context enumeral_type 0x76d260a8 ._8 initial integer_cst 
0x76eea1c8 0
chain type_decl 0x74aa4a18 __is_integer type record_type 
0x76d24dc8 __is_integer
external nonlocal in_system_header suppress-debug decl_4 VOID file 
/aux/hubicka/trunk-7/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/cpp_type_traits.h
 line 126 col 5
align 8 context record_type 0x76d24d20 __is_integer result 
record_type 0x76d24d20 __is_integer
chain type_decl 0x74aa4da8 ._8 context namespace_decl 
0x76d08098 std
full-name struct std::__is_integerlong double
n_parents=0 use_template=1 interface-unknown
chain type_decl 0x74aa4688 __is_integer

This type has fields but no TYPE_SIZE, which seems odd.  It is an instance, not
a template, so I think the problem is that layout_type was not called on it.

I debugged the reason why we get a type with no TYPE_SIZE but with BINFO.  The
problem is that dwarf2out visits this type before it is finished.  The BINFO
is added as:

#0  xref_basetypes (ref=0x76d24d20, base_list=0x0) at 
../../gcc/cp/decl.c:12525
#1  0x0078b786 in instantiate_class_template_1 (type=0x76d24d20) at 
../../gcc/cp/pt.c:9256
#2  0x0078d3e7 in instantiate_class_template (type=0x76d24d20) at 
../../gcc/cp/pt.c:9650
#3  0x0088772c in complete_type (type=0x76d24d20) at 
../../gcc/cp/typeck.c:146
#4  0x008dba41 in lookup_member (xbasetype=0x0, name=0x76ea9c08, 
protect=2, want_type=false, complain=3) at ../../gcc/cp/search.c:1213
#5  0x009544db in lookup_qualified_name (scope=0x76d24d20, 
name=0x76ea9c08, is_type_p=false, complain=false) at 
../../gcc/cp/name-lookup.c:4540
#6  0x0079ca43 in tsubst_qualified_id (qualified_id=0x76d2a4d8, 
args=0x76d2caf0, complain=3, in_decl=0x74ac9a00, done=true, 
address_p=false) at ../../gcc/cp/pt.c:12717
#7  0x007a80df in tsubst_copy_and_build (t=0x76d2a4d8, 
args=0x76d2caf0, complain=3, in_decl=0x74ac9a00, function_p=false, 
integral_constant_expression_p=true)
at ../../gcc/cp/pt.c:14760
#8  0x007a67b4 in tsubst_expr (t=0x76d2a4d8, args=0x76d2caf0, 
complain=3, in_decl=0x74ac9a00, integral_constant_expression_p=true) at 
../../gcc/cp/pt.c:14367
#9  0x0078d491 in tsubst_template_arg (t=0x76d2a4d8, 
args=0x76d2caf0, complain=3, in_decl=0x74ac9a00) at 
../../gcc/cp/pt.c:9669
#10 0x0077f3ca in coerce_template_parms (parms=0x76d2a550, 
args=0x76e83960, in_decl=0x74ac9a00, complain=3, require_all_args=true, 
use_default_args=true)
at ../../gcc/cp/pt.c:7142
#11 0x0077fa17 in coerce_innermost_template_parms 
(parms=0x76d2a578, args=0x76e83960, in_decl=0x74ac9a00, complain=3, 
require_all_args=true, use_default_args=true)
at ../../gcc/cp/pt.c:7252
#12 0x00782455 in lookup_template_class_1 (d1=0x76eab6e0, 
arglist=0x76e83960, in_decl=0x0, context=0x76e5bed8, entering_scope=0, 
complain=3) at ../../gcc/cp/pt.c:7727
#13 0x007854e4 in lookup_template_class (d1=0x74ac9a00, 
arglist=0x76e83960, in_decl=0x0, context=0x0, entering_scope=0, 
complain=35) at ../../gcc/cp/pt.c:8043
#14 0x008eaefe in finish_template_type (name=0x74ac9a00, 
args=0x76e83960, entering_scope=0) at ../../gcc/cp/semantics.c:3011
#15 0x0084c7a4 in cp_parser_template_id (parser=0x76be8c00, 
template_keyword_p=false, check_dependency_p=false, tag_type=class_type, 
is_declaration=false)
at ../../gcc/cp/parser.c:13850

and dwarf2out visits it as context when processing another type:

#0  error (gmsgid=0x1abd1a8 "TYPE_BINFO and COMPLETE_TYPE_P mismatch") at 
../../gcc/diagnostic.c:1135
#1  0x012f64c7 in verify_type 

Re: g++ keeps unused objects with virtual functions

2015-04-08 Thread Jan Hubicka
 which shows how the global objects initialization keeps things live.
 Early optimization turns it into
 
 (static initializers for t.C) ()
 {
   bb 2:
   NotUsedObject._vptr.CObject = MEM[(void *)_ZTV7CObject + 16B];
   return;
 
 }
 
 but we don't have any pass removing stores to globals never read.  IPA
 reference computes this
 in some way but doesn't get to export this in a reasonable way - its
 transform stage could DSE
 those though.

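For context, a minimal reproducer of the kind of test case being discussed
(my own sketch, not necessarily the original t.C) looks like this:

// A class with a virtual function gets a vtable; the implicit constructor
// of the global object stores the vptr from a static initializer, which is
// what keeps NotUsedObject (and the vtable) alive even though nothing ever
// reads it.
struct CObject
{
  virtual void method () {}
};

CObject NotUsedObject;

int main () { return 0; }
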
We have write-only variable detection, but
 NotUsedObject/4 (CObject NotUsedObject) @0x7f70d41fe580
   Type: variable definition analyzed
   Visibility: public
   References:
   Referring: _Z41__static_initialization_and_destruction_0ii/9 (addr)
   Availability: not-ready
   Varpool flags:

it stops on the fact that it believes the address of NotUsedObject is taken.
Why is that not considered to be a load?

Honza


Re: g++ keeps unused objects with virtual functions

2015-04-08 Thread Jan Hubicka
  which shows how the global objects initialization keeps things live.
  Early optimization turns it into
  
  (static initializers for t.C) ()
  {
bb 2:
NotUsedObject._vptr.CObject = MEM[(void *)_ZTV7CObject + 16B];
return;
  
  }
  
  but we don't have any pass removing stores to globals never read.  IPA
  reference computes this
  in some way but doesn't get to export this in a reasonable way - its
  transform stage could DSE
  those though.
 
 We have write only variable detection but
  NotUsedObject/4 (CObject NotUsedObject) @0x7f70d41fe580
Type: variable definition analyzed
Visibility: public
References:
Referring: _Z41__static_initialization_and_destruction_0ii/9 (addr)
Availability: not-ready
Varpool flags:
 
 it stops on fact believing that address of NotUsedObject is taken. Why it is
 not considered to be load?

Too early a dump, I guess.  On mainline I get an empty constructor:
.type   _GLOBAL__sub_I_NotUsedObject, @function
_GLOBAL__sub_I_NotUsedObject:
.LFB6:
rep ret

not ideal, but still an improvement ;)
The problem is that the write-only variable is removed only at late optimization,
and then it is too late to remove the unreachable function.

Honza
 
 Honza


Re: AutoFDO profile toolchain is open-sourced

2015-04-10 Thread Jan Hubicka
 On Tue, Apr 7, 2015 at 9:45 AM, Ilya Palachev i.palac...@samsung.com wrote:
  In the mentioned README file it is said that  In order to collect this
  profile, you will need to have an Intel CPU that have last branch record
  (LBR) support. Is this information obsolete? Chrome Canary builds use
  AutoFDO for ARMv7l
  (https://code.google.com/p/chromium/issues/detail?id=434587)
 
 It does not mean that the profile was recorded on an ARM system: they
 can gather perf.data on x86 and then produce a coverage file that is
 then used in ARM compiles.  I tried it and seems to work well.

I must say I did not even try running AutoFDO myself (so I am happy to hear
it works). My understanding is that you need LBR only to get indirect
call profiling working (i.e. you want to know from where the indirect
function is called).

Depending on your application this may not be the most important thing to
record (either you don't have indirect calls on hot paths, or they are handled
reasonably by speculative devirtualization).

Some ARMs also have support for tracing jump pairs, right?
Honza
 
 Sebastian


Re: AutoFDO profile toolchain is open-sourced

2015-04-10 Thread Jan Hubicka
 LBR is used for both cfg edge profiling and indirect call Target value
 profiling.
I see, that makes sense ;)  I guess if we want to support profile collection
on targets without this feature, we could still use one of the algorithms that
try to guess the edge profile from the BB profile.

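One simple way to do that (a sketch of the general idea only, not of any
existing GCC pass): take the basic block counts as given and use flow
conservation - the incoming and the outgoing edge counts of a block must both
sum to the block count - to fill in every edge that is the only unknown one
around some block; whatever remains has to be guessed, e.g. split evenly
among successors.

#include <cstdint>
#include <vector>

struct Edge { int src, dst; int64_t count = -1; };  // -1 means unknown

// Fill in edge counts forced by flow conservation: whenever all but one
// edge entering (or leaving) a block is known, the remaining one equals
// the block count minus the sum of the known ones.  Edges still unknown
// afterwards need heuristic guessing.
void propagate_edge_counts (const std::vector<int64_t> &bb_count,
                            std::vector<Edge> &edges)
{
  bool changed = true;
  while (changed)
    {
      changed = false;
      for (int bb = 0; bb < (int) bb_count.size (); bb++)
        for (int incoming = 0; incoming < 2; incoming++)
          {
            int64_t known_sum = 0;
            Edge *unknown = nullptr;
            int n_unknown = 0;
            for (Edge &e : edges)
              if ((incoming ? e.dst : e.src) == bb)
                {
                  if (e.count >= 0)
                    known_sum += e.count;
                  else
                    unknown = &e, n_unknown++;
                }
            if (n_unknown == 1)
              {
                unknown->count = bb_count[bb] - known_sum;
                changed = true;
              }
          }
    }
}
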
Honza


Re: lto bootstrap fails.

2015-04-10 Thread Jan Hubicka
 On Fri, Apr 10, 2015 at 11:18:39AM -0400, Trevor Saunders wrote:
  On Fri, Apr 10, 2015 at 03:59:19PM +0200, Toon Moene wrote:
   Like this:
   
   https://gcc.gnu.org/ml/gcc-testresults/2015-04/msg01086.html
   
   ODR rears its head again ...
  
   huh, why is c/c-lang.h getting included in files linked into cc1plus?
   that seems strange.
 
 readelf -wl cc1plus | grep c-lang.h
 doesn't show anything.

I tried to reproduce it, but my bootstrap passes with the same options as Toon's.
The following patch ought to be able to tell us the particular translation units
causing the conflict.

Index: tree.c
===
--- tree.c  (revision 221977)
+++ tree.c  (working copy)
@@ -4679,6 +4679,8 @@ build_translation_unit_decl (tree name)
   tree tu = build_decl (UNKNOWN_LOCATION, TRANSLATION_UNIT_DECL,
name, NULL_TREE);
   TRANSLATION_UNIT_LANGUAGE (tu) = lang_hooks.name;
+  if (main_input_filename)
+DECL_NAME (tu) = get_identifier (main_input_filename);
   vec_safe_push (all_translation_units, tu);
   return tu;
 }
Index: ipa-devirt.c
===
--- ipa-devirt.c(revision 221977)
+++ ipa-devirt.c(working copy)
@@ -979,6 +979,21 @@ warn_odr (tree t1, tree t2, tree st1, tr
 return;
   inform (DECL_SOURCE_LOCATION (decl2), reason);
 
+  tree name = TYPE_NAME (t1);
+  tree name1 = decl2;
+  /* See if we have info about the translation unit.  It may not be around
+     if types was already merged.   */
+  while (TREE_CODE (name) != TRANSLATION_UNIT_DECL)
+    name = TYPE_P (name) ? TYPE_CONTEXT (name) : DECL_CONTEXT (name);
+  while (TREE_CODE (name1) != TRANSLATION_UNIT_DECL)
+    name1 = TYPE_P (name1) ? TYPE_CONTEXT (name1) : DECL_CONTEXT (name1);
+  name = DECL_NAME (name);
+  name1 = DECL_NAME (name1);
+  if (name != name1 && name && name1)
+    inform (UNKNOWN_LOCATION, "Conflicting compilation units: %s and %s",
+            IDENTIFIER_POINTER (name),
+            IDENTIFIER_POINTER (name1));
+
   if (warned)
 *warned = true;
 }


Re: debug-early branch merged into mainline

2015-06-06 Thread Jan Hubicka
 On 06/06/2015 03:33 PM, Jan Hubicka wrote:
 Aldy,
 also at PPC64le LTO bootstrap (at gcc112) dies with:
 ^
 0x104ae8f7 check_die
  ../../gcc/dwarf2out.c:5715
 
 Hmmm... this is in the LTO/ltrans stage?  If so, that's weird.  The
 LTO path does not do the early DIE dance.  Since check_die() is a
 new sanity check for DIEs, I wonder if this is a latent bug that was
 already there.
 
 How do I reproduce/do the LTO bootstrap thing?

I simply bootstrap with 

configure --enable-languages=all,obj-c++,go --disable-werror 
--with-build-config=bootstrap-lt

It is during ltrans stage.

Honza
 
 Aldy
 
 0x104e4e1b dwarf2out_decl
  ../../gcc/dwarf2out.c:21886
 0x104d8a87 dwarf2out_abstract_function
  ../../gcc/dwarf2out.c:18457
 0x104dd3e7 gen_inlined_subroutine_die
  ../../gcc/dwarf2out.c:19891
 0x104e11f7 gen_block_die
  ../../gcc/dwarf2out.c:20933
 0x104e17b3 decls_for_scope
  ../../gcc/dwarf2out.c:21006
 0x104da907 gen_subprogram_die
  ../../gcc/dwarf2out.c:19167
 0x104e343b gen_decl_die
  ../../gcc/dwarf2out.c:21409
 0x104e4dfb dwarf2out_decl
  ../../gcc/dwarf2out.c:21882
 0x104e4e6f dwarf2out_function_decl
  ../../gcc/dwarf2out.c:21894
 0x105ab99f rest_of_handle_final
  ../../gcc/final.c:4505
 0x105abd27 execute
  ../../gcc/final.c:4547
 Please submit a full bug report,
 with preprocessed source if appropriate.
 


Re: debug-early branch merged into mainline

2015-06-06 Thread Jan Hubicka
  On 06/06/2015 03:33 PM, Jan Hubicka wrote:
  Aldy,
  also at PPC64le LTO bootstrap (at gcc112) dies with:
  ^
  0x104ae8f7 check_die
   ../../gcc/dwarf2out.c:5715
  
  Hmmm... this is in the LTO/ltrans stage?  If so, that's weird.  The
  LTO path does not do the early DIE dance.  Since check_die() is a
  new sanity check for DIEs, I wonder if this is a latent bug that was
  already there.
  
  How do I reproduce/do the LTO bootstrap thing?
 
 I simply bootstrap with 
 
 configure --enable-languages=all,obj-c++,go --disable-werror 
 --with-build-config=bootstrap-lt
Ah, that was not the full command line:
configure --enable-languages=all,obj-c++,go --disable-werror 
--with-build-config=bootstrap-lto

I will try your check_die patch, but probably not until Monday afternoon - I have
real-life deadlines till then.

Honza


Re: debug-early branch merged into mainline

2015-06-06 Thread Jan Hubicka
Aldy,
also at PPC64le LTO bootstrap (at gcc112) dies with:
^
0x104ae8f7 check_die
../../gcc/dwarf2out.c:5715
0x104e4e1b dwarf2out_decl
../../gcc/dwarf2out.c:21886
0x104d8a87 dwarf2out_abstract_function
../../gcc/dwarf2out.c:18457
0x104dd3e7 gen_inlined_subroutine_die
../../gcc/dwarf2out.c:19891
0x104e11f7 gen_block_die
../../gcc/dwarf2out.c:20933
0x104e17b3 decls_for_scope
../../gcc/dwarf2out.c:21006
0x104da907 gen_subprogram_die
../../gcc/dwarf2out.c:19167
0x104e343b gen_decl_die
../../gcc/dwarf2out.c:21409
0x104e4dfb dwarf2out_decl
../../gcc/dwarf2out.c:21882
0x104e4e6f dwarf2out_function_decl
../../gcc/dwarf2out.c:21894
0x105ab99f rest_of_handle_final
../../gcc/final.c:4505
0x105abd27 execute
../../gcc/final.c:4547
Please submit a full bug report,
with preprocessed source if appropriate.



GNU Tools Cauldron 2015 - Call for Participation

2015-06-01 Thread Jan Hubicka
Hello,
A reminder about this year's Cauldron (http://gcc.gnu.org/wiki/cauldron2015).
The webpage was just updated with a list of accepted talks and BoFs as well
as a brief review of the schedule to let you plan your travel.

With 19 very relevant talks and 10 BoFs we will have a busy schedule.  We
are still able to accept last-minute submissions to be scheduled in parallel
with other presentations.  To register your abstract, send e-mail to
tools-cauldron-ad...@googlegroups.com.  Please do so early so we can adjust
the schedule accordingly.  If you submitted an abstract but it is not in the
list, please let us know, too.

If you intend to participate, but not necessarily present, please let us know
as soon as possible.  This will allow us to plan for space and food.  Send a
message to tools-cauldron-ad...@googlegroups.com stating your intent to
participate.

Jan


Re: getrlimit compatibility issues

2015-08-31 Thread Jan Hubicka
> @@ -811,5 +806,9 @@ lto_symtab_prevailing_decl (tree decl)
>if (!ret)
>  return decl;
>  
> +  /* Do not replace a non-builtin with a builtin.  */
> +  if (is_builtin_fn (ret->decl))
> +return decl;
> +

Yep, this looks like a reasonable direction.  It will break the one-declaration
rule in a wilder sense than the current frontends do, because if a builtin
wins as the prevailing declaration, we end up with no merging at all.
I wonder if we don't want to always prevail to the first non-builtin variant?

(and yep, I need to push out the syntactic aliases patch which will make
duplicated decls well behaved; will try to do so soonish)
Honza


Re: getrlimit compatibility issues

2015-09-01 Thread Jan Hubicka
> > Yep, this looks like a resonable direction. It will break the one 
> > declaration
> > rule in a more wild sense than current frontends does so, because if a 
> > builtin
> > win as a prevailing declaration, we end up with no merging at all.
> > I wonder if we don't want to always prevail to first non-builtin variant?
> 
> I think we already try.  But this can be tricky as seen by the following
> which is needed to finally fix LTO bootstrap on trunk ...
> 
> Basically there are some function decls that do not go through
> lto-symtab handling because they are not in any CUs cgraph but
> only exist because they are refered to from BLOCK abstract origins
> (and thus also LTRANS boundary streaming doesn't force them
> into the cgraph).  For example during LTO bootstrap I see
> this from inlining of split functions where both split parts and
> the original function are optimized away before stream-out.

Yep, I remember poking around this when I finally stopped my attempts
to get abstract origins working with LTO :(
> 
> We hit
> 
> static void
> gen_inlined_subroutine_die (tree stmt, dw_die_ref context_die)
> {
> ...
>   /* Make sure any inlined functions are known to be inlineable.  */
>   gcc_checking_assert (DECL_ABSTRACT_P (decl)
>|| cgraph_function_possibly_inlined_p (decl));
> 
> a lot without the following fixes.  I suppose we _can_ merge
> function decls which just differ in DECL_POSSIBLY_INLINED if
> we make sure to "merge" the flag during tree merging.  But
> that's for a followup, below is for correctness.

How is this going to work with early debug?  If we merge one function into
another, won't we move the references inside blocks into references to another
block tree that is possibly incompatible with the one we expect?
> 
> We were also missing to compare DECL_ABSTRACT_ORIGIN during
> tree merging (not sure if that caused any issue, I just run
> into the clone abstract origin issue above and saw that).
> 
> LTO bootstrapped and tested on x86_64-unknown-linux-gnu, I'll throw
> it on regular bootstrap & test and then plan to commit it unless
> you object that lto-symtab.c hunk...
> 
> Richard.
> 
> 2015-08-31  Richard Biener  
> 
>   lto/
>   * lto.c (compare_tree_sccs_1): Compare DECL_ABSTRACT_ORIGIN.
>   * lto-symtab.c (lto_symtab_merge): Merge DECL_POSSIBLY_INLINED flag.
>   (lto_symtab_prevailing_decl): Do not replace a decl that didn't
>   participate in merging with something else.
> 
> Index: gcc/lto/lto.c
> ===
> --- gcc/lto/lto.c (revision 227339)
> +++ gcc/lto/lto.c (working copy)
> @@ -1305,6 +1305,7 @@ compare_tree_sccs_1 (tree t1, tree t2, t
>compare_tree_edges (DECL_SIZE (t1), DECL_SIZE (t2));
>compare_tree_edges (DECL_SIZE_UNIT (t1), DECL_SIZE_UNIT (t2));
>compare_tree_edges (DECL_ATTRIBUTES (t1), DECL_ATTRIBUTES (t2));
> +  compare_tree_edges (DECL_ABSTRACT_ORIGIN (t1), DECL_ABSTRACT_ORIGIN 
> (t2));
>if ((code == VAR_DECL
>  || code == PARM_DECL)
> && DECL_HAS_VALUE_EXPR_P (t1))
> Index: gcc/lto/lto-symtab.c
> ===
> --- gcc/lto/lto-symtab.c  (revision 227339)
> +++ gcc/lto/lto-symtab.c  (working copy)
> @@ -312,6 +312,11 @@ lto_symtab_merge (symtab_node *prevailin
>  
>if (TREE_CODE (decl) == FUNCTION_DECL)
>  {
> +  /* Merge decl state in both directions, we may still end up using
> +  the new decl.  */
> +  DECL_POSSIBLY_INLINED (prevailing_decl) |= DECL_POSSIBLY_INLINED 
> (decl);
> +  DECL_POSSIBLY_INLINED (decl) |= DECL_POSSIBLY_INLINED 
> (prevailing_decl);
> +
>if (warn_type_compatibility_p (TREE_TYPE (prevailing_decl),
>TREE_TYPE (decl)))
>   return false;
> @@ -798,6 +803,18 @@ lto_symtab_prevailing_decl (tree decl)
>if (TREE_CODE (decl) == FUNCTION_DECL && DECL_ABSTRACT_P (decl))
>  return decl;
>  
> +  /* When decl did not participate in symbol resolution leave it alone.
> + This can happen when we streamed the decl as abstract origin
> + from the block tree of inlining a partially inlined function.
> + If all, the split function and the original function end up
> + optimized away early we do not put the abstract origin into the
> + ltrans boundary and we'll end up ICEing in
> + dwarf2out.c:gen_inlined_subroutine_die because we eventually
> + replace a decl with DECL_POSSIBLY_INLINED set with one without.  */
> +  if (TREE_CODE (decl) == FUNCTION_DECL
> +  && ! cgraph_node::get (decl))
> +return decl;

So here you care about not merging DECLs that are not part of the symtab with
decls that are part of the symtab?  I have a bit of a hard time understanding
why that helps.

The rest of the patch makes sense to me modulo the possible issues with moving
the DECL_ABSTRACT_ORIGIN reference to an incompatible blob of dwarf info coming
from another instance of the inlined function in another translation unit.

Re: getrlimit compatibility issues

2015-09-01 Thread Jan Hubicka
> On Tue, 1 Sep 2015, Jan Hubicka wrote:
> 
> > > > Yep, this looks like a resonable direction. It will break the one 
> > > > declaration
> > > > rule in a more wild sense than current frontends does so, because if a 
> > > > builtin
> > > > win as a prevailing declaration, we end up with no merging at all.
> > > > I wonder if we don't want to always prevail to first non-builtin 
> > > > variant?
> > > 
> > > I think we already try.  But this can be tricky as seen by the following
> > > which is needed to finally fix LTO bootstrap on trunk ...
> > > 
> > > Basically there are some function decls that do not go through
> > > lto-symtab handling because they are not in any CUs cgraph but
> > > only exist because they are refered to from BLOCK abstract origins
> > > (and thus also LTRANS boundary streaming doesn't force them
> > > into the cgraph).  For example during LTO bootstrap I see
> > > this from inlining of split functions where both split parts and
> > > the original function are optimized away before stream-out.
> > 
> > Yep, I remember poking around this when I finally stopped my attempts
> > to get abstract origins working with LTO :(
> > > 
> > > We hit
> > > 
> > > static void
> > > gen_inlined_subroutine_die (tree stmt, dw_die_ref context_die)
> > > {
> > > ...
> > >   /* Make sure any inlined functions are known to be inlineable.  */
> > >   gcc_checking_assert (DECL_ABSTRACT_P (decl)
> > >|| cgraph_function_possibly_inlined_p (decl));
> > > 
> > > a lot without the following fixes.  I suppose we _can_ merge
> > > function decls which just differ in DECL_POSSIBLY_INLINED if
> > > we make sure to "merge" the flag during tree merging.  But
> > > that's for a followup, below is for correctness.
> > 
> > How this is going to work with early debug? If we merge one function to
> > another, won't we move the references inside blocks to references to another
> > block tree that is possibly incompatible with one we expect?
> 
> Hmm, indeed, if we merge the abstract instance FUNCTION_DECL we end
> up adjusting the abstract origin DIE reference accordingly but as
> we don't merge BLOCKs the actual BLOCK abstract origins would refer
> to the "original" early debug abstract instance.  Not sure if that
> will cause problems for the debug consumer - the abstract instances
> should be the same (literally, unless you play tricks with #ifdefs
> in different uses...).
> 
> > The rest of patch makes sense to me modulo the possible issues with moving
> > ECL_ABSTRACT_ORIGIN reference to an incompatible blob of dwarf info comming
> > from other instance of the inlined function from other translation unit.
> > 
> > I see you added DECL_ABSTRACT_ORIGIN comparer, but that won't make block 
> > numbers
> > to match. 
> 
> Who cares about block numbers?  Note the abstract origin comparer was
> for the clone abstract origins we have.

Yep, no one in dwarf2out.  My original attempt used it to refer to BLOCKs within
a different function body (BLOCK_ABSTRACT_ORIGIN).  We currently don't stream
it, so we don't need to refer to it.

In the longer run we will, however, need a way to make these references work...
Merging them will be tricky, as it would require comparing the functions' block
trees.  Especially after early inlining these tend to diverge, but perhaps the
early debug is output before early inlining, and thus this problem goes away and
we only care about different ifdefs.  Those are quite common too, however, at
least in Firefox :(

Honza
> 
> Richard.


Re: ipa vrp implementation in gcc

2016-01-11 Thread Jan Hubicka
> On Mon, Jan 11, 2016 at 1:38 AM, Kugan
> <kugan.vivekanandara...@linaro.org> wrote:
> > Hi All,
> >
> > I am looking at implementing a ipa vrp pass. Jan Hubicka also talks
> > about this in 2013 GNU Cauldron as one of the optimization he would like
> > to see in gcc. So my question is, is any one implementing it. If not we
> > would like to do that.
> >
> > I also looked at the ipa-cp implementation to see how this can be done.
> > Going by this, one of the way to implement this is (skipping all the
> > details):
> >
> > - Have an early tree-vrp so that we can have value ranges for parameters
> > at call sites.
> 
> I'd rather use the IPA analysis phase for this and use a VRP algorithm
> that doesn't require ASSERT_EXPR insertion.

Another potential use of value ranges is profile estimation:
http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf
It seems to me that we may want to have something that can feed sane loop
bounds into profile estimation as well, and we can easily store the known
value ranges in SSA name annotations.
So I think a separate local pass to compute value ranges (perhaps with less
accuracy than full-blown VRP) is desirable.
> 
> > - Create jump functions that captures the value ranges of call sites
> > propagate the value ranges. In 2013 talk, Jan Hubicka talks about
> >
> > - Modifying ipa-prop.[h|c] to handles this but wouldn't it be easier to
> > have its own and much more simpler implementation ?
> 
> No idea.

I think ipa-prop.c probably won't need any significant changes.  The
code basically analyzes what values are passed through the function, and
this works for constants as well as for intervals.  In fact ipa-cp already
uses the same ipa-prop analysis for
 1) constant propagation
 2) alignment propagation
 3) propagation of known polymorphic call contexts.

So replacing 1) by value range propagation should be easily doable.
I would also like to replace alignment propagation by bitwise constant
propagation (i.e. propagating which bits are known to be zero and which
bits are known to be one).  We already have bitwise CCP, so we could
deal with this basically in the same way as we deal with value ranges.

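The lattice operation for that is simple; a self-contained sketch of the meet
at a call argument (illustrative only, not GCC's internal representation):

#include <cstdint>

// Known-bits lattice value: bits set in known_zero are known to be 0,
// bits set in known_one are known to be 1, everything else is unknown.
struct known_bits
{
  uint64_t known_zero;
  uint64_t known_one;
};

// Meet of two values reaching the same parameter: a bit stays known only
// if it is known, with the same value, on both incoming paths.
static known_bits
meet (known_bits a, known_bits b)
{
  return { a.known_zero & b.known_zero, a.known_one & b.known_one };
}

// E.g. passing a multiple of 8 from one call site and a multiple of 16
// from another still proves that the three low bits are zero.
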
ipa-prop could use a bit of cleaning up and modularizing, which I hope will
be done next stage1 :)

> 
> > - Once we have the value ranges for parameter/return values, we could
> > rely on tree-vrp to use this and do the optimizations
> 
> Yep.  IPA transform phase should annotate parameter default defs with
> computed ranges.

Yep; in addition we will end up with known value ranges stored in aggregates,
and for that we need a better separate representation.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930
> 
> > Does this make any sense? Any thoughts/suggestions to work on this is
> > highly appreciated.
> 
> IPA alignment propagation should already be somewhat similar as in doing
> an intersection step during propagation.

The merge operation is the same as VRP does, but yep, the IPA alignment code is
a good one to follow for a start.  Martin implemented it in a bit of a limited
way, so it never performs cloning, but we can deal with that incrementally.
> 
> Richard.
> 
> > Thanks,
> > Kugan


Re: help with PR69133

2016-01-18 Thread Jan Hubicka
> On Mon, Jan 18, 2016 at 10:28 AM, Prathamesh Kulkarni
>  wrote:
> > On 17 January 2016 at 14:56, Prathamesh Kulkarni
> >  wrote:
> >> Hi,
> >> I was having a look at PR69133.
> >> It appears that with -flto-partition=none,
> >> cgraph_node::get_untransformed_body ()
> >> is called twice for node with asm_name _ZThn4_N11xercesc_3_11C5m_fn6ERKi.
> >> c++filt says it is: non-virtual thunk to xercesc_3_1::C::m_fn6(int const&)
> >>
> >> get_untransformed_body() calls
> >> lto_free_function_in_decl_state_for_node() which sets
> >> node->lto_file_data to NULL.
> >> Now 2nd time get_untransformed_body is called for the same node,
> >> node->lto_file_data is NULL and lto_get_decl_name_mapping() dereferences
> >> lto_file_data which results in segfault.
> >>
> >> I was wondering if gating on lto_file_data could be a reasonable solution ?
> >> if (!lto_file_data)
> >>   return false;
> >> I am not sure what value to return for this case (I chose false
> >> because of early exit).
> >> It prevents the ICE, and the patch passes bootstrap+test on
> >> x86_64-unknown-linux-gnu.
> >> With partitoning enabled, for the above test-case,
> >> get_untransformed_body () is called only once per node. However I
> >> don't understand why it gets called twice with -flto-partition=none.
> > Kugan pointed out to me the test-case doesn't ICE by passing -fno-ipa-icf.
> 
> It looks to me we are not supposed to call this twice on the same node
> but nothing guards against that (and we don't release the untransformed
> body either).  -flto-partition=none is special here becaue it combines WPA

You are supposed to read every function body just once from the LTO stream.
Doing it multiple times is a waste.  You can call get_body and
get_untransformed_body as many times as you want, as there is a guard:
bool
cgraph_node::get_untransformed_body (void)
{
  lto_file_decl_data *file_data;
  const char *data, *name;
  size_t len;
  tree decl = this->decl;

  if (DECL_RESULT (decl))
return false;

Once you read the body, DECL_RESULT is set.  The only way you can get past this
and ICE should be the case where you read the body and then throw it away.

> and late stage.  IPA ICF is the only IPA pass needing untransformed bodies
> during WPA.

That is not exactly true.  We also read untransformed bodies when merging
profiles.

I will check the PR.  ipa-icf is not supposed to get the body of thunks.

Honza


Re: ipa vrp implementation in gcc

2016-01-18 Thread Jan Hubicka
> On Mon, Jan 18, 2016 at 12:00 AM, Kugan
>  wrote:
> > Hi,
> >
> >> Another potential use of value ranges is the profile estimation.
> >> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf
> >> It seems to me that we may want to have something that can feed sane loop
> >> bounds for profile estimation as well and we can easily store the known
> >> value ranges to SSA name annotations.
> >> So I think separate local pass to compute value ranges (perhaps with less
> >> accuracy than full blown VRP) is desirable.
> >
> > Thanks for the reference. I am looking at implementing a local pass for
> > VRP. The value range computation in tree-vrp is based on the above
> > reference and uses ASSERT_EXPR insertion (I understand that you posted
> > the reference above for profile estimation). As Richard mentioned in his
> > reply, the local pass should not rely on ASSERT_EXPR insertion.
> > Therefore, do you have any specific algorithm in mind (i.e. Any
> > published paper or reference from book)?. Of course we can tweak the
> > algorithm from the reference above but would like to understand what
> > your intension are.
> 
> I have (very incomplete) prototype patches to do a dominator-based
> approach instead (what is refered to downthread as non-iterating approach).
> That's cheaper and is what I'd like to provide as an "utility style" interface
> to things liker niter analysis which need range-info based on a specific
> dominator (the loop header for example).

In general, given that we have an existing VRP implementation, I would suggest
first implementing the IPA propagation and profile estimation bits using the
existing VRP pass, and then comparing the simple dominator-based approach
with the VRP we have to see what the compile-time/code-quality effects of both
are.  Based on that we can decide how complex a VRP we really want.

It will probably also be more fun to implement it this way :)
I plan to collect some data on early VRP and Firefox today or tomorrow.

Honza
> 
> Richard.
> 
> >> I think the ipa-prop.c probably won't need any siginificant changes.  The
> >> code basically analyze what values are passed thorugh the function and
> >> this works for constants as well as for intervals. In fact ipa-cp already
> >> uses the same ipa-prop analysis for
> >>  1) constant propagation
> >>  2) alignment propagation
> >>  3) propagation of known polymorphic call contextes.
> >>
> >> So replacing 1) by value range propagation should be easily doable.
> >> I would also like to replace alignment propagation by bitwise constant
> >> propagation (i.e. propagating what bits are known to be zero and what
> >> bits are known to be one). We already do have bitwise CCP, so we could
> >> deal with this basically in the same way as we deal with value ranges.
> >>
> >> ipa-prop could use bit of clenaing up and modularizing that I hope will
> >> be done next stage1 :)
> >
> > We (Myself and Prathamesh) are interested in working on LTO
> > improvements. Let us have a look at this.
> >
> >>>
>  - Once we have the value ranges for parameter/return values, we could
>  rely on tree-vrp to use this and do the optimizations
> >>>
> >>> Yep.  IPA transform phase should annotate parameter default defs with
> >>> computed ranges.
> >>
> >> Yep, in addition we will end up with known value ranges stored in 
> >> aggregates
> >> for that we need better separate representaiton
> >>
> >> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930
> >>>
> >
> > Thanks,
> > Kugan


Re: r235766 incomplete?

2016-05-02 Thread Jan Hubicka
> On Mon, 2016-05-02 at 11:50 -0600, Martin Sebor wrote:
> > Hi Jan,
> > 
> > I just noticed the compilation errors in the attached file with
> > the latest trunk.  It seems as though your recent patch below may
> > be incomplete:
> > 
> >commit 46e5dccc6f188bd0fd5af4e9778f547ab63c9cae
> >Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
> >Date: Mon May 2 16:55:56 2016 +
> > 
> > The following change causes compilation errors due to
> > ipa_find_agg_cst_for_param taking just three arguments, while it
> > is being called with four.  (I haven't looked into the other error.)
> > 
> > Regards
> > Martin
> > 
> > --- a/gcc/ipa-inline-analysis.c
> > +++ b/gcc/ipa-inline-analysis.c
> > @@ -850,7 +850,8 @@ evaluate_conditions_for_known_args (struct 
> > cgraph_node *node
> > ,
> >if (known_aggs.exists ())
> >  {
> >agg = known_aggs[c->operand_num];
> > - val = ipa_find_agg_cst_for_param (agg, c->offset, c
> > ->by_ref);
> > + val = ipa_find_agg_cst_for_param (agg, 
> > known_vals[c->operand_num],
> > +   c->offset, c
> > ->by_ref);
> 
> I saw this too (with r235766).  I believe it's fixed by r235770 and
> r235771:
> 
>  2016-05-02  Jan Hubicka  <hubi...@ucw.cz>
>  
>   * cif-code.def (CIF_THUNK): Add.
>   * ipa-inline-analsysis.c (evaluate_conditions_for_known_args): Revert
>   accidental change.
> 
> (albeit with a typo in that second filename)
Uh, thanks.  Will fix that with the next commit.

I managed to accidentally bundle unrelated changes into the patch.  My apologies
for that.  Will try to keep my commit tree clean.
Honza


Re: How to check if two symbols are from same source unit during WPA ?

2016-04-15 Thread Jan Hubicka
> On Thu, 14 Apr 2016, Ramana Radhakrishnan wrote:
> 
> > 
> > > 
> > > What happens in practice?  GCC doesn't put functions in random
> > > partitions.
> > > 
> > 
> > The data goes into a separate partition AFAIU - it means that all data 
> > accesses are as though they are extern references which means there's 
> > not necessarily any CSE'ing ability that's available with section 
> > anchors.
> 
> No, they are added to partitions referencing them.  Actually they
> end up in the first partition that references them.

Yeah, this can be easily tuned to distribute variables last and put them
into the partition referencing them most often.

Balanced partitioning tries to preserve source order (unless FDO says otherwise),
and thus we could add anchors to the cost metrics it uses to split partitions.
> 
> > >> If it's not desired by default could we gate it on an option ?
> > >> AFIAU, section anchors optimization is important for ARM and AArch64.
> > > 
> > > For code size or for performance?  I wonder why section anchors cannot
> > > be "implemented" using some special relocations and thus as linker
> > > optimization.
> > 
> > For performance (and probably code size too) as you can end up CSE'ing 
> > the anchor point. the difference in performance with -flto-partitions=1 
> > is visible on quite a few of the spec2k benchmarks. I don't remember 
> > which ones immediately but yeah it makes a difference.
> 
> Yeah, as said elsewhere for things like spec2k we should have been
> less aggressive with the goal to parallelize and build larger
> partitions for small programs.  Thus increase lto-min-partition
> up to a point where all benchmarks end up in a single partition
> (small benchmarks, so spec2k6 doesn't count at all).  After all
> our inlining limits only start with large-unit-insns as well,
> which is 10 times of lto-min-partition...

Well, I don't think large-unit-insns is particularly related to
lto-min-partition.  But lto-min-partition was not really tuned at all
(I remember I dumped the value for some small spec2k benchmark and based
it on that).  We could do some experiments.  But for larger programs we
want a more stable solution.  Read/write globals are not that common,
but read-only globals are common in modern and large codebases, too.

A possibility is to assign anchors at WPA time, disabling any chance for
the variables to be optimized out later.

Honza
> 
> Richard.


Re: [RFC] noipa attribute (was Re: How to avoid constant propagation into functions?)

2016-12-15 Thread Jan Hubicka
> Hi!
> 
> So here is a proof of concept of an attribute that disables inlining,
> cloning, ICF, IPA VRP, IPA bit CCP, IPA RA, pure/const/throw discovery.
> Does it look reasonable?  Anything still missing?

I think you also want to disable the optimizations we do for local functions
(stack alignment propagation, calling convention changes), as well as IPA-SRA,
IPA-PTA and possibly ipa-split.

If you want it to be bi-directional, it is also necessary to prevent inlining
into the function with the attribute and similarly to avoid using other results
of IPA analysis on it.

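For illustration (my own example, not one from the patch), the intended effect
is roughly this: the callee must be treated as if it could be interposed, so
no constants are propagated into it and the call is neither inlined nor cloned:

static int __attribute__ ((noipa))
half (int x)
{
  return x / 2;       /* body stays opaque to callers */
}

int
use (void)
{
  return half (42);   /* stays a real call; 21 is not computed at compile time */
}
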
Honza

> No testsuite coverage yet, I bet we'd want to check for all those opts and
> see that they aren't happening if the attribute is present.
> 
> 2016-12-15  Jakub Jelinek  
> 
>   * attribs.c (decl_attributes): Imply noinline, noclone, no_icf and
>   used attributes for noipa attribute.  For naked attribute use
>   lookup_attribute first before lookup_attribute_spec.
>   * final.c (rest_of_handle_final): Disable IPA RA for functions with
>   noipa attribute.
>   * cgraph.c (cgraph_node::get_availability): Return AVAIL_INTERPOSABLE
>   for functions with noipa attribute.
>   * doc/extend.texi: Document noipa function attribute.
> c-family/
>   * c-attribs.c (c_common_attribute_table): Add noipa attribute.
>   (handle_noipa_attribute): New function.
> 
> --- gcc/attribs.c.jj  2016-10-31 13:28:05.0 +0100
> +++ gcc/attribs.c 2016-12-15 10:50:54.809973594 +0100
> @@ -404,8 +404,8 @@ decl_attributes (tree *node, tree attrib
>   those targets that support it.  */
>if (TREE_CODE (*node) == FUNCTION_DECL
>&& attributes
> -  && lookup_attribute_spec (get_identifier ("naked"))
> -  && lookup_attribute ("naked", attributes) != NULL)
> +  && lookup_attribute ("naked", attributes) != NULL
> +  && lookup_attribute_spec (get_identifier ("naked")))
>  {
>if (lookup_attribute ("noinline", attributes) == NULL)
>   attributes = tree_cons (get_identifier ("noinline"), NULL, attributes);
> @@ -414,6 +414,26 @@ decl_attributes (tree *node, tree attrib
>   attributes = tree_cons (get_identifier ("noclone"),  NULL, attributes);
>  }
>  
> +  /* A "noipa" function attribute implies "noinline", "noclone", "no_icf" and
> + "used" for those targets that support it.  */
> +  if (TREE_CODE (*node) == FUNCTION_DECL
> +  && attributes
> +  && lookup_attribute ("noipa", attributes) != NULL
> +  && lookup_attribute_spec (get_identifier ("noipa")))
> +{
> +  if (lookup_attribute ("noinline", attributes) == NULL)
> + attributes = tree_cons (get_identifier ("noinline"), NULL, attributes);
> +
> +  if (lookup_attribute ("noclone", attributes) == NULL)
> + attributes = tree_cons (get_identifier ("noclone"),  NULL, attributes);
> +
> +  if (lookup_attribute ("no_icf", attributes) == NULL)
> + attributes = tree_cons (get_identifier ("no_icf"),  NULL, attributes);
> +
> +  if (lookup_attribute ("used", attributes) == NULL)
> + attributes = tree_cons (get_identifier ("used"),  NULL, attributes);
> +}
> +
>   targetm.insert_attributes (*node, &attributes);
>  
>for (a = attributes; a; a = TREE_CHAIN (a))
> --- gcc/final.c.jj2016-11-25 09:49:47.0 +0100
> +++ gcc/final.c   2016-12-15 11:38:10.660080949 +0100
> @@ -4473,7 +4473,8 @@ rest_of_handle_final (void)
>assemble_start_function (current_function_decl, fnname);
>final_start_function (get_insns (), asm_out_file, optimize);
>final (get_insns (), asm_out_file, optimize);
> -  if (flag_ipa_ra)
> +  if (flag_ipa_ra
> +  && !lookup_attribute ("noipa", DECL_ATTRIBUTES 
> (current_function_decl)))
>  collect_fn_hard_reg_usage ();
>final_end_function ();
>  
> --- gcc/cgraph.c.jj   2016-12-07 20:10:16.0 +0100
> +++ gcc/cgraph.c  2016-12-15 12:20:11.449481168 +0100
> @@ -2255,7 +2255,8 @@ cgraph_node::get_availability (symtab_no
>  avail = AVAIL_AVAILABLE;
>else if (transparent_alias)
> -    ultimate_alias_target (&avail, ref);
> -  else if (lookup_attribute ("ifunc", DECL_ATTRIBUTES (decl)))
> +  else if (lookup_attribute ("ifunc", DECL_ATTRIBUTES (decl))
> +|| lookup_attribute ("noipa", DECL_ATTRIBUTES (decl)))
>  avail = AVAIL_INTERPOSABLE;
>else if (!externally_visible)
>  avail = AVAIL_AVAILABLE;
> --- gcc/doc/extend.texi.jj2016-12-15 11:26:07.0 +0100
> +++ gcc/doc/extend.texi   2016-12-15 12:19:32.738996533 +0100
> @@ -2955,6 +2955,15 @@ asm ("");
>  (@pxref{Extended Asm}) in the called function, to serve as a special
>  side-effect.
>  
> +@item noipa
> +@cindex @code{noipa} function attribute
> +Disable interprocedural optimizations between the function with this
> +attribute and its callers, as if the body of the function is not available
> +when optimizing callers and the callers are unavailable when optimizing
> +the body.  This attribute implies @code{noinline}, @code{noclone},
> 

Re: Worse code after bbro?

2017-01-05 Thread Jan Hubicka
> On 01/04/2017 03:46 AM, Segher Boessenkool wrote:
> >On Wed, Jan 04, 2017 at 10:05:49AM +0100, Richard Biener wrote:
> >>>The code size is identical, but the trunk version executes one more
> >>>instruction everytime the loop runs (explicit jump to .L5 with trunk vs
> >>>fallthrough with 4.8) - it's faster only if the loop never runs. This
> >>>happens irrespective of the memory clobber inline assembler statement.
> >
> >With -Os you've asked for smaller code, not faster code.
> >
> >All of the block reordering is based on heuristics -- there is no polynomial
> >time and space algorithm to do it optimally, let alone the linear time and
> >space we need in GCC -- so there always will be cases we do not handle
> >optimally.  -Os does not get as much attention as -O2 etc., as well.
> >
> >OTOH this seems to be a pretty common case that we could handle.  Please
> >open a PR to keep track of this?
> I superficially looked at this a little while ago and concluded that it's
> something we ought to be able to handle.  However, it wasn't critical enough
> to me to get familiar enough with the bbro code to deeply analyze -- thus I
> put it into my gcc-8 queue.

The heuristics should handle such a simple case just fine.
I guess some bug crept in over the years.
> 
> 
> >
> >>I belive that doing BB reorder in CFG layout mode is fundamentally
> >>flawed but I guess it's wired up so that out-of-CFG layout honors
> >>EDGE_FALLTHRU.
> >
> >Why is this fundamentally flawed?  The reordering is much easier this way.
> Agreed (that we ought to be doing reordering in CFG layout mode).

In fact cfglayout was invented to implement bb-reorder originally :)

Honza
> 
> Jeff


Re: Question about inlining and strict_aliasing

2017-01-09 Thread Jan Hubicka
> Hello.
> 
> I've been working on a patch that would cope with target and optimization 
> (read PerFunction)
> in a proper way. I came to following test-case (slightly modified 
> ./gcc/testsuite/gcc.c-torture/execute/alias-1.c):
> 
> int val;
> 
> int *ptr = 
> float *ptr2 = 
> 
> static
> __attribute__((always_inline, optimize ("-fno-strict-aliasing")))
> typepun ()
> {
>   *ptr2=0;
> }
> 
> main()
> {
>   *ptr=1;
>   typepun ();
>   if (*ptr)
> __builtin_abort ();
> }
> 
> $ gcc -O2 /tmp/always-inline.c -fdump-tree-all-details && ./a.out
> Aborted (core dumped)
> 
> Problem is that einline does:
> 
>   Inlining typepun into main (always_inline).
>Estimating body: typepun/3
>Known to be false: not inlined
>size:2 time:2
> Dropping flag_strict_aliasing on main:4 << here is strict_aliasing drop
>   Accounting size:2.00, time:2.00 on predicate:(true)
> 
> However fre1 does not still check for flag_strict_aliasing. Is it bug or not?
> I did an experimental fix, but I guess there will me multiple places where 
> the 
> flag is checked.

Aha, I think the problem is that once we modify strict aliasing of cfun, we need
to re-set the global optimization attributes, because those are still copies
from the original declaration.
> 
> Second question:
> Current ipa-inline.c code contains question sophisticated 
> check_{match,maybe_up,maybe_down} and
> it's following by:
> 
>   /* When user added an attribute to the callee honor it.  */
>   else if (lookup_attribute ("optimize", DECL_ATTRIBUTES (callee->decl))
>  && opts_for_fn (caller->decl) != opts_for_fn (callee->decl))
>   {
> e->inline_failed = CIF_OPTIMIZATION_MISMATCH;
> inlinable = false;
>   }
> 
> I think it's very strict, as when one uses __attribute__((optimize)) with a 
> param that matches with
> command line arguments, it will return false. What if we remove the hunk?

I am not quite sure.  If you have function A with an optimization attribute and
you inline it into function B with the same command line parameters, it may
become possible to inline it into function C where A would not be inlined
because of the explicit optimization attribute.
If we relax this check, we will need to copy the optimization attribute from A 
to B at the time
we inline into B.
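
Roughly this situation (a hypothetical sketch; A carries the explicit
attribute):

static int __attribute__ ((optimize ("-fno-strict-aliasing")))
A (int *p) { return *p; }

/* If the check is relaxed, A may be inlined into B (same effective
   options), and later inlining B into C drags A's body into C even
   though A itself would not have been inlined there - unless B
   inherits A's optimization attribute.  */
static int B (int *p) { return A (p); }

int C (int *p) { return B (p); }
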
> 
> Thanks,
> Martin

> diff --git a/gcc/tree-ssa-alias.c b/gcc/tree-ssa-alias.c
> index 871fa121fd0..da38e4fab69 100644
> --- a/gcc/tree-ssa-alias.c
> +++ b/gcc/tree-ssa-alias.c
> @@ -986,7 +986,7 @@ ncr_compar (const void *field1_, const void *field2_)
>  static bool
>  nonoverlapping_component_refs_p (const_tree x, const_tree y)
>  {
> -  if (!flag_strict_aliasing
> +  if (!opts_for_fn (current_function_decl)->x_flag_strict_aliasing

This should not be necessary.  Einline ought to update the global opts.
I suppose when the decl being modified is cfun's one needs to reset
current_function_decl to NULL and then set it again, so that the global
flag-changing machinery notices the change.

Honza
>|| !x || !y
>|| TREE_CODE (x) != COMPONENT_REF
>|| TREE_CODE (y) != COMPONENT_REF)
> @@ -1167,7 +1167,7 @@ indirect_ref_may_alias_decl_p (tree ref1 
> ATTRIBUTE_UNUSED, tree base1,
>  return false;
>  
>/* Disambiguations that rely on strict aliasing rules follow.  */
> -  if (!flag_strict_aliasing || !tbaa_p)
> +  if (!opts_for_fn (current_function_decl)->x_flag_strict_aliasing || 
> !tbaa_p)
>  return true;
>  
>ptrtype1 = TREE_TYPE (TREE_OPERAND (base1, 1));
> @@ -1334,7 +1334,7 @@ indirect_refs_may_alias_p (tree ref1 ATTRIBUTE_UNUSED, 
> tree base1,
>  return false;
>  
>/* Disambiguations that rely on strict aliasing rules follow.  */
> -  if (!flag_strict_aliasing || !tbaa_p)
> +  if (!opts_for_fn (current_function_decl)->x_flag_strict_aliasing || 
> !tbaa_p)
>  return true;
>  
>ptrtype1 = TREE_TYPE (TREE_OPERAND (base1, 1));
> @@ -1508,7 +1508,7 @@ refs_may_alias_p_1 (ao_ref *ref1, ao_ref *ref2, bool 
> tbaa_p)
>  
>/* First defer to TBAA if possible.  */
>if (tbaa_p
> -  && flag_strict_aliasing
> +  && opts_for_fn (current_function_decl)->x_flag_strict_aliasing
>&& !alias_sets_conflict_p (ao_ref_alias_set (ref1),
>ao_ref_alias_set (ref2)))
>  return false;



GNU Tools Cauldron 2017, Prague, September 8-10

2017-04-03 Thread Jan Hubicka
Hi,
we are very pleased to invite you all to the GNU Tools Cauldron on 8-10 September
2017.  This year we will meet again in Prague, at Charles University.  Details
are here:

  https://gcc.gnu.org/wiki/cauldron2017

As usual, please register (capacity is limited), send abstracts and ask
administrivia questions to tools-cauldron-ad...@googlegroups.com.

I plan to contact GCC, GDB, binutils, CGEN, DejaGnu, newlib and glibc mailing
lists. Please feel free to share with any other groups that are appropriate.

Looking forward to seeing you in Prague,
Honza Hubicka


GNU Tools Cauldron 2017, September 8-10, 2017

2017-08-07 Thread Jan Hubicka
Hello,
I would like to remind you of the upcoming GNU Tools Cauldron meeting in Prague
September 8-10.  https://gcc.gnu.org/wiki/cauldron2017 (in 5 weeks!)

At present we have 83 registered participants and the room capacity
is a bit over 100 seats.  If you want to participate, please register soon.  We
also have 35 registered proposals for talks and BoFs, which will make it a busy
3 days.  (Those which were submitted with a title and abstract are now available
at https://gcc.gnu.org/wiki/cauldron2017#Abstracts.)

Because there will be two tracks, we are still open for last minute talk and
BoF proposals.

If you have a topic that you would like to present, please submit an abstract
describing what you plan to present. We are accepting three types of
submissions:

Prepared presentations: demos, project reports, etc.
BoFs: coordination meetings with other developers.
Tutorials for developers. No user tutorials, please. 

To register your abstract, send e-mail to tools-cauldron-ad...@googlegroups.com

Your submission should contain the following information:

Title:
Authors:
Abstract: 

If you intend to participate, but not necessarily present, please let us know
as well. Send a message to tools-cauldron-ad...@googlegroups.com stating your
intent to participate. Please indicate your affiliation, dietary requirements
and t-shirt size. 

Looking forward to seeing you soon in Prague,
Jan


Re: .bad_compare in bootstrap of head

2017-05-07 Thread Jan Hubicka
Hi,
I found the problem.  It was pretty obvious - we compute the sum of times twice:
once when computing statement sizes and a second time by summing the summaries.
Because sreal arithmetic is not exact, summing in a different order leads to
slightly different results.

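The effect is easy to demonstrate with plain float (an analogy only; sreal is
GCC's own limited-precision type, not IEEE float):

#include <stdio.h>

int
main (void)
{
  float a = 1e8f, b = -1e8f, c = 1.0f;
  /* (a + b) + c == 1, but a + (b + c) == 0 because b + c rounds back to
     -1e8f; summing the same values in a different order changes the
     result.  */
  printf ("%g %g\n", (a + b) + c, a + (b + c));
  return 0;
}
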
I have committed the following patch.  Incrementally I will drop the
code duplication.

Bootstrapped/regtested x86_64-linux

Honza

--- trunk/gcc/ipa-inline-analysis.c 2017/05/07 19:41:09 247728
+++ trunk/gcc/ipa-inline-analysis.c 2017/05/07 22:21:05 247729
@@ -3119,12 +3120,13 @@
   info->size = info->self_size;
   info->stack_frame_offset = 0;
   info->estimated_stack_size = info->estimated_self_stack_size;
-  if (flag_checking)
-{
-  inline_update_overall_summary (node);
-  gcc_assert (!(info->time - info->self_time).to_int ()
- && info->size == info->self_size);
-}
+
+  /* Code above should compute exactly the same result as
+ inline_update_overall_summary but because computation happens in
+ different order the roundoff errors result in slight changes.  */
+  inline_update_overall_summary (node);
+  gcc_assert (!(info->time - info->self_time).to_int ()
+ && info->size == info->self_size);
 }



Re: .bad_compare in bootstrap of head

2017-05-05 Thread Jan Hubicka
> > On 05/04/2017 08:31 AM, Jeff Law wrote:
> > >On 05/04/2017 07:26 AM, Дмитрий Дьяченко wrote:
> > >>Fedora 26 x86_64
> > >>r247595
> > >>
> > >>~/src/gcc_current/configure --prefix=/usr/local/gcc_current
> > >>--enable-static --enable-checking=no --enable-languages=c,c++,lto
> > >>--enable-plugin --disable-multilib
> > >And that may be the key, I've been able to get a similar failure.  Could
> > >easy be the checking flags.  I'll do some bisecting here.
> > >jeff
> > 
> > It's Jan's patch from April 30.
> > 
> > commit e062e35c7ff7743aa31868370bf5cb9f8da835dd
> > Author: hubicka 
> > Date:   Sun Apr 30 15:02:11 2017 +
> 
> I wonder how that patch can cause mismatches. Does it reproduce on one of
> compile farm machines (my x86-64 bootstrap works fine so does ia64 on terbium
> after fixing the gcc 4.1 issue yeterday).
> It would be great to have -fdump-ipa-inline-details dumps of the mismatching
> run. Perhaps it will be easy to spot the problem.

Aha, --enable-checking=no. I will check what effect it have.

Honza


Re: .bad_compare in bootstrap of head

2017-05-05 Thread Jan Hubicka
> On 05/04/2017 08:31 AM, Jeff Law wrote:
> >On 05/04/2017 07:26 AM, Дмитрий Дьяченко wrote:
> >>Fedora 26 x86_64
> >>r247595
> >>
> >>~/src/gcc_current/configure --prefix=/usr/local/gcc_current
> >>--enable-static --enable-checking=no --enable-languages=c,c++,lto
> >>--enable-plugin --disable-multilib
> >And that may be the key, I've been able to get a similar failure.  Could
> >easy be the checking flags.  I'll do some bisecting here.
> >jeff
> 
> It's Jan's patch from April 30.
> 
> commit e062e35c7ff7743aa31868370bf5cb9f8da835dd
> Author: hubicka 
> Date:   Sun Apr 30 15:02:11 2017 +

I wonder how that patch can cause mismatches.  Does it reproduce on one of the
compile farm machines?  (My x86-64 bootstrap works fine, and so does ia64 on
terbium after fixing the gcc 4.1 issue yesterday.)
It would be great to have -fdump-ipa-inline-details dumps of the mismatching
run.  Perhaps it will be easy to spot the problem.

Honza

> 
> PR ipa/79224
> * ipa-inline-analysis.c (dump_predicate): Add optional parameter NL.
> (account_size_time): Use two predicates - exec_pred and
> nonconst_pred_ptr.
> (evaluate_conditions_for_known_args): Compute both clause and
> nonspec_clause.
> (evaluate_properties_for_edge): Evaulate both clause and
> nonspec_clause.
> (inline_summary_t::duplicate): Update.
> (estimate_function_body_sizes): Caluculate exec and nonconst
> predicates
> separately.
> (compute_inline_parameters): Likewise.
> (estimate_edge_size_and_time): Update caluclation of time.
> (estimate_node_size_and_time): Compute both time and nonspecialized
> time.
> (estimate_ipcp_clone_size_and_time): Update.
> (inline_merge_summary): Update.
> (do_estimate_edge_time): Update.
> (do_estimate_edge_size): Update.
> (do_estimate_edge_hints): Update.
> (inline_read_section, inline_write_summary): Stream both new
> predicates.
> * ipa-inline.c (compute_uninlined_call_time): Take
> uninlined_call_time
> as argument.
> (compute_inlined_call_time): Cleanup.
> (big_speedup_p): Update.
> (edge_badness): Update.
> * ipa-inline.h (INLINE_TIME_SCALE): Remove.
> (size_time_entry): Replace predicate by exec_predicate and
> nonconst_predicate.
> (edge_growth_cache_entry): Cache both time nad nonspecialized time.
> (estimate_edge_time): Return also nonspec_time.
> (reset_edge_growth_cache): Update.
> 


Re: .bad_compare in bootstrap of head

2017-05-05 Thread Jan Hubicka
> > I wonder how that patch can cause mismatches. Does it reproduce on one of
> > compile farm machines (my x86-64 bootstrap works fine so does ia64 on 
> > terbium
> > after fixing the gcc 4.1 issue yeterday).
> > It would be great to have -fdump-ipa-inline-details dumps of the mismatching
> > run. Perhaps it will be easy to spot the problem.
> 
> Aha, --enable-checking=no. I will check what effect it have.

It does reproduce for me.  Time estimates are very slightly different between
the checking and non-checking builds.  I am investigating why.

Honza
> 
> Honza


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jan Hubicka
> On Wed, Sep 13, 2017 at 3:21 AM, Michael Clark  wrote:
> >
> >> On 13 Sep 2017, at 1:15 PM, Michael Clark  wrote:
> >>
> >> - https://rv8.io/bench#optimisation
> >> - https://rv8.io/bench#executable-file-sizes
> >>
> >> -O2 is 98% perf of -O3 on x86-64
> >> -Os is 81% perf of -O3 on x86-64
> >>
> >> -O2 saves 5% space on -O3 on x86-64
> >> -Os saves 8% space on -Os on x86-64
> >>
> >> 17% drop in performance for 3% saving in space is not a good trade for a 
> >> “general” size optimisation. It’s more like executable compression.
> >
> > Sorry fixed typo:
> >
> > -O2 is 98% perf of -O3 on x86-64
> > -Os is 81% perf of -O3 on x86-64
> >
> > -O2 saves 5% space on -O3 on x86-64
> > -Os saves 8% space on -O3 on x86-64

I am a bit surprised you see only an 8% code size difference
between -Os and -O3.  I look into these numbers occasionally and
it is usually well into the double digits.
http://hubicka.blogspot.cz/2014/04/linktime-optimization-in-gcc-2-firefox.html
http://hubicka.blogspot.cz/2014/09/linktime-optimization-in-gcc-part-3.html
has 42% code segment size reduction for Firefox and 19% for libreoffice

Honza


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jan Hubicka
> >I don't see static profile prediction as very useful here to find
> >"really hot code" - neither in the current implementation nor in the future.
> >The problem of -O2 is that we kind of know that only 10% of the code somewhere
> >matters for performance but we have no way to reliably identify it.
> 
> It's hard to do better than statically look at (ipa) loop depth. But 
> shouldn't that be good enough? 

Only if you assume that you have the whole program and understand indirect calls.
There are some stats on this here
http://ieeexplore.ieee.org/document/717399/

It shows that propagating a static profile across the whole program (which is just a
tiny bit more fancy than counting loop depth) sort of works statistically.  I
really do not have very high hopes of this reliably working in a production
compiler.  We already have PRs for single-function benchmarks where a deep loop
nest is used in initialization or so and the actual hard-working part has a small
loop nest & gets identified as cold.

As soon as you start propagating in a whole-program context, such local mistakes
will become more common.
> 
> >
> >It would make sense to have less aggressive vectorization at -O2 and
> >more at -Ofast/-O3.
> 
> We tried that but the runtime effects were not offsetting the compile time 
> cost. 

Yep, i remember that.

Honza


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jan Hubicka
> On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek  wrote:
> > On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> >> On its own -O3 doesn't add much (some loop opts and slightly more
> >> aggressive inlining/unrolling), so whatever it does we
> >> should consider doing at -O2 eventually.
> >
> > Well, -O3 adds vectorization, which we don't enable at -O2 by default.
> 
> As said, -fprofile-use enables it so -O2 should eventually do the same
> for "really hot code".

I don't see static profile prediction as very useful here to find "really
hot code" - neither in the current implementation nor in the future. The problem of
-O2 is that we kind of know that only 10% of the code somewhere matters for
performance but we have no way to reliably identify it.

It would make sense to have less aggressive vectorization at -O2 and more at
-Ofast/-O3.

Adding -Os and -Oz would make sense to me - even with hot/cold info it is not
desirable to optimize as aggressively for size as we do, because mistakes happen
and one does not want to make code paths 1000 times slower to save one byte
of binary.

We could handle this gracefully internally by having logic for "known to be cold"
and "guessed to be cold".  The new profile code can make a difference here.
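
Just to illustrate what I mean by the two levels (purely a sketch, nothing like
this exists in GCC today):

  // Two grades of coldness: what the profile/user guarantees vs. what the
  // static predictor merely guesses.
  enum class coldness { hot, guessed_cold, known_cold };

  // Full -Oz-style tradeoffs only where we are certain the code is cold...
  static bool optimize_aggressively_for_size (coldness c)
  {
    return c == coldness::known_cold;
  }

  // ...and only mild -Os-style tradeoffs where coldness is merely guessed.
  static bool optimize_mildly_for_size (coldness c)
  {
    return c != coldness::hot;
  }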

Honza
> 
> Richard.
> 
> > Jakub


Re: GNU Tools Cauldron 2017: OMP (Offloading and Multi Processing) BoF

2017-09-04 Thread Jan Hubicka
> Hi!
> 
> As Honza told me recently, it has been proposed by Martin -- I don't know
> which one ;-) -- and certainly makes sense, to have another OMP
> (Offloading and Multi Processing) BoF at the GNU Tools Cauldron 2017,
> which is currently scheduled for Saturday, 11:00 to 11:45.  That is, a
> general OMP BoF session in addition to more specific individual
> presentations.
> 
> Honza volunteered me to host it ;-) -- to which I'm not immediately

To be precise, I also volunteered Martin (Jambor) to propose it.
In any case it seems it would be a useful thing to have on the programme. Let me
know if I should keep it there ;)

Honza
> objecting, but my time to prepare anything at all is very limited
> (non-existing?) right now, and I still got to first organize stuff/myself
> for my other presentation slot...
> 
> We could do it off the cuff, but it'd also be very useful to know
> beforehand a handful of topics to talk about.  So, who's interested in
> contributing to this OMP BoF?  I put some candidates in CC, but
> everyone's welcome to contribute, of course!  I suggest this should
> really be BoF-like: more interaction and discussion instead of too much
> presentation.
> 
> 
> Regards
>  Thomas


Slides from GNU Tools Cauldron

2017-10-04 Thread Jan Hubicka
Hello,
all but one of the videos from this year's Cauldron have been edited and are now linked
from https://gcc.gnu.org/wiki/cauldron2017 (the plugins BoF will appear by the end
of the week).

I would also like to update the page with links to slides.  If someone beats me
to this and adds some or all of them as attachments to the page, I will be
very happy :)

Honza


Re: Vectorizer size preferences (-mprefer-* on x86)

2018-06-15 Thread Jan Hubicka
> 
> Hi,
> 
> I'm in the process of changing the vectorizer to consider all
> vector sizes as advertised by targetm.autovectorize_vector_sizes
> and to decide which one to use based on its cost model.
> 
> I expect that to make sense for example when choosing between
> AVX128 and AVX256 since the latter may have penalties for
> cross-lane operations which the former lacks.
> 
> Now while the option -mprefer-* and the documentation (to
> some extent) suggests that the user communicates a preference
> the actual implementation in the x86 backend uses -mprefer-*
> to disable the use of larger vector sizes.  That also makes
> targetm.preferred_simd_mode somewhat redundant.
> 
> In the light of using the cost model to decide on the vector
> width used how should we go forward here?
> 
> I would suggest to more clearly document that -mprefer-*
> will disable the use of larger vector sizes and that smaller
> vector sizes might be used if costs tell us to do so (or loops
> do not roll enough).  Implementation-wise the size chosen
> by targetm.preferred_simd_mode could be used as tie-breaker
> in case equal costs are computed.

I suppose one useful meaning of -mprefer would be to override
the decision heuristics to only consider the mode specified
(it is what we do at the moment anyway, right?),
so one can use it in case your heuristics make the wrong choice.
But disabling only bigger vector sizes makes sense to me as well.

Honza
> 
> In the (distant?) future we might want to support mixed-size
> vectorization which makes all this even more complicated.
> Still having -mprefer-* disabling vector sizes makes sense
> here?
> 
> Thanks,
> Richard.


Re: bootstrap failure due to declaration mismatch in r260956

2018-05-30 Thread Jan Hubicka
> On 05/30/2018 12:27 PM, Gerald Pfeifer wrote:
> >On Wed, 30 May 2018, Martin Sebor wrote:
> >>I think your r260956 is missing the following hunk:
> >
> >If this fixes the bootstrap for you (also ran into this myself
> >just now), can you please go ahead and commit?
> >
> >We can always sort out things later, if there are details to be
> >tweaked, but fixing a bootstrap failure with a simple one-liner
> >like this, let's not get process-heavy and just do it. ;-)
> 
> Jakub already committed the missing change in r260970 so bootstrap
> should be working again.

I apologize for that. I left the svn commit waiting for a commit
log entry and did not notice that :(

Thanks for fixing it!
Honza
> 
> Thanks
> Martin
> 


Re: Problems in IPA passes

2017-10-28 Thread Jan Hubicka

On 2017-10-28 09:28, Jeff Law wrote:

Jan,

What's the purpose behind calling vrp_meet and
extract_range_from_unary_expr from within the IPA passes?

AFAICT that is not safe to do.  Various paths through those routines
will access static objects within tree-vrp.c which may not be
initialized when IPA runs (vrp_equiv_obstack, vr_value).

While this seems to be working today, it's a failure waiting to happen.

Is there any way you can avoid using those routines?  I can't believe
you really need all the complexity of those routines, particularly
extract_range_from_unary_expr.  Plus it's just downright fugly from a
modularity standpoint.


We now have three value range propagation passes: the original vrp, early vrp
which is faster, and ipa-cp which does IPA propagation.  It seems sane to me for
those three to share some infrastructure, but I did not look in enough detail
into these to know whether what we have right now is a sane thing to do :)

The VRP passes were introduced by Kugan and this part was mostly reviewed by
Richi, so I am adding them to CC.
The refactoring happened in 
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00891.html


Honza



?

Jeff




Re: Resolving LTO symbols for static library boundary

2018-02-05 Thread Jan Hubicka

On 2018-02-05 18:44, Richard Biener wrote:

On February 5, 2018 12:26:58 PM GMT+01:00, Allan Sandfeld Jensen
 wrote:

Hello GCC

In trying to make it possible to use LTO for distro-builds of Qt, I
have again
hit the problem of static libraries. In Qt in general we for LTO rely
on a
library boundary, where LTO gets resolved when generating the library
but no
LTO-symbols are exported in the shared library. This ensure the 
library

has a
well defined binary compatible interface and gets LTO optimizations
within
each library. For some private libraries we use static libraries
however,
because we don't need binary compatibility, but though we don't need 
BC


between Qt versions, the libraries should still be linkable with
different gcc
versions (and with different compilers). However when LTO is enabled
the
static libraries will contain definitions that depend on a single gcc
version
making it unsuitable for distribution.

One solution is to enable fat-lto object files for static libraries 
but

that
is both a waste of space and compile time, and disables any LTO
optimization
within the library. Ideally I would like to have the static library do
LTO
optimizations internally just like a shared library, but then exported
as
static library.

I suspect this is more of gcc task than ar/ld task, as it basically
boils down
to gcc doing for a static library what it does for shared library, but
maybe
export the result as a single combined .o file, that can then be ar'ed
into a
compatible static library.

Is this possible?


Hmm. I think you could partially link the static archive contents into
a single relocatable object. Or we could add a mode where you do a
1to1 LTO link of the objects and stop at the ltrans object files. You
could stuff those into an archive again.

I'm not sure how far Honza got partial LTO linking to work?


Partial linking of LTO .o files into a single non-LTO .o file should work
and it will get you cross-module optimization.  The problem is that
without resolution info the compiler needs to assume that all symbols
exported by the object files are possibly referenced by the later
incremental link, and thus the code quality will definitely not be
comparable with what you get for LTO on the final binary or DSO.  Still,
it should be better than a non-LTO build.

I would be curious whether it is useful for you in practice.

Honza


Richard.


Best regards
'Allan Jensen




Re: How big (and fast) is going to be GCC 8?

2018-03-12 Thread Jan Hubicka
Hello,
I have also re-done most of my Firefox testing, similar to the ones I published at
http://hubicka.blogspot.cz/2014/04/linktime-optimization-in-gcc-2-firefox.html
(thanks to Martin Liska who got LTO builds to work again)

I am attaching statistics on binary sizes.  Interestingly, for Firefox LTO is quite a
good size optimization (16% on text) and similarly FDO reduces text size and combines
well with LTO, which is a bit different from Martin's gcc stats.  I have looked into
this very briefly and one issue seems to be with the way we determine the hot/cold
threshold.

binary size             text       relocations  data       EH         rest
gcc6 -O3                90448658   12887358     13720073   13035704   257839
gcc6 -O3 -flto          75810786   12145211     12390185    8422776   240002
gcc6 -O3 + FDO          67087824   13008294     13655305   13719944   259585
gcc6 -O3 -flto + FDO    60206898   12169803     12334113    9083088   240050
gcc7 -O3                93233440   12928831     13780313   13578224   257408
gcc7 -O3 -flto          76764274   12128031     12405369    8420448   240662
gcc7 -O3 + FDO          67500688   12994279     13650185   13661760   263400
gcc7 -O3 -flto + FDO    59776994   12151360     12325217    8971344   239501
gcc8 -O2                80311120   12939568     13763033   12948752   258711
gcc8 -O2 -flto          69156752   12109236     12475801    8501152   240163
gcc8 -O3                89913648   12924468     13790393   13374328   256867
gcc8 -O3 -flto          75971122   12138528     12426649    8593024   239861
gcc8 -O3 + FDO          67047632   12996890     13707017   13146232   263413
gcc8 -O3 -flto + FDO    58951410   12146008     12377161    8634152   241765

I also did some builds with clang. The observation is that clang's -O3 binary is
smaller than ours, while our LTO/FDO builds are smaller than clang's (the LTO+FDO
build quite substantially).
Our EH is bigger than clang's, which is probably something to look into.  One
problem I am aware of is that our nothrow pass is not type sensitive and thus
won't figure out whether a program throws an exception of a specific type and
catches it later.

clang6 -O3                 84754848   13032018     13597433   10791528   371429
clang6 -O3 -flto           90757024   12273574     12258521    6841424   350585
clang6 -O3 -flto=thin      92940576   12376724     12479233    7974856   353171
clang6 -O3 + FDO           81776880   13136428     13574489   11501344   385123
clang6 -O3 -flto=thin+FDO  88374432   12405075     12434297    9574416   356508
clang6 -O3 -flto + FDO     90637168   12288433     12244265    9023304   349078

I also did some benchmarking and found at least one issue with -flto -O3 hitting
--param inline-unit-growth a bit too early, so we do not get much benefit (while
clang does, but it also does not reduce binary size). -O3 -flto + FDO or -O2
-flto seem to work well. I will summarize the results later.

Firefox developer Tom Ritter has tested LTO with FDO and without here (it is a
rather nice interface - I like that one can click on the graph and see the
results in the context of other tests done recently).  This is done with gcc6.

Tracking bug:
https://bugzilla.mozilla.org/show_bug.cgi?format=default=521435

non-FDO build:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central=try=12ce14a5bcac9975b41a1f901bfc3a8dcb2d791b=1=1=172800

FDO build:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central=try=7e5bd52e36fcc1703ced01fe87e831a716677295=1=1=172800

Honza


Re: GSOC 2018 - Textual LTO dump tool project

2018-03-06 Thread Jan Hubicka
> On Tue, Mar 6, 2018 at 4:02 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> On Tue, Mar 6, 2018 at 2:30 PM, Hrishikesh Kulkarni
> >> <hrishikeshpa...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Thank you Richard and Honza for the suggestions. If I understand 
> >> > correctly,
> >> > the issue is that LTO file format keeps changing per compiler versions, 
> >> > so
> >> > we need a more “stable” representation and the first step for that would 
> >> > be
> >> > to “stabilize” representations for lto-cgraph and symbol table ?
> >>
> >> Yes.  Note the issue is that the current format is a 1:1 representation of
> >> the internal representation -- which means it is the internal 
> >> representation
> >> that changes frequently across releases.  I'm not sure how Honza wants
> >> to deal with those changes in the context of a "stable" IL format.  Given
> >> we haven't been able to provide a stable API to plugins I think it's much
> >> harder to provide a stable streaming format for all the IL details
> >>
> >> > Could you
> >> > please elaborate on what initial steps need to be taken in this regard, 
> >> > and
> >> > if it’s feasible within GSoC timeframe ?
> >>
> >> I don't think it is feasible in the GSoC timeframe (nor do I think it's 
> >> feasible
> >> at all ...)
> >
> > I skipped this, with the GSoC timeframe I fully agree.  With feasibility at all
> > not so much - LLVM documents its bitcode to a reasonable extent:
> > https://llvm.org/docs/BitCodeFormat.html
> >
> > The reason why I mentioned it is that I would like to use this as an excuse to
> > get things incrementally cleaned up and it would be nice to keep it in mind
> > while working on this.
> 
> Ok.  It's probably close enough to what I recommended doing with respect
> to make the LTO bytecode "self-descriptive" -- thus start with making the
> structure documented and parseable without assigning semantics to
> every bit ;)  I think that can be achieved top-down in a very incremental
> way if you get the bottom implemented first (the data-streamer part).

OK :)
I did not mean to document every bit either, at least not for the fancy parts.
It would be nice to have cleaned up e.g. the section headers/footers so they
do not depend on endianness and slowly clean up similar nonsense at higher
levels.  So it may make sense to progress from both directions, lower-hanging
fruit first.

Honza


Re: GSOC 2018 - Textual LTO dump tool project

2018-03-06 Thread Jan Hubicka
> On Tue, Mar 6, 2018 at 2:30 PM, Hrishikesh Kulkarni
> <hrishikeshpa...@gmail.com> wrote:
> > Hi,
> >
> > Thank you Richard and Honza for the suggestions. If I understand correctly,
> > the issue is that LTO file format keeps changing per compiler versions, so
> > we need a more “stable” representation and the first step for that would be
> > to “stabilize” representations for lto-cgraph and symbol table ?
> 
> Yes.  Note the issue is that the current format is a 1:1 representation of
> the internal representation -- which means it is the internal representation
> that changes frequently across releases.  I'm not sure how Honza wants
> to deal with those changes in the context of a "stable" IL format.  Given
> we haven't been able to provide a stable API to plugins I think it's much
> harder to provide a stable streaming format for all the IL details

Well, because I think it would be good for us to formalize our IL more -
document it properly, remove stuff which is not necessary and get an API + file
representation. Those things are connected to each other and will need work.

If you look at how much things change, it is not very frequent that we would change
what is in our CFG (I changed the profile representation this release), how gimple
tuples are represented and what gimple instructions we have.  I think those parts are
reasonably well defined. Even though I changed the profile this release, it is a
relatively localized type of change.  I am still more commonly changing the symbol
table as it needs to adapt to all LTO details, but I hope to be basically done.

What is more in flux are trees, which we will hopefully deal with by defining
gimple types now that early debug is done.

What we can realistically do now is to first aim to stream the better defined
parts in externally parseable sections which do have documentation.  So far the
only externally parseable section is the plugin symbol table, but we should be
able to do so with reasonable effort for symbol tables, CFGs and gimple
instruction streams.

In parallel we can incrementally deal with trees, mostly hopefully by getting rid
of them (moving symbol names/etc. to the symbol table so it can live w/o
declarations, having gimple types etc.)

> 
> > Could you
> > please elaborate on what initial steps need to be taken in this regard, and
> > if it’s feasible within GSoC timeframe ?
> 
> I don't think it is feasible in the GSoC timeframe (nor do I think it's 
> feasible
> at all ...)
> 
> > Thanks!
> >
> >
> > I am trying to break down the project into milestones for the proposal. So
> > far, I have identified the following objectives:
> >
> > 1] Creating a separate driver, that can read LTO object files. Following
> > Richard’s estimate, I’d leave around first half of the period for this task.
> >
> > Would that be OK ?
> 
> Yes.
Yes, it looks good to me too.
> 
> > Coming to 2nd half:
> >
> > 2] Dumping pass summaries.
> >
> > 3] Stabilizing lto-cgraph and symbol table.
> 
> So I'd instead do
> 
>  3] Enhance the user-interface of the driver
> 
> like providing a way to list all function bodies, a way to dump
> the IL of a single function body, a way to create a dot graph file
> for the cgraph in the file, etc.
> 
> Basically while there's a lot of dumping infrastructure in GCC
> it may not always fit the needs of a LTO IL dumping tool 1:1
> and may need refactoring enhancement.

I would agree here - dumping pass summaries would be nice but we already have
that, more or less.  All IPA passes dump their summary into the beginning of their
dump file and I find that relatively sufficient to deal with, mostly because
summaries are quite simple.  It is much harder to deal with the global stream of
trees and function bodies themselves.

Honza
> 
> Richard.
> 
> >
> > Thanks,
> >
> > Hrishikesh
> >
> >
> >
> > On Fri, Mar 2, 2018 at 6:31 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >>
> >> Hello,
> >> > On Fri, Mar 2, 2018 at 10:24 AM, Hrishikesh Kulkarni
> >> > <hrishikeshpa...@gmail.com> wrote:
> >> > > Hello everyone,
> >> > >
> >> > >
> >> > > Thanks for your suggestions and engaging response.
> >> > >
> >> > > Based on the feedback I think that the scope of this project comprises
> >> > > of
> >> > > following three indicative actions:
> >> > >
> >> > >
> >> > > 1. Creating separate driver i.e. separate dump tool that uses lto
> >> > > object API
> >> > > for reading the lto file.
> >> >
> >> > Yes.  I expect this will take th

Re: How big (and fast) is going to be GCC 8?

2018-03-06 Thread Jan Hubicka
> On Tue, Mar 6, 2018 at 11:12 AM, Martin Liška  wrote:
> > Hello.
> >
> > Many significant changes have landed in mainline and will be released as GCC
> > 8.1.
> > I decided to use various GCC configs we have and test how their
> > configurations differ in size and also binary size.
> >
> > This is first part where I measured binary size, speed comparison will 
> > follow.
> > Configuration names should be self-explaining, the 'system-*' is built done
> > without bootstrap with my system compiler (GCC 7.3.0). All builds are done
> > on my Intel Haswell machine.
> 
> So from the numbers I see that bootstrap causes a 8% bigger binary compared
> to non-bootstrap using GCC 7.3 at -O2 when including debug info and 1.2%
> larger stripped.  That means trunk generates larger code.

It is a bit odd indeed because the size stats from SPEC seem to imply otherwise.
It would be nice to work that out.  Also I am surprised that LTO increases text
size even for the non-plugin build.  That should not happen.
These issues are generally hard to debug though.  I will try to take a look.

I will send similar stats for my Firefox experiments. If you have scripts to collect
them, they would be welcome.

Thanks for looking into this!
Honza
> 
> What is missing is a speed comparison of the various binaries -- you could
> try measuring this by doing a make all-gcc for a non-bootstrap config
> (so it uses -O2 -g and doesn't build target libs with the built compiler).
> 
> Richard.
> 
> > Feel free to reply if you need any explanation.
> > Martin


Re: GSOC 2018 - Textual LTO dump tool project

2018-03-06 Thread Jan Hubicka
> On Tue, Mar 6, 2018 at 2:30 PM, Hrishikesh Kulkarni
> <hrishikeshpa...@gmail.com> wrote:
> > Hi,
> >
> > Thank you Richard and Honza for the suggestions. If I understand correctly,
> > the issue is that LTO file format keeps changing per compiler versions, so
> > we need a more “stable” representation and the first step for that would be
> > to “stabilize” representations for lto-cgraph and symbol table ?
> 
> Yes.  Note the issue is that the current format is a 1:1 representation of
> the internal representation -- which means it is the internal representation
> that changes frequently across releases.  I'm not sure how Honza wants
> to deal with those changes in the context of a "stable" IL format.  Given
> we haven't been able to provide a stable API to plugins I think it's much
> harder to provide a stable streaming format for all the IL details
> 
> > Could you
> > please elaborate on what initial steps need to be taken in this regard, and
> > if it’s feasible within GSoC timeframe ?
> 
> I don't think it is feasible in the GSoC timeframe (nor do I think it's 
> feasible
> at all ...)

I skipped this, with the GSoC timeframe I fully agree.  With feasibility at all
not so much - LLVM documents its bitcode to a reasonable extent:
https://llvm.org/docs/BitCodeFormat.html

The reason why I mentioned it is that I would like to use this as an excuse to get
things incrementally cleaned up and it would be nice to keep it in mind while
working on this.

Honza
> 
> > Thanks!
> >
> >
> > I am trying to break down the project into milestones for the proposal. So
> > far, I have identified the following objectives:
> >
> > 1] Creating a separate driver, that can read LTO object files. Following
> > Richard’s estimate, I’d leave around first half of the period for this task.
> >
> > Would that be OK ?
> 
> Yes.
> 
> > Coming to 2nd half:
> >
> > 2] Dumping pass summaries.
> >
> > 3] Stabilizing lto-cgraph and symbol table.
> 
> So I'd instead do
> 
>  3] Enhance the user-interface of the driver
> 
> like providing a way to list all function bodies, a way to dump
> the IL of a single function body, a way to create a dot graph file
> for the cgraph in the file, etc.
> 
> Basically while there's a lot of dumping infrastructure in GCC
> it may not always fit the needs of a LTO IL dumping tool 1:1
> and may need refactoring enhancement.
> 
> Richard.
> 
> >
> > Thanks,
> >
> > Hrishikesh
> >
> >
> >
> > On Fri, Mar 2, 2018 at 6:31 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >>
> >> Hello,
> >> > On Fri, Mar 2, 2018 at 10:24 AM, Hrishikesh Kulkarni
> >> > <hrishikeshpa...@gmail.com> wrote:
> >> > > Hello everyone,
> >> > >
> >> > >
> >> > > Thanks for your suggestions and engaging response.
> >> > >
> >> > > Based on the feedback I think that the scope of this project comprises
> >> > > of
> >> > > following three indicative actions:
> >> > >
> >> > >
> >> > > 1. Creating separate driver i.e. separate dump tool that uses lto
> >> > > object API
> >> > > for reading the lto file.
> >> >
> >> > Yes.  I expect this will take the whole first half of the project,
> >> > after this you
> >> > should be somewhat familiar with the infrastructure as well.  With the
> >> > existing dumping infrastructure it should be possible to dump the
> >> > callgraph and individual function bodies.
> >> >
> >> > >
> >> > > 2. Extending LTO dump infrastructure:
> >> > >
> >> > > GCC already seems to have dump infrastructure for pretty-printing tree
> >> > > nodes, gimple statements etc. However I suppose we’d need to extend
> >> > > that for
> >> > > dumping pass summaries ? For instance, should we add a new hook say
> >> > > “dump”
> >> > > to ipa_opt_pass_d that’d dump the pass
> >> > > summary ?
> >> >
> >> > That sounds like a good idea indeed.  I'm not sure if this is the most
> >> > interesting
> >> > missing part - I guess we'll find out once a dump tool is available.
> >>
> >> Concerning the LTO file format, my longer term aim is to make the symbol
> >> table sections (the symtab used by lto-plugin as well as the callgraph
> >> section, and hopefully also the Gimple streams) documented and well
> >> behaving without changing the format in every revision.
> >>
> >> On the other hand, the summaries used by individual passes are intended
> >> to be pass specific and evolving as individual passes become stronger/new
> >> passes are added.
> >>
> >> It is quite a lot of work to stabilize the gimple representation to this
> >> extent.  For the callgraph table this is however more realistic. That
> >> would mean moving some of the existing random stuff streamed there into
> >> summaries and additionally cleaning up/rewriting lto-cgraph so the
> >> on-disk format actually makes sense.
> >>
> >> I will be happy to help with any steps in this direction as well.
> >>
> >> Honza
> >
> >


Re: How can compiler speed-up postgresql database?

2018-03-21 Thread Jan Hubicka
> On 03/21/2018 10:26 AM, Richard Biener wrote:
> >On Tue, Mar 20, 2018 at 8:57 PM, Martin Liška  wrote:
> >>Hi.
> >>
> >>I did similar stats for postgresql server, more precisely for pgbench:
> >>pgbench -s100 & 10 runs of pgbench -t1 -v
> >
> >Without looking at the benchmark probably only because it is flawed
> >(aka not I/O or memory bandwidth limited).  It might have some
> >actual operations on data (regex code?) that we can speed up though.
> 
> Well, it's not ideal as it tests quite simple DB with just couple of tables:
> ```
> By default, pgbench tests a scenario that is loosely based on TPC-B, 
> involving five SELECT, UPDATE, and INSERT commands per transaction.
> ```
> 
> Note that I had pg_data in /dev/shm and I verified that CPU utilization was 
> 100% on a single core.
> That said, it should not be so misleading ;)

Well, it is usually easy to run perf and look at what the hot spots look like.
I see similar speedups for common page loading in Firefox and similar benchmarks,
which are quite good.  Not everything needs to be designed to be memory bound
like SPEC.

Honza
> 
> Martin
> 
> >
> >Richard.
> >
> >>Martin
> 


Re: GSOC 2018 - Textual LTO dump tool project

2018-03-02 Thread Jan Hubicka
Hello,
> On Fri, Mar 2, 2018 at 10:24 AM, Hrishikesh Kulkarni
>  wrote:
> > Hello everyone,
> >
> >
> > Thanks for your suggestions and engaging response.
> >
> > Based on the feedback I think that the scope of this project comprises of
> > following three indicative actions:
> >
> >
> > 1. Creating separate driver i.e. separate dump tool that uses lto object API
> > for reading the lto file.
> 
> Yes.  I expect this will take the whole first half of the project,
> after this you
> should be somewhat familiar with the infrastructure as well.  With the
> existing dumping infrastructure it should be possible to dump the
> callgraph and individual function bodies.
> 
> >
> > 2. Extending LTO dump infrastructure:
> >
> > GCC already seems to have dump infrastructure for pretty-printing tree
> > nodes, gimple statements etc. However I suppose we’d need to extend that for
> > dumping pass summaries ? For instance, should we add a new hook say “dump”
> > to ipa_opt_pass_d that’d dump the pass
> > summary ?
> 
> That sounds like a good idea indeed.  I'm not sure if this is the most
> interesting
> missing part - I guess we'll find out once a dump tool is available.

Concerning the LTO file format, my longer term aim is to make the symbol
table sections (the symtab used by lto-plugin as well as the callgraph section,
and hopefully also the Gimple streams) documented and well behaving
without changing the format in every revision.

On the other hand, the summaries used by individual passes are intended to be
pass specific and evolving as individual passes become stronger/new passes
are added.

It is quite a lot of work to stabilize the gimple representation to this extent.
For the callgraph table this is however more realistic. That would mean moving
some of the existing random stuff streamed there into summaries and additionally
cleaning up/rewriting lto-cgraph so the on-disk format actually makes sense.

I will be happy to help with any steps in this direction as well.

Honza


Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-06 Thread Jan Hubicka
Hello,
while running benchmarks for inliner tuning I also ran benchmarks
comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin
Liska's LNT setup (https://lnt.opensuse.org/).  The results are
summarized below, but you can also see the colorful tables produced
by Martin's LNT magic:

https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3_percentage_change=0.02=746f%2C55f=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10_percentage_change=0.02=746f%2C55f

Overall we got the following SPECrate improvements:

 SPECfp2k6   kabylake generic  +7.15%
 SPECfp2k6   kabylake native   +9.36%
 SPECfp2k17  kabylake generic  +5.36%
 SPECfp2k17  kabylake native   +6.03%
 SPECint2k17 kabylake generic  +4.13%

 SPECfp2k6   zen  generic  +9.98%
 SPECfp2k6   zen  native   +7.04%
 SPECfp2k17  zen  generic  +6.11%
 SPECfp2k17  zen  native   +5.46%
 SPECint2k17 zen  generic  +3.61%
 SPECint2k17 zen  native   +5.18%

The performance results seem surprisingly strongly in favor of
vectorization.  Martin's setup is also checking code size, which goes up
by as much as 26% on leslie3d, but since many of the benchmarks are small,
this is not very representative of the overall code size/compile time costs
of vectorization.

I measured compile time/size on larger programs I have available, with
notable changes on DealII, but otherwise sub-1% increases.  I also
benchmarked Firefox but there are no significant differences because
the build system already uses -O3 in the places where it matters (the graphics
library etc.)

                    Compile time   code segment size
Firefox mainline    in noise       0.8%
gcc from spec2k6    0.5%           0.6%
gdb                 0.8%           0.3%
crafty              0%             0%
DealII              3.2%           4%

Note that I benchmarked -ftree-slp-vectorize separately before and the
results were hit or miss, so perhaps enabling only -ftree-vectorize would
give better compile time tradeoffs. I was worried about partial memory
stalls, but I will benchmark it and also benchmark the difference between
cost models.

There are some performance regressions, most notably in SPEC
 - exchange (all settings),
 - gamess (all settings),
 - calculix (Zen native only),
 - bwaves (Zen native)
and, from Polyhedron, induct2 on all settings and ffft2 on Zen only. Botan
seems very noisy, but it is rather special code.

Exchange can be fixed by adding a heuristic that it is a bad idea to
vectorize within a loop nest of depth 10 containing a recursive call. I believe
gamess and calculix are understood and I can look into the remaining
cases.

Overall I am surprised how much improvement vectorization at -O2 can bring
- clearly, the more parallel CPUs become, the more performance depends on it.  In my
experience from analyzing regressions of gcc -O2 compared to clang -O2 builds,
vectorization is one of the most common reasons. Having gcc -O2 produce
lower SPEC scores and comparably large binaries to clang -O2 does not
feel OK and I think the problem is not limited just to artificial
benchmarks.
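
To give an idea of the kind of code this is about, a trivial kernel like the one
below (an illustration only, not taken from the benchmarks) stays scalar at plain
-O2 today but gets vectorized at -O3 or with -O2 -ftree-vectorize:

  void
  saxpy (float *__restrict__ y, const float *__restrict__ x, float a, int n)
  {
    for (int i = 0; i < n; i++)
      y[i] += a * x[i];
  }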

Even though it is late in the release cycle I wonder if we can do that for
GCC 9?  Performance of vectorization is very architecture specific; I
would propose enabling vectorization for Zen, core based chips and
generic in x86-64. I can also run benchmarks on Bulldozer. I can then
tune down the cheap cost model to avoid some of the more expensive
transformations.

Honza


Kabylake Spec2k6, generic tuning

  improvements:
SPEC2006/FP/481.wrf -31.33% 
SPEC2006/FP/436.cactusADM   -28.17% 
SPEC2006/FP/437.leslie3d-17.21% 
SPEC2006/FP/434.zeusmp  -12.90% 
SPEC2006/FP/454.calculix-6.44%  
SPEC2006/FP/433.milc-6.03%  
SPEC2006/FP/459.GemsFDTD-4.65%  
SPEC2006/FP/450.soplex  -2.11%  
SPEC2006/INT/403.gcc-6.54%  
SPEC2006/INT/456.hmmer  -5.45%  
SPEC2006/INT/464.h264ref-2.23%  
  regresions:
SPEC2006/FP/416.gamess  8.51%   
SPEC2006/FP/447.dealII  2.73%   

Kabylake spec2k6 -march=native

  improvements:
SPEC2006/FP/436.cactusADM   -45.52% 
SPEC2006/FP/481.wrf -34.13% 
SPEC2006/FP/434.zeusmp  -20.25% 
SPEC2006/FP/437.leslie3d-19.44% 
SPEC2006/FP/459.GemsFDTD-6.85%  
SPEC2006/FP/433.milc-2.15%  
SPEC2006/INT/456.hmmer  -8.97%  
SPEC2006/INT/403.gcc-7.07%  
SPEC2006/INT/464.h264ref-3.00%  
  regressions:
SPEC2006/FP/416.gamess  7.97%   
SPEC2006/INT/483.xalancbmk  3.55%   
SPEC2006/INT/400.perlbench  2.61%   

Kabylake spec2k17 generic tuning

  improvements:
SPEC2017/INT/525.x264_r -33.24% 

Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Jan Hubicka
> > Note that I benchmarked -ftree-slp-vectorize separately before and
> > results was hit/miss, so perhaps enabling only -ftree-vectorize would
> > give better compile time tradeoffs. I was worried of partial memory
> > stalls, but I will benchmark it and also benchmark difference between
> > cost models.
> 
> ; Alias to enable both -ftree-loop-vectorize and -ftree-slp-vectorize.
> ftree-vectorize
> Common Report Optimization
> Enable vectorization on trees.

Thanks! I would probably fall into that trap and run the same set of
benchmarks again.

Honza
> 
> -- 
> Eric Botcazou


Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Jan Hubicka
> On Mon, Jan 07, 2019 at 09:29:09AM +0100, Richard Biener wrote:
> > On Sun, 6 Jan 2019, Jan Hubicka wrote:
> > > Even though it is late in release cycle I wonder if we can do that for
> > > GCC 9?  Performance of vectorization is very architecture specific, I
> > > would propose enabling vectorization for Zen, core based chips and
> > > generic in x86-64. I can also run benchmarks on buldozer. I can then
> > > tune down the cheap model to avoid some of more expensive
> > > transformations.
> > 
> > I'd rather not do this now, it's _way_ too late (also considering
> > you are again doing inliner tuning so late).
> 
> This probably should be more generic than just x86 really, we have similar
> problems on Power (-O3 is almost always faster than -O2, which is bad).
> Likely other archs have the same problems.
> 
> But yes, too late for GCC 9.

Yep, I guessed so, still wanted to ask :)
I think this is similar to schedule-insns(2) where it is subtarget specific
whether it is a win or not. So I think it is good to leave it up to the target
to enable the pass - we probably have fewer targets that do want
vectorizing than those that don't.

I would suggest enabling it on x86 early next stage1 and trying to do
similar benchmarks on ppc and arm.  We can then try to tune the code
size/speed tradeoffs.

Honza
> 
> 
> Segher


Re: Garbage collection bugs

2019-01-09 Thread Jan Hubicka
> On Wed, Jan 9, 2019 at 12:48 PM Richard Biener
>  wrote:
> >
> > On Wed, Jan 9, 2019 at 10:46 AM Joern Wolfgang Rennecke
> >  wrote:
> > >
> > > We've been running builds/regression tests for GCC 8.2 configured with
> > > --enable-checking=all, and have observed some failures related to
> > > garbage collection.
> > >
> > > First problem:
> > >
> > > The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but
> > > -std=c++11 and -std=c++14 appear to follow the same pattern) see gcc
> > > garbage-collecting a live vector.  A subsequent access to the vector
> > > with vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is
> > > 0xa5a5a5a5 . The vec data is also being freed / poisoned. The vector in
> > > question is an auto-variable of cp_parser_parenthesized_expression_list,
> > > which is declared as: vec *expression_list;
> > >
> > > According to doc/gty/texi: "you should reference all your data from
> > > static or external @code{GTY}-ed variables, and it is advised to call
> > > @code{ggc_collect} with a shallow call stack."
> > >
> > > In this case, cgraph_node::finalize_function calls the garbage collector,
> > > as we are finishing a member function of a struct. gdb shows a backtrace
> > > of 34 frames, which is not really much as far as C++ parsing goes. The
> > > caller of finalize_function is expand_or_defer_fn, which uses the
> > > expression "function_depth > 1" to compute the no_collect parameter to
> > > finalize_function.
> > > cp_parser_parenthesized_expression_list is in frame 21 of the backtrace
> > > at this point.
> > > So, if we consider this shallow, cp_parser_parenthesized_expression_list
> > > either has to refrain from using a vector with garbage-collected
> > > allocation, or it has to make the pointer reachable from a GC root - at
> > > least if function_depth <= 1.
> > > Is the attached patch the right approach?
> > >
> > > When looking at regression test results for gcc version 9.0.0 20181028
> > > (experimental), the excess errors test for g++.dg/pr85039-2.C seems to
> > > pass, yet I can see no definite reason in the source code why that is
> > > so.  I tried running the test by hand in order to check if maybe the
> > > patch for PR c++/84455 plays a role,
> > > but running the test by hand, it crashes again, and gdb shows the
> > > telltale a5 pattern in a pointer register.
> > > #0  vec::quick_push (obj=,
> > >  this=0x705ece60)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:974
> > > #1  vec_safe_push (obj=,
> > >  v=@0x7fffd038: 0x705ece60)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:766
> > > #2  cp_parser_parenthesized_expression_list (
> > >  parser=parser@entry=0x77ff83f0,
> > >  is_attribute_list=is_attribute_list@entry=0, 
> > > cast_p=cast_p@entry=false,
> > >  allow_expansion_p=allow_expansion_p@entry=true,
> > >  non_constant_p=non_constant_p@entry=0x7fffd103,
> > >  close_paren_loc=close_paren_loc@entry=0x0, wrap_locations_p=false)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:7803
> > > #3  0x006e910d in cp_parser_initializer (
> > >  parser=parser@entry=0x77ff83f0,
> > >  is_direct_init=is_direct_init@entry=0x7fffd102,
> > >  non_constant_p=non_constant_p@entry=0x7fffd103,
> > >  subexpression_p=subexpression_p@entry=false)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:22009
> > > #4  0x0070954e in cp_parser_init_declarator (
> > >  parser=parser@entry=0x77ff83f0,
> > >  decl_specifiers=decl_specifiers@entry=0x7fffd1c0,
> > >  checks=checks@entry=0x0,
> > > function_definition_allowed_p=function_definition_allowed_p@entry=true,
> > >  member_p=member_p@entry=false, declares_class_or_enum= > > out>,
> > >  function_definition_p=0x7fffd250, maybe_range_for_decl=0x0,
> > >  init_loc=0x7fffd1ac, auto_result=0x7fffd2e0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:19827
> > > #5  0x00711c5d in cp_parser_simple_declaration
> > > (parser=0x77ff83f0,
> > >  function_definition_allowed_p=,
> > > maybe_range_for_decl=0x0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:13179
> > > #6  0x00717bb5 in cp_parser_declaration (parser=0x77ff83f0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:12876
> > > #7  0x0071837d in cp_parser_translation_unit 
> > > (parser=0x77ff83f0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:4631
> > > #8  c_parse_file ()
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:39108
> > > #9  0x00868db1 in 

Re: Large Fortran program succesfully LTO'd.

2018-09-15 Thread Jan Hubicka
> Honza,
> 
> At the Cauldron meeting last week I mentioned that I wasn't able to compile
> our "small" weather forecasting program with LTO.
> 
> In the mean time I have read some bug reports with "multiple prevailing ..."
> errors, which made me try linking with the 'gold' linker - that worked.
> 
> So the only things building the software that I changed were adding -flto
> and -fuse-ld=gold.
> 
> Some statistics of the source code:
> 
> 3902 files totaling 1081944 lines.
> 
> The result works flawlessly.
> 
> Over the weekend I will study the LTO diagnostics to see what decisions were
> made with respect to inlining and other optimizations.

Great to hear, thanks!   I wonder if it would be possible to have -flto-report
-fmem-report -Q output of the link.  I am just trying to get more aggressive type
merging to work, but it is C++ only (because I can make use of the ODR rule here),
and I wonder how many types you get in other languages (the hope is that C++ is
the most type-heavy of them all).

I would be interested if performance looks good, too!
Honza
> 
> Thanks !
> 
> -- 
> Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


Re: GCC 8 vs. GCC 9 speed and size comparison

2019-04-15 Thread Jan Hubicka
> Hi,
> 
> On Mon, 15 Apr 2019, Martin Liška wrote:
> 
> > There's a similar comparison that I did for the official openSUSE gcc 
> > packages. gcc8 is built with PGO, while the gcc9 package is built in 2 
> > different configurations: PGO, LTO, PGO+LTO (LTO used for FE in stage4, 
> > for generators in stage3 as well).
> > 
> > Please take a look at attached statistics.
> 
> It seems the C++ parser got quite a bit slower with gcc 9 :-( Most visible 
> in the compile time for tramp-3d (24%) and kdecore.cc (18% slower with 
> just PGO); it seems that the other .ii files are C-like enough to not 
> trigger this behaviour, so it's probably something to do with templates 
> and/or classes.

Would it be possible to have -ftime-report for your tramp3d build with
the four compilers?  It may be a consequence of training changes and
e.g. template instantiation code which is exercised by tramp3d a lot
more than by GCC bootstrap.
But it still seems a bit too much for simple PGO divergence. It also
may be caused by changes in inliner decisions or some other pass being
overly active.

Honza
> 
> 
> Ciao,
> Michael.



Re: Expanding roundeven (Was: Re: About GSOC.)

2019-06-22 Thread Jan Hubicka
> Hello.
> I have already sent a patch for expanding roundeven for i386 with
> relevant doubts. I also was regression testing with
> make -k check
> after successful bootstrap build with reverting my patches. Turns out
> do-check fails without any patches applied, Is it ok to do anyways for
> applied patch? Also, /contrib/compare_tests returns with "no common
> files to compare" but I see all as the same files (actually, I am
> moving *.sum files in manual directory and comparing with build of
> applied patch.)? Is it due to failed make check?

The usual procedure is to do make check on a clean tree and use
contrib/test_summary to produce a summary, and then do the same
with the patched tree and compare the outputs (by diff).

Honza
> 
> Thanks,
> -Tejas
> 
> 
> On Wed, 19 Jun 2019 at 19:07, Tejas Joshi  wrote:
> >
> > Hello.
> > I have made following changes to inspect inlining of roundeven with
> > the following test code:
> > double
> > plusone (double d)
> > {
> > return __builtin_roundeven (d) + 1;
> > }
> >
> > Running the program using -O2 foo.c gave internal compiler error which
> > I believe is because gcc_unreachable() at:
> >
> > if (TARGET_SSE4_1)
> > emit_insn (gen_sse4_1_round2
> >(operands[0], operands[1], GEN_INT (ROUND_
> >| ROUND_NO_EXC)));
> > I think the following condition matches the criterion? :
> > > I think the code will be much clearer if it explicitly says
> > > ROUND_ROUNDEVEN | ROUND_NO_EXC,
> >
> >  else if (TARGET_64BIT || (mode != DFmode))
> > {
> >   if (ROUND_ == ROUND_FLOOR)
> > ix86_expand_floorceil (operands[0], operands[1], true);
> >   else if (ROUND_ == ROUND_CEIL)
> > ix86_expand_floorceil (operands[0], operands[1], false);
> >   else if (ROUND_ == ROUND_TRUNC)
> > ix86_expand_trunc (operands[0], operands[1]);
> >   else
> > gcc_unreachable ();
> > }
> > in:
> > (define_expand "2"
> > But with -mavx, generated the vroundsd insn. Does it mean ROUNDEVEN
> > should have a condition in the else if, but comments above
> > ix86_expand* functions in i386-expand.c says that those are for SSE2
> > sequences?
> >
> > Thanks,
> > --Tejas
> >
> >
> > On Mon, 17 Jun 2019 at 22:45, Joseph Myers  wrote:
> > >
> > > On Mon, 17 Jun 2019, Tejas Joshi wrote:
> > >
> > > > > existing ROUND_NO_EXC definition in GCC.  A new definition will need
> > > > > adding alongside ROUND_FLOOR, ROUND_CEIL and ROUND_TRUNC to 
> > > > > correspond to
> > > > > rounding to nearest with ties to even, evaluating to 0.)
> > > >
> > > > So (ROUND_ROUNDEVEN   0x0) be declared for rounding towards nearest
> > > > even for rounding mode argument? But reference says that RC field
> > > > should end up as 00B for rounding ties to even? I am also much
> > >
> > > I think the code will be much clearer if it explicitly says
> > > ROUND_ROUNDEVEN | ROUND_NO_EXC, than if it hardcodes implicit knowledge
> > > that 0 is the value used for rounding to nearest with ties to even.
> > >
> > > --
> > > Joseph S. Myers
> > > jos...@codesourcery.com




Re: [RFC] zstd as a compression algorithm for LTO

2019-06-19 Thread Jan Hubicka
> 
> At least allow it to be built as part of the normal build like GMP,
> etc. are done.
> And include it in downloading using contrib/download_prerequisites
> like the libraries are done.

An annoying detail is that zstd builds with cmake, not autotools.

Honza
> 
> Thanks,
> Andrew Pinski
> 
> >
> > Richard.
> >
> > >jeff
> >


Re: [RFC] zstd as a compression algorithm for LTO

2019-06-20 Thread Jan Hubicka
> On 6/20/19 12:58 PM, Thomas Koenig wrote:
> > Am 20.06.19 um 11:07 schrieb Martin Liška:
> >> On the contrary, decompression
> >> of zstd with zlib will end with:
> >> lto1: internal compiler error: compressed stream: data error
> > 
> > Sogenerating object files on one system and trying to read them
> > on another system which does not happen to have a particular
> > library installed would lead to failure?  If that's the case,
> > I am not sure that this is a good way of handling things.
> 
> Yes, but LTO bytecode is not supposed to be a distributable format.

In the longer term it should be.  We ought to make it host independent and
stable at least within major releases.
I guess it is still OK to make a zstd-enabled host build require a zstd-enabled
gcc elsewhere. Just the error message should be more informative,
which I think is not hard to do - both zstd and zlib should have a
recognizable header.

Another option is to put this into some common place per file.
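
Telling the two apart from the first bytes of the stream is easy; something
like the following (an illustrative helper, not existing GCC code) would be
enough to turn the generic data error into a "this stream needs zstd support"
message:

  #include <cstddef>

  enum class lto_compression { unknown, zlib, zstd };

  static lto_compression
  sniff_compression (const unsigned char *data, std::size_t len)
  {
    // zstd frames start with the magic number 0xFD2FB528, stored little endian.
    if (len >= 4
        && data[0] == 0x28 && data[1] == 0xB5
        && data[2] == 0x2F && data[3] == 0xFD)
      return lto_compression::zstd;
    // zlib streams start with CMF 0x78 for the usual 32K-window deflate and a
    // flag byte chosen so that the 16-bit value is divisible by 31.
    if (len >= 2
        && data[0] == 0x78
        && ((data[0] << 8 | data[1]) % 31) == 0)
      return lto_compression::zlib;
    return lto_compression::unknown;
  }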

Honza
> 
> Martin


Re: [PATCH] Add .gnu.lto_.meta section.

2019-06-21 Thread Jan Hubicka
> On 6/21/19 2:34 PM, Richard Biener wrote:
> > On Fri, Jun 21, 2019 at 12:20 PM Martin Liška  wrote:
> >>
> >> Hi.
> >>
> >> The patch is about a new ELF section that will contain information
> >> about LTO version. And for the future, used compression will be stored
> >> here. The patch removes streaming of the version into each section.
> > 
> > I'd like each section to have a header containing a compression method
> > (compressed or not, and now zlib vs. zstd).  We currently have a mix
> > and knowledge is hidden.
> 
> That would be possible, good idea.
> 
> > 
> > I also remember I had old patches to make the data streamer compress
> > the stream on-the-fly, not requiring the full uncompressed stream in
> > memory and then re-access it for compression (needing more temporary
> > memory).  But IIRC the speedup was marginal.

It may be more interesting as a memory optimization now that we do parallel WPA
streaming, BTW.
> > 
> > My thought is also that we should have exactly _one_ ELF section
> > for the LTO bytecode and our own container inside (much like
> > simple-object does for non-ELF).  So having another one is, well, ugly ;)
> 
> Having N sections for all symbols (functions and variables) is handy because
> we read some of the during WPA (by IPA ICF). So having all in a single section
> would make decompression more complicated.

There is nothing preventing us from implementing our own subsections in
that section, where each is compressed independently and can be copied to the
output file, and still save the size of the headers needed for that.

For example, all sections are keyed by the LTO section enum + possibly a
decl; both can be streamed as integers rather than quite lengthy
strings.  (The decl-to-name mapping is streamed into the global decl stream and
available at all times.)

I guess one can go for one global LTO major/minor in the section header
and then section type/is compressed/size + an optional decl ID for each
subsection, reducing headers to about 1+8+optionally 8 bytes per header.
This can all be placed at the beginning of the file followed by the RAW data
of the individual sections.  I bet this is a lot smaller than the ELF overhead :)
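
Roughly something like this per subsection (just a sketch of the intended size;
the names and exact layout are made up, and on disk the fields would of course
be written byte by byte rather than as padded C structs):

  #include <cstdint>

  struct lto_container_header      // once per file
  {
    uint16_t lto_major;
    uint16_t lto_minor;
    uint32_t n_subsections;
  };

  struct lto_subsection_header     // once per subsection: ~1 + 8 (+ 8) bytes
  {
    uint8_t  type_and_flags;       // lto section enum + "is compressed" bit
    uint64_t size;                 // size of the raw data that follows
    // optional: uint64_t decl_id, only for sections keyed by a decl
  };
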
> 
> I'm going to prepare a patch that will pull out a LTO section header
> from compressed stream.
This looks like a good step (and please stream it in a host independent
way). I suppose all these issues can be dealt with one by one.

Honza
> 
> Martin
> 
> > 
> >> Disadvantage is a format change that will lead to following errors when
> >> LTO bytecode is used from a different version:
> >>
> >> $ gcc x.o
> >> lto1: fatal error: bytecode stream in file ‘x.o’ generated with GCC 
> >> compiler older than 10.0
> >>
> >> $ gcc-9 main.o
> >> lto1: fatal error: bytecode stream in file ‘main.o’ generated with LTO 
> >> version 850.0 instead of the expected 8.0
> >>
> >> I've been testing the patch.
> >> Thoughts?
> >> Martin
> 


Re: Dropping support of repo files (tlink)

2019-06-21 Thread Jan Hubicka
> > >> I should have been clearer about Darwin:
> > >>
> > >> collect2 is required because it wraps the calling of lto-wrapper and ld.
> > >>
> > >> FWIW Darwin also passes all the “-frepo” testcases, however, I’m not 
> > >> aware of anyone actually
> > >> using case #2 from Jonathan’s post.
> > >>
> > >> So, AFAIK the tlink capability isn’t required for modern C++ on Darwin; 
> > >> but, maybe deprecation is a
> > >> safer step.
> > >
> > > Thank you for the information.
> > >
> > > Yes, I would be fine to deprecate that for GCC 10.1
> >
> > “thinking aloud” - would it work to deprecate (or even disallow immediately 
> > if no-one is using it) LTO + frepo?
> > (so that non-LTO frepo continues as long as anyone needs it)
> 
> Does that even work? ;)  It would need an intermediate compile-step to
> instantiate templates to
> another LTO object.

It is iterating the linker executions so it definitely makes no sense.
One problem is that in collect2, at the time of deciding whether
to open a temporary file for the linker output (which is the problem
Martin tries to solve), we do not know yet whether we will do -flto
(because that will be decided by the linker plugin later) or repo
(because it works by looking for .repo files only after the link step
failed).

I suppose we could block -frepo when -flto is on the command line, which is
the most common case for LTO builds, arriving at the bit odd situation that at
link time -flto will affect nothing but the colors of diagnostics
(I would really like to have this solved for gcc 10 in some way).

Honza
> 
> Richard.
> 
> >
> > Iain
> >
> >
> >
> >


Re: [PATCH] Deprecate -frepo option.

2019-06-27 Thread Jan Hubicka
> 
> > On 27 Jun 2019, at 19:21, Jan Hubicka  wrote:
> > 
> >> 
> >> It's useful on targets without COMDAT support.  Are there any such
> >> that we care about at this point?
> >> 
> >> If the problem is the combination with LTO, why not just prohibit that?
> > 
> > The problem is that at the collect2 time we want to decide whether to
> > hold stderr/stdout of the linker. The issue is that we do not know yet
> > if we are going to LTO (because that is decided by linker plugin) or
> > whether we do repo files (because we look for those only after linker
> > failure).
> 
> you could pre-scan for LTO, as is done for the collect2+non-plugin linker.
> (similar to the comment below, that would take some time, but probably
> small c.f the actual link).

the collect2+non-plugin path does not handle static archives so it would need
more work, while I think we should kill that code (and make Darwin use
lld)

Honza

> Iain
> 
> > We can look for repo files first but that could take some time
> > especially for large programs (probably not too bad compared to actual
> > linking later) or we may require -frepo to be passed to collect2.
> > 
> > Honza
> >> 
> >> Jason
> 


Re: [PATCH] Deprecate -frepo option.

2019-06-27 Thread Jan Hubicka
> 
> It's useful on targets without COMDAT support.  Are there any such
> that we care about at this point?
> 
> If the problem is the combination with LTO, why not just prohibit that?

The problem is that at collect2 time we want to decide whether to
hold the stderr/stdout of the linker. The issue is that we do not know yet
whether we are going to LTO (because that is decided by the linker plugin) or
whether we do repo files (because we look for those only after a linker
failure).

We can look for repo files first, but that could take some time,
especially for large programs (probably not too bad compared to the actual
linking later), or we may require -frepo to be passed to collect2.

Honza
> 
> Jason


Re: Questions about IPA/clones and new LTO pass

2019-12-10 Thread Jan Hubicka
> On Mon, 2019-12-09 at 17:59 -0500, Erick Ochoa wrote:
> > Hello,
> > 
> > this is an update on the LTO pass we've been working on. The
> > optimization is called ipa-initcall-cp because it propagates constant
> > values written to variables with static lifetimes (such as ones
> > initialized in initialization functions).
> [ ... ]
> Just a note, I suspect most of the development community is currently
> focused on stage3 bugfixing rather than new code development.  So
> replies may be limited.
> 
> Jan Hubicka is probably the best person to answer your questions, but
> others may be able to do so as well.

Indeed, I am a bit swamped at the moment, but I will try to look at this
on Thursday. Sorry for being slow - feel free to ping me about any
IPA/LTO related issues if I respond slowly or not at all.

Honza
> 
> jeff
> 


Re: Fw: GSoC topic: Implement hot cold splitting at GIMPLE IR level

2020-03-03 Thread Jan Hubicka
Aditya,
the hot/cold partitioning is currently organized as follows:

1) we have static branch prediction code in predict.c and profile
   feedback, which we store into the CFG and callgraph.
2) we have predicates

   optimize_*_for_speed_p/size_p

   where * can be function, basic block, cfg edge, callgraph edge, loop
   or loop nest
3) all optimization passes trading code size for speed should consult
   the corresponding predicates about whether to do the transform
   (see the sketch after this list)
4) ipa-split pass is mostly there to enable partial inlining
5) hot/cold partitioning in bb-reorder offlines cold code into a separate section
6) we have the ipa-icf pass to merge identical functions and some code in FRE
   to merge identical code within a single function.
7) we do shrink wrapping to mitigate register pressure problems caused
   in cold regions of code
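
To illustrate 2) and 3), roughly (an illustrative sketch only, not code
quoted from any particular pass):

  /* Guard a size-increasing transform so it only triggers where static
     prediction or profile feedback says the block is hot.  */
  static bool
  transform_profitable_p (basic_block bb)
  {
    return optimize_bb_for_speed_p (bb);
  }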

I think this is a bit stronger than what LLVM does currently, which relies on
outlining SESE regions earlier rather than going through the pain of
implementing support for partitioning.

Building clang10 with GCC and FDO leads to 37MB .text section
Building clang10 with GCC and LTO+FDO leads to 34MB .text section
Building clang10 with clang10 and FDO leads to 53MB .text section
Building clang10 with clang10 and thinlto+FDO leads to 67MB .text section

GCC-built clang is about 2-3% faster at building Firefox.

There are many things which I think could/should improve in our
framework.
 1) our size optimization is very aggressive (LLVM's -Oz) and enabling it
based on heuristics may lead to huge regressions.  We probably want
the optimize_for_size_p predicate to have two levels and use a less
aggressive mode in places where we are not sure the code is really very
cold.
 2) ipa-split is very simplistic and only splits when there is no value
computed in the header of the function and used in the tail.  We should
support adding extra parameters for computed values and do more general
SESE outlining.

Note that we do SESE outlining for OpenMP but this code is not
interfaced generically enough to be easily used by ipa-split.

The original implementation of ipa-split was kind of a "first cut" meant
to clean up interfaces to the rest of the compiler and implement more
fancy features later. This never happened so there is certainly
space for improvements here.

We also do all splitting before the actual IPA optimization, while it may
be more reasonable to identify potential split points and let the IPA
optimization decide on the transforms (currently we rely on the inliner
to inline back useless splits).
 3) function partitioning is enabled only for x86. I never had time to
get it working on other targets and no-one picked up this task.
 4) ipa-icf and in-function code merging are currently very conservative
when comparing metadata like type-based aliasing info (I plan to work
on this next stage1).
 5) we have only very limited logic to detect cold regions without
profile feedback and thus the amount of offlined code is very small
(this is also because of 1).
We basically know that code leading to abort/exception handling etc.
is cold and consider everything else hot.
 6) We lack a code placement pass (though Martin has a WIP implementation
of it).
 7) We do no partitioning of the data segment, which may also be interesting
to do.
 8) Most non-x86 backends do not implement the hot/cold code models and
instruction choice very well.
 9) Shrink-wrapping and register allocation are not always able to move
spilling to cold code paths, but this is generally a very hard problem
to tackle.

So there are a lot of places for improvement (and I am sure more can be
found), and I would be happy to help you with them.

Honza
> 
> Hi Martin,
> Thank you for explaining the status quo. After reading the code of 
> bb-reorder.c,
>  it looks pretty good and seems it doesn't need any significant improvements.
> In that case, the only value GIMPLE level hot/cold splitting could bring is 
> to enable aggressive code-size optimization
> by merging of similar/identical functions: after outlining cold regions, they 
> may be suitable candidates for function merging.
> ipa-split might be enabling some of that, having a region based function 
> splitting could improve ipa-split.
> 
> -Aditya
> 
> 
> --
> From: Martin Liška 
> Sent: Tuesday, March 3, 2020 2:47 AM
> To: Aditya K ; gcc@gcc.gnu.org 
> Cc: Jan Hubicka 
> Subject: Re: GSoC topic: Implement hot cold splitting at GIMPLE IR level
> 
> Hello.
> Thank you for idea. I would like to provide some comments about what GCC can 
> currently
> do and I'm curious we need something extra on top of what we do.
> Right now, GCC can do hot/cold partitioning based on functions and basic 
> blocks. With
> a PGO profile, the optimization is quite aggressive and can save quite some 
> code
> being placed into a cold partitioning and being optimized for size. Without a 
> profile,
> we do a static 

Re: Comparing types at LTO time

2020-01-09 Thread Jan Hubicka
> There doesn't seem to be a way to compare types at LTO time. The functions
> same_type_p and comptypes are front end only if I'm not totally confused
> (which is quite possible) and type_hash_eq doesn't seem to apply for
> structure types. Please, any advice would be welcome.

At LTO time it is a bit hard to say what really is the "same type".  We
have multiple notions of equivalence:
 1) TYPE_MAIN_VARIANT (t1) == TYPE_MAIN_VARIANT (t2)
means that both types were equal in tree merging at stream-in; this
means that their IL representation is identical.

This will lead to "false" for types that are the same but did not get
merged for various reasons. One such valid reason, for example, is the
visibility of associated virtual tables.
 2) types_same_for_odr returns true if types are considered the same
by the C++ one definition rule.  This is reliable but works only for
named C++ types (structures and unions).
 3) same_type_for_tbaa returns true if types are equivalent for
type-based alias analysis.  It returns true in many more cases than 1)
but is too coarse if you want to do data-structure changes.

So in general this is quite a hard problem (and in fact I originally started
to play with ODR types to understand it better).  I would suggest
starting with 1) if you want to rewrite types and eventually adding a new
comparison once the pass does something useful.
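
As a starting point, roughly something like this could do (an untested
sketch; it assumes ipa-utils.h is included so that odr_type_p and
types_same_for_odr are visible):

  /* Conservative type equality for a prototype IPA/LTO pass: first the
     stream-in merging criterion from 1), then the ODR rule from 2) for
     named C++ types.  */
  static bool
  conservatively_same_type_p (tree t1, tree t2)
  {
    if (TYPE_MAIN_VARIANT (t1) == TYPE_MAIN_VARIANT (t2))
      return true;
    if (odr_type_p (t1) && odr_type_p (t2)
        && types_same_for_odr (t1, t2))
      return true;
    return false;
  }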

Richard may have some extra insights.

Honza
> 
> Thanks,
> 
> Gary Oblock
> 


Re: [EXT] Re: Option processing question

2020-01-14 Thread Jan Hubicka
> On Mon, Jan 13, 2020 at 9:28 PM Gary Oblock  wrote:
> >
> > If I have an optimization set up to run in LTO (call it my_opt) and
> > the flags -flto-partition=one, -flto options and -fmy-opt are set
> > then the optimization might or might not be run depending of whether
> > it can all fit in one partition.
> >
> > What I'm thinking is as long as it's a fatal error detectable anywhere
> > upstream in the compilation to not specify -fmy-opt without 
> > -flto-partition=one
> > then all will be well. So how do I detect it at all and where would I put 
> > the checking?
> 
> I guess you could always check the symbols used_from_other_partition
> and/or in_other_partition flags?  That is, it might be perfectly possible
> for some corner-cases to do the transform independently with more than one
> partition if the partitioning happens to isolate things perfectly?

Well, if you do the optimization at ltrans time you basically see symbols in
other partitions as external, so there is not that much practical
difference in the implementation of the analysis except that it will catch
fewer transforms.

But the idea for keeping -flto-partition=none around was to make it
possible to write a prototype of an IPA pass that does not need to play
with partitioning and to incrementally add WPA bits.
It seems that this strategy may work here - having a prototype that
works only without WHOPR but produces valid code & does something useful
seems like a good first step (which is hard enough on its own).

So I would suggest looking at -flto-partition=none first and keeping in
mind that eventually we will want a solution that plays well with
partitioning.
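
In practice the gate of such a prototype pass could look roughly like this
(an untested sketch following Richard's in_lto_p/flag_wpa/flag_ltrans hint
quoted below):

  /* Run only in the combined LTO step, i.e. with -flto-partition=none,
     where neither WPA nor LTRANS runs as a separate stage.  */
  static bool
  gate_my_opt (void)
  {
    return in_lto_p && !flag_wpa && !flag_ltrans;
  }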

Honza
> 
> Richard.
> 
> > Gary
> > ____________
> > From: Richard Biener 
> > Sent: Monday, January 13, 2020 2:30 AM
> > To: Gary Oblock ; Jan Hubicka 
> > Cc: gcc@gcc.gnu.org 
> > Subject: [EXT] Re: Option processing question
> >
> > External Email
> >
> > --
> > On Sat, Jan 11, 2020 at 4:47 AM Gary Oblock  wrote:
> > >
> > > I'm writing an LTO optimization that requires "-flto-partition=one" How 
> > > can I make
> > > sure that this is the case? I've spent hours greping the code and the 
> > > Internals Doc is
> > > worth less than nothing for something like this.
> >
> > That's of course because you shouldn't be doing this ;)
> >
> > > If you have an answer or even
> > > I good idea of where to look please let me know. Note, for an normal 
> > > "-fbalh" there's
> > > a flag_blah that I could look at but for this there seems to be zip, nil, 
> > > diddly squat, etc.
> >
> > At LTRANS time you indeed don't know.
> >
> > But there's -flto-partition=none (none, not one), that you can detect 
> > somehow
> > (I think in_lto_p && !flag_ltrans && !flag_wpa).
> >
> > Richard.
> >
> > > Many thanks,
> > >
> > > Gary


Re: Question about undefined functions' parameters during LTO

2020-03-12 Thread Jan Hubicka
> Hello,
Hello,
> 
> I am trying to find out the arguments of functions which are undefined
> during LTO.
> 
> Basically:
> 
> gcc_assert(in_lto_p && !cnode->definition)
> // Do we have arguments?
> gcc_assert(DECL_ARGUMENTS(cnode->decl)) // fails
> // No, we don't.
> 
> As I understand it, functions which are not defined are ones which have have
> been declared external.

LTO works in a way that function bodies are loaded from the stream on
demand (using node->get_body or node->get_untransformed_body calls).
Declarations of function parameters are part of the function body and that
is why they are missing.

You can get all bodies if you experiment with your pass using the calls
above. If you want a real LTO pass working on the whole program, eventually
you will need to avoid reading all bodies (or the WPA stage will be slow and
memory hungry) and store the info you need into summaries.

ipa-prop already stores the number and types of arguments, which may be all
you need.
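
Roughly something like this (an untested sketch; it assumes the usual
includes, that the ipa-prop summaries were already computed, and the
IPA_NODE_REF interface of the current sources):

  /* Dump the parameter types ipa-prop recorded for NODE; this works even
     when the body of NODE was never read in.  */
  static void
  dump_known_param_types (cgraph_node *node)
  {
    ipa_node_params *info = IPA_NODE_REF (node);
    if (!info || !dump_file)
      return;
    for (int i = 0; i < ipa_get_param_count (info); i++)
      if (tree type = ipa_get_type (info, i))
        print_generic_expr (dump_file, type, TDF_SLIM);
  }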
> 
> I believe that, when building an application with -flto, the only functions
> which are not visible during LTO **and** are declared external are functions
> defined in libraries which have not been compiled with -flto. An example of
> this is glibc.
> 
> Indeed, I have implemented an analysis pass in gcc which prints out
> undefined functions, and it prints out the following:
> 
> undefined function __gcov_merge_add
> undefined function fopen
> undefined function printf
> undefined function __builtin_putchar
> undefined function calloc
> undefined function __gcov_merge_topn
> undefined function strtol
> undefined function free
> ... and more
> 
> Now, I am not interested in the bodies of these. I am only interested in
> determining the type of the arguments passed to these functions. However,
> when I call the following function:

See ipa_get_type 

Honza
> 
> ```
> void
> print_parameters_undefined_functions(const cgraph_node *cnode)
> {
>   gcc_assert(cnode);
>   gcc_assert(in_lto_p);
>   gcc_assert(!cnode->definition);
> 
>   tree function = cnode->decl;
>   gcc_assert(function);
>   enum tree_code code = TREE_CODE (function);
>   bool is_function_decl = FUNCTION_DECL == code;
>   gcc_assert (is_function_decl);
> 
>   log("about to print decl_arguments(%s)\n", cnode->name());
>   for (tree parm = DECL_ARGUMENTS (function); parm; parm = DECL_CHAIN(parm))
>   {
> log("hello world\n");
>   }
> ```
> 
> I never see "hello world" but I do see "about to print...".
> Does anyone have any idea on how to obtain the arguments to undefined
> functions?
> 
> The only way I see to do this, is to walk through the gimple instructions,
> find GIMPLE_CALL statements and look at the argument list at that moment.
> But I was wondering if there's a more efficient way to do it.
> 
> Thanks!


Re: Thought on inlining indirect function calls

2020-03-14 Thread Jan Hubicka
> > I think it is because during early opts we do not know if f is not 
> > modified by atoi call and at IPA time (where we already know that from 
> > resolution info) we do not have jump functions to track global variables.
> 
> I confirm that if I remove the calls to atoi, the function is correctly 
> inlined.
> Note however that, if I declare f as static, such as it can theoretically not 
> be
> impacted by atoi, the compiler still doesn't inline the function.

The problem here is that the compiler thinks that atoi may call back some
static function in the unit that may change the value of the variable.  There
is no analysis implemented of how such calls back into the unit can affect
the value.  We have the leaf attribute with which you can annotate atoi (but
glibc does not do that), which should make GCC work out the propagation.
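
For illustration, the annotation would look like this (just an example
declaration, nothing the installed headers are assumed to provide):

  /* Promise that atoi cannot call back into this translation unit, which
     is the possibility that currently blocks the propagation described
     above.  */
  extern int atoi (const char *) __attribute__ ((leaf));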

Honza
> 
> Yet, if STEP 1 was an issue, and that your answers has provided some 
> clarification,
> what is about the STEP 2? As far as I know, it is not supported, it is?
> 
> Frédéric Recoules
> 
> 
> - Original Message -
> From: "Jan Hubicka" 
> To: "Richard Biener" 
> Cc: "gcc" , "FRÉDÉRIC RECOULES" 
> , mjam...@suse.cz
> Sent: Saturday, March 14, 2020 14:05:07
> Subject: Re: Thought on inlining indirect function calls
> 
> > >I was pretty disappointed to see that even if the compiler knows we are 
> > >calling f_add, it doesn't inline the call (it ends up with "call 
> > >f_add"). 
> > 
> > It's probably because we know it's only called once and thus not 
> > performance relevant. Try put it into a loop. 
> 
> I think it is because during early opts we do not know if f is not 
> modified by atoi call and at IPA time (where we already know that from 
> resolution info) we do not have jump functions to track global variables. 
> 
> Honza 
> > 
> > Richard. 
> > 
> > >I can but only suppose it is because its address is taken and from a 
> > >blind black box user perspective, it doesn't sound too difficult to 
> > >completely inline it. 
> > > 
> > >STEP 2: statically known as being among a pool of less than 
> > > (arbitrarily fixed = 2) N functions 
> > > 
> > >#include  
> > >#include  
> > >#include  
> > > 
> > >int main (int argc, char *argv[]) 
> > >{ 
> > > int x, y, z; 
> > > enum f_e e; 
> > > if (argc < 4) return -1; 
> > > if (strcmp(argv[1], "add") == 0) 
> > > e = ADD; 
> > > else if (strcmp(argv[1], "sub") == 0) 
> > > e = SUB; 
> > > else return -1; 
> > > f_init(e); 
> > > x = atoi(argv[2]); 
> > > y = atoi(argv[3]); 
> > > z = f(x, y); 
> > > printf("%d\n", z); 
> > > return 0; 
> > >} 
> > > 
> > >Here the compiler can't know at compile time the function that will be 
> > >called but I suppose that it knows that it will be either f_add or 
> > >f_sub. 
> > >A simple work around would be for the compiler to test at the call site 
> > >the value of f and inline the call thereafter: 
> > > 
> > > if (f == _add) 
> > > z = f_add(x, y); 
> > > else if (f == _sub) 
> > > z = f_sub(x, y); 
> > > else __builtin_unreachable(); /* or z = f(x, y) to be conservative */ 
> > > 
> > >Once again, this transformation don't sound too complicated to 
> > >implement. 
> > >Still, easy to say-so without diving into the compiler's code. 
> > > 
> > > 
> > >I hope it will assist you in your reflections, 
> > >Have a nice day, 
> > >Frédéric Recoules 
> > 
> 


Re: Thought on inlining indirect function calls

2020-03-14 Thread Jan Hubicka
> >I was pretty disappointed to see that even if the compiler knows we are
> >calling f_add, it doesn't inline the call (it ends up with "call
> >f_add").
> 
> It's probably because we know it's only called once and thus not performance 
> relevant. Try put it into a loop. 

I think it is because during early opts we do not know whether f is
modified by the atoi call, and at IPA time (where we already know that from
the resolution info) we do not have jump functions to track global variables.

Honza
> 
> Richard. 
> 
> >I can but only suppose it is because its address is taken and from a
> >blind black box user perspective, it doesn't sound too difficult to
> >completely inline it.
> >
> >STEP 2: statically known as being among a pool of less than
> >(arbitrarily fixed = 2) N functions
> >
> >#include 
> >#include 
> >#include 
> >
> >int main (int argc, char *argv[])
> >{
> >int x, y, z;
> >enum f_e e;
> >if (argc < 4) return -1;
> >if (strcmp(argv[1], "add") == 0)
> >  e = ADD;
> >else if (strcmp(argv[1], "sub") == 0)
> >  e = SUB;
> >else return -1;
> >f_init(e);
> >x = atoi(argv[2]);
> >y = atoi(argv[3]);
> >z = f(x, y);
> >printf("%d\n", z);
> >return 0;
> >}
> >
> >Here the compiler can't know at compile time the function that will be
> >called but I suppose that it knows that it will be either f_add or
> >f_sub.
> >A simple work around would be for the compiler to test at the call site
> >the value of f and inline the call thereafter:
> >
> >if (f == _add)
> >z = f_add(x, y);
> >else if (f == _sub)
> >z = f_sub(x, y);
> >  else __builtin_unreachable(); /* or z = f(x, y) to be conservative */
> >
> >Once again, this transformation don't sound too complicated to
> >implement.
> >Still, easy to say-so without diving into the compiler's code.
> >
> >
> >I hope it will assist you in your reflections,
> >Have a nice day,
> >Frédéric Recoules
> 


Re: LTO slows down calculix by more than 10% on aarch64

2020-08-31 Thread Jan Hubicka
> Thanks for the suggestions.
> Is it possible to modify assembly files emitted after ltrans phase ?
> IIUC, the linker invokes lto1 twice, for wpa and ltrans,and then links
> the obtained object files which doesn't make it possible to hand edit
> assembly files post ltrans ?
> In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
> contains e_c3d to avoid the extra branch.
> (If that doesn't work out, I can proceed with manually inlining in the
> source and then modifying generated assembly).

It is not intended to work that way, but for a smaller benchmark you can
just keep the .s files, modify them and then compile again with gfortran
*.s or so.

Honza
> 
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > > which corresponds to the following loop in line 1014.
> > > > > do n1=1,3
> > > > >   s(iii1,jjj1)=s(iii1,jjj1)
> > > > >  &  +anisox(m1,k1,n1,l1)
> > > > >  &  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > > >  &  *weight
> > > > >
> > > > > I am not sure why would hoisting have any direct effect on this loop
> > > > > except perhaps that hoisting allocated more reigsters, and led to
> > > > > increased register pressure. Perhaps that's why it's using highered
> > > > > number regs for code-gen in inlined version ? However disabling
> > > > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > > > > (by grepping for str to sp), so
> > > > > hoisting is also helping here ? I am not sure how to proceed further,
> > > > > and would be grateful for suggestions.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh


Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Jan Hubicka
> On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
>  wrote:
> >
> > Hi,
> >
> > This is the final report of the "Automatic Parallel Compilation
> > Viability" project.  Please notice that this report is pretty
> > similar to the delivered from the 2nd evaluation, as this phase
> > consisted of mostly rebasing and bug fixing.
> >
> > Please, reply this message for any question or suggestion.
> 
> Thank you for your great work Giuliano!

Indeed, it is quite amazing work :)
> 
> It's odd that LTO emulated parallelism is winning here,
> I'd have expected it to be slower.  One factor might
> be different partitioning choices and the other might
> be that the I/O required is faster than the GC induced
> COW overhead after forking.  Note you can optimize
> one COW operation by re-using the main process for
> compiling the last partition.  I suppose you tested
> this on a system with a fast SSD so I/O overhead is
> small?

At the time I implemented fork-based parallelism for WPA (which I think
we could recover by generalizing Giuliano's patches a bit), I had the same
outcome: forked ltranses were simply running slower than those after
streaming.  This was however tested on Firefox, in my estimate sometime
around 2013. I never tried it on units comparable to insn-emit (which
would have been different at that time anyway). I was mostly aiming to get it
fully transparent with streaming but never quite finished it since, at
that time, I thought time was better spent on optimizing the LTO data
layout.

I suppose we want to keep both mechanisms in both WPA and normal
compilation and make the compiler choose the fitting one.

Honza
> 
> Thanks again,
> Richard.
> 
> > Thank you,
> > Giuliano.
> >
> > --- 8< ---
> >
> > # Automatic Parallel Compilation Viability: Final Report
> >
> > ## Complete Tasks
> >
> > For the third evaluation, we expected to deliver the product as a
> > series of patches for trunk.  The patch series were in fact delivered
> > [1], but several items must be fixed before merge.
> >
> >
> > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > Bootstrap is working, and therefore this can be used in an experimental
> > state.
> >
> > ## How to use
> >
> > 1. Clone the autopar_devel branch:
> > ```
> > git clone --single-branch --branch devel/autopar_devel \
> >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > ```
> > 2. Follow the standard compilation options provided in the Compiling
> > GCC page, and install it on some directory. For instance:
> >
> > ```
> > cd gcc_autopar_devel
> > mkdir build && cd build
> > ../configure --disable-bootstrap --enable-languages=c,c++
> > make -j 8
> > make DESTDIR=/tmp/gcc11_autopar install
> > ```
> >
> > 3. If you want to test whether your version is working, just launch
> > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> >
> > 5. If you want to compile a project with this version it uses GNU
> > Makefiles, you must modify the compilation rule command and prepend a
> > `+` token to it. For example, in Git's Makefile, Change:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > to:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > as well as point the CC variable to the installed gcc, and
> > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> >
> > # How the parallelism works in this project
> >
> > In LTO, the Whole Program Analysis decides how to partition the
> > callgraph for running the LTRANS stage in parallel.  This project
> > works very similar to this, however with some changes.
> >
> > The first was to modify the LTO structure so that it accepts
> > the compilation without IR streaming to files.  This avoid an IO
> > overhead when compiling in parallel.
> >
> > The second was to use a custom partitioner to find which nodes
> > should be in the same partition.  This was mainly done to bring COMDAT
> > together, as well as symbols that are part of other symbols, and even
> > private symbols so that we do not output hidden global symbols.
> >
> > However, experiment showed that bringing private symbols together did
> > not yield a interesting speedup on some large files, and therefore
> > we implemented two modes of partitioning:
> >
> > 1. Partition without static promotion. This is the safer method to use,
> > as we do not modify symbols in the Compilation Unit. This may lead to
> > speedups in files that have multiple entries points with low
> > connectivity between then (such as insn-emit.c), however this will not
> > provide speedups when this hypothesis is not true (gimple-match.c is an
> > example of this). This is the default mode.
> >
> > 2. Partition with static promotion to global. This is a more aggressive
> > method, as we can decide to promote some functions to global to increase

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Jan Hubicka
> > I guess before investigating the current state in detail
> > it might be worth exploring Honzas wish of sharing
> > the actual partitioning code between LTO and -fparallel-jobs.
> > 
> > Note that larger objects take a bigger hit from the GC COW
> > issue so at some point that becomes dominant because the
> > first GC walk for each partition is the same as doing a GC
> > walk for the whole object.  Eventually it makes sense to
> > turn off GC completely for smaller partitions.
> 
> Just a side note, I added a ggc_collect () before start forking
> and it did not improve things.

We were discussing it on IRC. Your problem is similar to what precompiled
headers are hitting, too. They essentially make a memory dump of GGC
memory that is mmapped back at the time the compiler recognizes the
corresponding #include.

There is a mechanism in the garbage collector to push and pop a context,
which was partly removed in the meantime.  Pushing a context makes all data
in the garbage collector permanent until you pop it again, and the marking
is not supposed to dirty the pages.  It is still done for PCH, see
ggc_pch_read and the line setting context_depth to 1.

In theory, setting context_depth to 1 before forking and fixing the places
where we dirty the pages should help both the fork-based parallelism and
PCH. As richi mentioned, ggc seems to set in_use_p anyway, which may
trigger unwanted COW.

I would also trim the memory pool and malloc before forking; that should be
cheap and work. It is already done by WPA before forking as well.

Honza
> 
> Thank you,
> Giuliano.
> 
> > 
> > Richard.
> > 
> > > Honza
> > > >
> > > > Thanks again,
> > > > Richard.
> > > >
> > > > > Thank you,
> > > > > Giuliano.
> > > > >
> > > > > --- 8< ---
> > > > >
> > > > > # Automatic Parallel Compilation Viability: Final Report
> > > > >
> > > > > ## Complete Tasks
> > > > >
> > > > > For the third evaluation, we expected to deliver the product as a
> > > > > series of patches for trunk.  The patch series were in fact delivered
> > > > > [1], but several items must be fixed before merge.
> > > > >
> > > > >
> > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > > > > Bootstrap is working, and therefore this can be used in an 
> > > > > experimental
> > > > > state.
> > > > >
> > > > > ## How to use
> > > > >
> > > > > 1. Clone the autopar_devel branch:
> > > > > ```
> > > > > git clone --single-branch --branch devel/autopar_devel \
> > > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > > > ```
> > > > > 2. Follow the standard compilation options provided in the Compiling
> > > > > GCC page, and install it on some directory. For instance:
> > > > >
> > > > > ```
> > > > > cd gcc_autopar_devel
> > > > > mkdir build && cd build
> > > > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > > > make -j 8
> > > > > make DESTDIR=/tmp/gcc11_autopar install
> > > > > ```
> > > > >
> > > > > 3. If you want to test whether your version is working, just launch
> > > > > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> > > > >
> > > > > 5. If you want to compile a project with this version it uses GNU
> > > > > Makefiles, you must modify the compilation rule command and prepend a
> > > > > `+` token to it. For example, in Git's Makefile, Change:
> > > > > ```
> > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > > > $(EXTRA_CPPFLAGS) $<
> > > > > ```
> > > > > to:
> > > > > ```
> > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > > > $(EXTRA_CPPFLAGS) $<
> > > > > ```
> > > > > as well as point the CC variable to the installed gcc, and
> > > > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> > > > >
> > > > > # How the parallelism works in this project
> > > > >
> > > > > In LTO, the Whole Program Analysis decides how to partition the
> > > > > callgraph for running the LTRANS stage in parallel.  This project
> > > > > works very similar to this, however with some changes.
> > > > >
> > > > > The first was to modify the LTO structure so that it accepts
> > > > > the compilation without IR streaming to files.  This avoid an IO
> > > > > overhead when compiling in parallel.
> > > > >
> > > > > The second was to use a custom partitioner to find which nodes
> > > > > should be in the same partition.  This was mainly done to bring COMDAT
> > > > > together, as well as symbols that are part of other symbols, and even
> > > > > private symbols so that we do not output hidden global symbols.
> > > > >
> > > > > However, experiment showed that bringing private symbols together did
> > > > > not yield a interesting speedup on some large files, and therefore
> > > > > we implemented two modes of partitioning:
> > > > >
> > > > > 1. Partition without static promotion. This is the safer method to 
> > > > > use,
> > > > > as we do not 

Re: Where did my function go?

2020-10-20 Thread Jan Hubicka
> On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor  wrote:
> >
> > Hi,
> >
> > On Tue, Oct 20 2020, Richard Biener wrote:
> > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock  
> > > wrote:
> > >>
> > >> Richard,
> > >>
> > >> I guess that will work for me. However, since it
> > >> was decided to remove an identical function,
> > >> why weren't the calls to it adjusted to reflect it?
> > >> If the call wasn't transformed that means it will
> > >> be mapped at some later time. Is that mapping
> > >> available to look at? Because using that would
> > >> also be a potential solution (assuming call
> > >> graph information exists for the deleted function.)
> > >
> > > I'm not sure how the transitional cgraph looks like
> > > during WPA analysis (which is what we're talking about?),
> > > but definitely the IL is unmodified in that state.
> > >
> > > Maybe Martin has an idea.
> > >
> >
> > Exactly, the cgraph_edges is where the correct call information is
> > stored until the inlining transformation phase calls
> > cgraph_edge::redirect_call_stmt_to_callee is called on it - inlining is
> > a special pass in this regard that performs this IPA-infrastructure
> > function in addition to actual inlining.
> >
> > In cgraph means the callee itself but also information in
> > e->callee->clone.param_adjustments which might be interesting for any
> > struct-reorg-like optimizations (...and in future possibly in other
> > transformation summaries).
> >
> > The late IPA passes are in very unfortunate spot here since they run
> > before the real-IPA transformation phases but after unreachable node
> > removals and after clone materializations and so can see some but not
> > all of the changes performed by real IPA passes.  The reason for that is
> > good cache locality when late IPA passes are either not run at all or
> > only look at small portion of the compilation unit.  In such case IPA
> > transformations of a function are followed by all the late passes
> > working on the same function.
> >
> > Late IPA passes are unfortunately second class citizens and I would
> > strongly recommend not to use them since they do not fit into our
> > otherwise robust IPA framework very well.  We could probably provide a
> > mechanism that would allow late IPA passes to run all normal IPA
> > transformations on a function so they could clearly see what they are
> > looking at, but extensive use would slow compilation down so its use
> > would be frowned upon at the very least.
> 
> So IPA PTA does get_body () on the nodes it wants to analyze and I
> thought that triggers any pending IPA transforms?

Yes, it does (and get_untransformed_body does not).

Honza
> 
> Richard.
> 
> > Martin
> >


Re: Where did my function go?

2020-10-20 Thread Jan Hubicka
> > On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor  wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Oct 20 2020, Richard Biener wrote:
> > > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock  
> > > > wrote:
> > > >>
> > > >> Richard,
> > > >>
> > > >> I guess that will work for me. However, since it
> > > >> was decided to remove an identical function,
> > > >> why weren't the calls to it adjusted to reflect it?
> > > >> If the call wasn't transformed that means it will
> > > >> be mapped at some later time. Is that mapping
> > > >> available to look at? Because using that would
> > > >> also be a potential solution (assuming call
> > > >> graph information exists for the deleted function.)
> > > >
> > > > I'm not sure how the transitional cgraph looks like
> > > > during WPA analysis (which is what we're talking about?),
> > > > but definitely the IL is unmodified in that state.
> > > >
> > > > Maybe Martin has an idea.
> > > >
> > >
> > > Exactly, the cgraph_edges is where the correct call information is
> > > stored until the inlining transformation phase calls
> > > cgraph_edge::redirect_call_stmt_to_callee is called on it - inlining is
> > > a special pass in this regard that performs this IPA-infrastructure
> > > function in addition to actual inlining.
> > >
> > > In cgraph means the callee itself but also information in
> > > e->callee->clone.param_adjustments which might be interesting for any
> > > struct-reorg-like optimizations (...and in future possibly in other
> > > transformation summaries).
> > >
> > > The late IPA passes are in very unfortunate spot here since they run
> > > before the real-IPA transformation phases but after unreachable node
> > > removals and after clone materializations and so can see some but not
> > > all of the changes performed by real IPA passes.  The reason for that is
> > > good cache locality when late IPA passes are either not run at all or
> > > only look at small portion of the compilation unit.  In such case IPA
> > > transformations of a function are followed by all the late passes
> > > working on the same function.
> > >
> > > Late IPA passes are unfortunately second class citizens and I would
> > > strongly recommend not to use them since they do not fit into our
> > > otherwise robust IPA framework very well.  We could probably provide a
> > > mechanism that would allow late IPA passes to run all normal IPA
> > > transformations on a function so they could clearly see what they are
> > > looking at, but extensive use would slow compilation down so its use
> > > would be frowned upon at the very least.
> > 
> > So IPA PTA does get_body () on the nodes it wants to analyze and I
> > thought that triggers any pending IPA transforms?
> 
> Yes, it does (and get_untransformed_body does not)
And to correct Martin's explanation a bit: the late IPA passes are
intended to work, though I was mostly planning them for prototyping true
IPA passes and also possibly for implementing passes that inspect only a
few functions.

IPA transforms happen when get_body is called.  With LTO this also
triggers reading the body from disk.  So if you want to see all bodies
and work on them, you can simply call get_body on everything, but it will
result in increased memory use since everything will be loaded from disk
and expanded (by inlining) at once instead of on a per-function
basis.
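
That is, roughly (a sketch of the memory-hungry variant only, assuming the
usual cgraph walk inside an IPA pass):

  /* Force IPA transforms - and, with LTO, the read-in of the body - for
     every function that has a gimple body before analyzing it.  */
  cgraph_node *node;
  FOR_EACH_FUNCTION_WITH_GIMPLE_BODY (node)
    node->get_body ();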

get_body is simply meant to arrange for the body on demand.  The pass
manager uses it before late passes are executed and ipa-pta uses it before
it builds constraints (which is not good for the reasons described above).

Clone materialization is also triggered by get_body. The clone
materialization pass mostly exists to remove unreachable function
bodies. I plan to get rid of it, since now that we are better at doing IPA
transforms it brings in a lot of bodies already. For cc1plus it is well
over 1GB of memory.

Honza
> 
> Honza
> > 
> > Richard.
> > 
> > > Martin
> > >


Re: Git rejecting branch merge

2020-09-29 Thread Jan Hubicka
> On Tue, Sep 29, 2020 at 9:17 AM Jan Hubicka  wrote:
> >
> > Hello,
> > I am trying to update me/honza-gcc-benchmark-branch to current trunk
> > which I do by deleting it, recreating locally and pushing out.
> >
> > The problem is that the push fails witih:
> >
> > remote: *** The following commit was rejected by your 
> > hooks.commit-extra-checker script (status: 1)
> > remote: *** commit: 03e87724864a17e22c9b692cc0caa014e9dba6b1
> > remote: *** The first line of a commit message should be a short
> > description of the change, not a single word.
> > remote: error: hook declined to update
> >
> > The broken commit is:
> >
> > $ git show 03e87724864a17e22c9b692cc0caa014e9dba6b1
> >
> > commit 03e87724864a17e22c9b692cc0caa014e9dba6b1
> > Author: Georg-Johann Lay 
> > Date:   Tue Jan 14 11:09:38 2020 +0100
> 
> It's odd that your branch contains a change as old as this?

I use it for benchmarking on LNT and it has indeed been unused for some time.
> 
> Maybe you can rebase onto sth more recent.

That is what I am trying to do. But my understanding is that in the meantime
the hook checking for more than one word in the description was implemented
and thus I can't.

Perhaps I can just remove the remote branch and create it again?

Honza
> 
> >
> > Typo.
> > libgcc/
> > * config/avr/lib1funcs.S (skip): Simplify.
> >
> >
> > I wonder is there a way to get this done?
> >
> > Thanks,
> > Honza


Re: Git rejecting branch merge

2020-09-29 Thread Jan Hubicka
> On Tue, 29 Sep 2020, Joel Brobecker wrote:
> 
> > > > That's correct. The commit-extra-checker is called using the same list
> > > > of "added commits" as the other checks implemented in the hooks, and
> > > > that list excludes all commits accessible from existing references
> > > > in the repository.
> > > 
> > > Since 03e87724864a17e22c9b692cc0caa014e9dba6b1 has been in the repository 
> > > (on master) since before GCC 10 branched, something must be going wrong 
> > > for a push to be rejected based on a check of that commit.
> > 
> > OK. Can you create a tarball of the GCC's bare repository as it is now,
> > and give me access to both this tarball and the branch that the user
> > was trying to update, I can try to spend a bit of time this weekend
> > trying to reproduce and then investigating it. And just for the avoidance
> > of doubt, if I could get the git command that was used to attempt
> > the push, this would avoid wasting time investigating the wrong thing.
> 
> It's /git/gcc.git on sourceware (I think you have shell access, or can git 
> clone --mirror to get a full bare copy).  
> refs/users/hubicka/heads/honza-gcc-benchmark-branch currently points to 
> c478047c0fd71e8bd8e069c729b57a89b75ee004, "Add changelog".  I don't know 
> exactly what Honza is pushing there (whether it's a merge or a rebase), 
> but whatever he's pushing, the hook should not be checking commits that 
> are already in the repository.

I did the following (and perhaps you can do it for me)

git branch -D me/honza-gcc-benchmark-branch
git checkout master
git co -b me/honza-gcc-benchmark-branch
git push -f

Which I hope should replace the existing (old) branch with a copy of
current trunk on which I could do more tuning tests.

Honza
> 
> -- 
> Joseph S. Myers
> jos...@codesourcery.com


Re: Git rejecting branch merge

2020-09-29 Thread Jan Hubicka
> On Tue, Sep 29, 2020 at 07:16:45PM +0200, Jan Hubicka wrote:
> > > On Tue, 29 Sep 2020, Joel Brobecker wrote:
> > > 
> > > > > > That's correct. The commit-extra-checker is called using the same 
> > > > > > list
> > > > > > of "added commits" as the other checks implemented in the hooks, and
> > > > > > that list excludes all commits accessible from existing references
> > > > > > in the repository.
> > > > > 
> > > > > Since 03e87724864a17e22c9b692cc0caa014e9dba6b1 has been in the 
> > > > > repository 
> > > > > (on master) since before GCC 10 branched, something must be going 
> > > > > wrong 
> > > > > for a push to be rejected based on a check of that commit.
> > > > 
> > > > OK. Can you create a tarball of the GCC's bare repository as it is now,
> > > > and give me access to both this tarball and the branch that the user
> > > > was trying to update, I can try to spend a bit of time this weekend
> > > > trying to reproduce and then investigating it. And just for the 
> > > > avoidance
> > > > of doubt, if I could get the git command that was used to attempt
> > > > the push, this would avoid wasting time investigating the wrong thing.
> > > 
> > > It's /git/gcc.git on sourceware (I think you have shell access, or can 
> > > git 
> > > clone --mirror to get a full bare copy).  
> > > refs/users/hubicka/heads/honza-gcc-benchmark-branch currently points to 
> > > c478047c0fd71e8bd8e069c729b57a89b75ee004, "Add changelog".  I don't know 
> > > exactly what Honza is pushing there (whether it's a merge or a rebase), 
> > > but whatever he's pushing, the hook should not be checking commits that 
> > > are already in the repository.
> > 
> > I did the following (and perhaps you can do it for me)
> > 
> > git branch -D me/honza-gcc-benchmark-branch
> > git checkout master
> > git co -b me/honza-gcc-benchmark-branch
> > git push -f
> > 
> > Which I hope should replace the existing (old) branch by a copy of
> > current trunk on which I could do more of tuning tests.
> 
> So, IIUC, you're trying to replace the old
> refs/users/hubicka/heads/honza-gcc-benchmark-branch
> with whatever is on master at this moment, is that correct
> (that would wipe whatever changes you've made in your old
> branch)?
> 
> For instance, currently master is pointing to commit
> adcf8a11c772e7a0c64d4ae3eb19a520566f32b9.

Yes, I want to replace it with current master.

Thanks,
Honza
> 
> -- 
> Joel


Re: Git rejecting branch merge

2020-10-01 Thread Jan Hubicka
> > So, IIUC, you're trying to replace the old
> > refs/users/hubicka/heads/honza-gcc-benchmark-branch
> > with whatever is on master at this moment, is that correct
> > (that would wipe whatever changes you've made in your old
> > branch)?
> > 
> > For instance, currently master is pointing to commit
> > adcf8a11c772e7a0c64d4ae3eb19a520566f32b9.
> 
> Yes, i want to replace it with current master.

Hello,
I wonder if I can get the branch moved, so I can do the benchmarking :)
Any suggestions on how to do that?

Thanks,
Honza
> 
> Thanks,
> Honza
> > 
> > -- 
> > Joel


Git rejecting branch merge

2020-09-29 Thread Jan Hubicka
Hello,
I am trying to update me/honza-gcc-benchmark-branch to current trunk
which I do by deleting it, recreating locally and pushing out.

The problem is that the push fails with:

remote: *** The following commit was rejected by your 
hooks.commit-extra-checker script (status: 1)
remote: *** commit: 03e87724864a17e22c9b692cc0caa014e9dba6b1
remote: *** The first line of a commit message should be a short
description of the change, not a single word.
remote: error: hook declined to update

The broken commit is:

$ git show 03e87724864a17e22c9b692cc0caa014e9dba6b1

commit 03e87724864a17e22c9b692cc0caa014e9dba6b1
Author: Georg-Johann Lay 
Date:   Tue Jan 14 11:09:38 2020 +0100

Typo.
libgcc/
* config/avr/lib1funcs.S (skip): Simplify.


I wonder is there a way to get this done?

Thanks,
Honza


Re: GCC usage script

2020-07-07 Thread Jan Hubicka
> Hey.
> 
> Some of you may know Honza's images about CPU utilization and memory usage
> when using LTO [1]. Last week I played with Chromium and Firefox and I wanted
> to visualize their utilization. That's why I came up with a new script [2].
> 
> You can easily wrap a command and the script generates graphs for you:.
> The script uses modern psutil library and plots through a great matplotlib:
> $ usage-wrapper.py -v -t postgresql 'make -j16'
> ...
> $ eog usage.svg
> 
> There's a gallery of the collected reports for Firefox and Chromium:
> https://gist.github.com/marxin/223890df4d8d8e490b6b2918b77dacad

Thanks, it looks nice.
I am not sure what the blue line means - if it is the overall usage from
vmstat then I guess you do not calculate the COW of the stream-out processes
correctly, and that is why you get such a high peak for them.
> 
> and yes, we have a LTO regression since GCC 9 that I'm going to take closer 
> to.

Where do you see the regression? (just trying to make sense of the
numbers).
For Firefox I see you get around a 16GB streaming memory peak and 32GB for
the ltranses, which seems kind of acceptable for such a large parallelism.

Do you have a time report for the Chromium WPA? It seems to be stuck on
something for quite a while.
I would say that a simple time report + ideally a perf profile should make
it possible to track this down.

Honza
> 
> Martin
> 
> [1] 
> https://4.bp.blogspot.com/-fvrkYSMBqug/XMsTqg4HEkI/Gl8/8sp1GWv6Oe8tfBfL6aO8Nbq5j3hExurpwCEwYBhgL/s1600/llvm.png
> [2] https://github.com/marxin/script-misc/blob/master/usage-wrapper.py


Re: gcc does not reduce the function call to the result if called function is not static when using -O2, only with -O3, clang and msvc do the optimization also with -O2

2020-12-05 Thread Jan Hubicka
> gcc does not reduce to call result if called function is not static in
> -O2 (will do with -O2)
> clang and msvc does it also in -O2 regardless of the function beeing
> static or not
> 
> can someone explain to me why the -O2 optimizer is not able(allowed) to
> reduce this small sample the same way as clang/msvc?

GCC is optimizing for size here, so it is more careful with inlining.
This is because it knows that the function main is called once.  If you
rename main to something else or add a loop around the call, then it
will inline for speed and do the same work as clang.  Clang (and probably
msvc) does not implement the heuristic that certain functions
(static constructors, destructors, main and noreturns) are called once,
so they probably both optimize for speed.

Even when optimizing for size it would be a good idea to inline here.
However the inliner heuristics predict that it is not.  This is because at
inlining time the compiler does not see that the callee will optimize to a
constant.  The reason is that you store the temporary values to an array
and those are not tracked.  If you used scalar variables it would be able
to constant fold everything early.

Handling this would require either recovering early ipa-sra or adding
return functions for values passed by reference.

Honza


Re: Detect EAF flags in ipa-modref

2020-11-09 Thread Jan Hubicka
Hi,
this is an updated patch for autodetection of EAF flags.  The goal is still
to avoid fancy stuff and get the basic logic in place (so no dataflow, no IPA
propagation, no attempts to handle trickier cases).  There is one new failure:

./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/sso/t2.c   -Wno-scalar-storage-order 
-O1 -fno-inline  output pattern test
./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/sso/t2.c   -Wno-scalar-storage-order 
-O2  output pattern test
./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/sso/t2.c   -Wno-scalar-storage-order 
-Os  output pattern test

Which I believe is a bug exposed by detecting the dump function to be
EAF_DIRECT and NOESCAPE (which it is); packing/unpacking the bitfields leads
to a one-bit difference.  Still no idea why.

The patch seems to be quite effective on cc1plus, turning:

Alias oracle query stats:
  refs_may_alias_p: 65808750 disambiguations, 75664890 queries
  ref_maybe_used_by_call_p: 153485 disambiguations, 66711204 queries
  call_may_clobber_ref_p: 22816 disambiguations, 28889 queries
  nonoverlapping_component_refs_p: 0 disambiguations, 36846 queries
  nonoverlapping_refs_since_match_p: 27271 disambiguations, 58917 must 
overlaps, 86958 queries
  aliasing_component_refs_p: 65808 disambiguations, 2067256 queries
  TBAA oracle: 25929211 disambiguations 60395141 queries
   12391384 are in alias set 0
   10783783 queries asked about the same object
   126 queries asked about the same alias set
   0 access volatile
   9598698 are dependent in the DAG
   1691939 are aritificially in conflict with void *

Modref stats:
  modref use: 14284 disambiguations, 53336 queries
  modref clobber: 1660281 disambiguations, 2130440 queries
  4311165 tbaa queries (2.023603 per modref query)
  685304 base compares (0.321673 per modref query)

PTA query stats:
  pt_solution_includes: 959190 disambiguations, 13169678 queries
  pt_solutions_intersect: 1050969 disambiguations, 13246686 queries

Alias oracle query stats:
  refs_may_alias_p: 66914578 disambiguations, 76692648 queries
  ref_maybe_used_by_call_p: 244077 disambiguations, 67732086 queries
  call_may_clobber_ref_p: 111475 disambiguations, 116613 queries
  nonoverlapping_component_refs_p: 0 disambiguations, 37091 queries
  nonoverlapping_refs_since_match_p: 27267 disambiguations, 59019 must 
overlaps, 87056 queries
  aliasing_component_refs_p: 65870 disambiguations, 2063394 queries
  TBAA oracle: 26024415 disambiguations 60579490 queries
   12450910 are in alias set 0
   10806673 queries asked about the same object
   125 queries asked about the same alias set
   0 access volatile
   9605837 are dependent in the DAG
   1691530 are aritificially in conflict with void *

Modref stats:
  modref use: 14272 disambiguations, 277680 queries
  modref clobber: 1669753 disambiguations, 7849135 queries
  4330162 tbaa queries (0.551674 per modref query)
  699241 base compares (0.089085 per modref query)

PTA query stats:
  pt_solution_includes: 1833920 disambiguations, 13846032 queries
  pt_solutions_intersect: 1093785 disambiguations, 13309954 queries

So almost twice as many pt_solution_includes disambiguations.
I will re-run the stats overnight to be sure that it is not an independent
change (but both builds were from almost the same checkout).

Bootstrapped/regtested x86_64-linux, OK?
(I will analyze the t2.c failure further)

Honza

gcc/ChangeLog:

2020-11-10  Jan Hubicka  

* gimple.c: Include ipa-modref-tree.h and ipa-modref.h.
(gimple_call_arg_flags): Use modref to determine flags.
* ipa-modref.c: Include gimple-ssa.h, tree-phinodes.h,
tree-ssa-operands.h, stringpool.h and tree-ssanames.h.
(analyze_ssa_name_flags): Declare.
(modref_summary::useful_p): Summary is also useful if arg flags are
known.
(dump_eaf_flags): New function.
(modref_summary::dump): Use it.
(get_modref_function_summary): Be ready for current_function_decl
being NULL.
(memory_access_to): New function.
(deref_flags): New function.
(call_lhs_flags): New function.
(analyze_parms): New function.
(analyze_function): Use it.
* ipa-modref.h (struct modref_summary): Add arg_flags.

gcc/testsuite/ChangeLog:

2020-11-10  Jan Hubicka  

* gcc.dg/torture/pta-ptrarith-1.c: Escape parametrs.

diff --git a/gcc/gimple.c b/gcc/gimple.c
index 1afed88e1f1..da90716aa23 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -46,6 +46,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "asan.h"
 #include "langhooks.h"
 #include "attr-fnspec.h"
+#include "ipa-modref-tree.h"
+#include "ipa-modref.h"
 
 
 /* All the tuples have their operand vector (if present) at the very bottom
@@ -1532,24 +1534,45 @@ int
 gimple_call_arg_flags (const gcall *stmt, unsigned arg)
 

Definition of EAF_NOESCAPE and fnspec strings

2020-11-08 Thread Jan Hubicka
Hi,
I implemented simple propagation of EAF flags of arguments in ipa-modref
(which is not hard to do).  My main aim was to detect cases where the THIS
parameter does not escape but is used to read/write the pointed-to memory.
This is meant to avoid the poor code produced when we e.g. offline
destructors on a cold path.

Unfortunately detecting EAF_NOESCAPE for such parameters breaks things.
This is already visible on memcpy if we let it be handled via its fnspec
attribute.  If I disable the special handling in structalias:

diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index a4832b75436..2b614913b57 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -4389,7 +4389,6 @@ find_func_aliases_for_builtin_call (struct function *fn, 
gcall *t)
   case BUILT_IN_STRCPY:
   case BUILT_IN_STRNCPY:
   case BUILT_IN_BCOPY:
-  case BUILT_IN_MEMCPY:
   case BUILT_IN_MEMMOVE:
   case BUILT_IN_MEMPCPY:
   case BUILT_IN_STPCPY:

In the following testcase we miss the fact that memcpy may merge the PTA
set of i4p into i3p.

extern void exit (int);
extern void abort (void);
int size = 8;

int main (void)
{
  int i3 = -1, i4 = 55;
  int *i3p = &i3;
  int *i4p = &i4;

  memcpy(&i3p, &i4p, size);
  if (i3p != i4p)
    abort ();
  exit (0);
}

This seems to be documented to some degree:

  /* Add *tem = nonlocal, do not add *tem = callused as
 EAF_NOESCAPE parameters do not escape to other parameters
 and all other uses appear in NONLOCAL as well.  */

But it also means that some of our FNSPECs are wrong now.  I wonder if we can
split this property into two different flags like EAF_NOESCAPE and
EAF_INDIRECT_NOESCAPE?

Auto-detecting EAF_INDIRECT_NOESCAPE seems a bit harder, at least assuming
that any read can actually load a few bits of a pointer possibly written
there, because if a simple member function accesses values pointed to by
THIS and returns them, we lose track of whether the returned value is an
escape point, I would say.

The difference in constraints is:

--- good/a.c.035t.ealias 2020-11-08 13:34:04.397499515 +0100
+++ a.c.035t.ealias 2020-11-08 13:39:15.483438469 +0100
@@ -106,7 +106,15 @@
 size = NONLOCAL
 size.0_1 = size
 _2 = size.0_1
-i3p = i4p
+callarg(18) = 
+callarg(18) = callarg(18) + UNKNOWN
+CALLUSED(16) = callarg(18)
+CALLCLOBBERED(17) = callarg(18)
+*callarg(18) = NONLOCAL
+callarg(19) = 
+callarg(19) = callarg(19) + UNKNOWN
+CALLUSED(16) = callarg(19)
+ESCAPED = _2
 i3p.1_3 = i3p
 i4p.2_4 = i4p
 ESCAPED = 

...

 Call clobber information
 
-ESCAPED, points-to NULL, points-to vars: { }
+ESCAPED, points-to non-local, points-to NULL, points-to vars: { }
 
 Flow-insensitive points-to information
 
-i3p.1_3, points-to NULL, points-to vars: { D.1947 D.1948 }
+i3p.1_3, points-to non-local, points-to escaped, points-to NULL, points-to 
vars: { D.1947 }
 i4p.2_4, points-to NULL, points-to vars: { D.1948 }
 
 main ()

Honza


Re: Detect EAF flags in ipa-modref

2020-11-10 Thread Jan Hubicka
> 
> Alias oracle query stats:
>   refs_may_alias_p: 65808750 disambiguations, 75664890 queries
>   ref_maybe_used_by_call_p: 153485 disambiguations, 66711204 queries
>   call_may_clobber_ref_p: 22816 disambiguations, 28889 queries
>   nonoverlapping_component_refs_p: 0 disambiguations, 36846 queries
>   nonoverlapping_refs_since_match_p: 27271 disambiguations, 58917 must 
> overlaps, 86958 queries
>   aliasing_component_refs_p: 65808 disambiguations, 2067256 queries
>   TBAA oracle: 25929211 disambiguations 60395141 queries
>12391384 are in alias set 0
>10783783 queries asked about the same object
>126 queries asked about the same alias set
>0 access volatile
>9598698 are dependent in the DAG
>1691939 are aritificially in conflict with void *
> 
> Modref stats:
>   modref use: 14284 disambiguations, 53336 queries
>   modref clobber: 1660281 disambiguations, 2130440 queries
>   4311165 tbaa queries (2.023603 per modref query)
>   685304 base compares (0.321673 per modref query)
> 
> PTA query stats:
>   pt_solution_includes: 959190 disambiguations, 13169678 queries
>   pt_solutions_intersect: 1050969 disambiguations, 13246686 queries

I re-ran mainline without EAF flag propagation and the results
are correct, so it is not due to an independent change pushed between my
tests.

Honza


Re: Detect EAF flags in ipa-modref

2020-11-10 Thread Jan Hubicka
> Bootstrapped/regtested x86_64-linux, OK?
> (I will analyze more the t2.c failure)

I have found an independent reproducer that is now in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97775

Honza


Re: Detect EAF flags in ipa-modref

2020-11-10 Thread Jan Hubicka
> > Bootstrapped/regtested x86_64-linux, OK?
> > (I will analyze more the t2.c failure)
> 
> I have found independent reproducer that is now in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97775
... and with Jakub's fix the testcase works now.
Honza
> 
> Honza


Re: Detect EAF flags in ipa-modref

2020-11-10 Thread Jan Hubicka
Hi,
here is the updated patch.

Honza

Bootstrapped/regtested x86_64-linux, OK (after the fnspec fixes)?

2020-11-10  Jan Hubicka  

* gimple.c: Include ipa-modref-tree.h and ipa-modref.h.
(gimple_call_arg_flags): Use modref to determine flags.
* ipa-modref.c: Include gimple-ssa.h, tree-phinodes.h,
tree-ssa-operands.h, stringpool.h and tree-ssanames.h.
(analyze_ssa_name_flags): Declare.
(modref_summary::useful_p): Summary is also useful if arg flags are
known.
(dump_eaf_flags): New function.
(modref_summary::dump): Use it.
(get_modref_function_summary): Be ready for current_function_decl
being NULL.
(memory_access_to): New function.
(deref_flags): New function.
(call_lhs_flags): New function.
(analyze_parms): New function.
(analyze_function): Use it.
* ipa-modref.h (struct modref_summary): Add arg_flags.
* doc/invoke.texi (ipa-modref-max-depth): Document.
* params.opt (ipa-modref-max-depth): New param.

gcc/testsuite/ChangeLog:

2020-11-10  Jan Hubicka  

* gcc.dg/torture/pta-ptrarith-1.c: Escape parameters.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d2a188d7c75..0bd76d2841e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12953,6 +12953,10 @@ memory locations using the mod/ref information.  This 
parameter ought to be
 bigger than @option{--param ipa-modref-max-bases} and @option{--param
 ipa-modref-max-refs}.
 
+@item ipa-modref-max-depth
+Specified the maximum depth of DFS walk used by modref escape analysis.
+Setting to 0 disables the analysis completely.
+
 @item profile-func-internal-id
 A parameter to control whether to use function internal id in profile
 database lookup. If the value is 0, the compiler uses an id that
diff --git a/gcc/gimple.c b/gcc/gimple.c
index 1afed88e1f1..da90716aa23 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -46,6 +46,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "asan.h"
 #include "langhooks.h"
 #include "attr-fnspec.h"
+#include "ipa-modref-tree.h"
+#include "ipa-modref.h"
 
 
 /* All the tuples have their operand vector (if present) at the very bottom
@@ -1532,24 +1534,45 @@ int
 gimple_call_arg_flags (const gcall *stmt, unsigned arg)
 {
   attr_fnspec fnspec = gimple_call_fnspec (stmt);
-
-  if (!fnspec.known_p ())
-return 0;
-
   int flags = 0;
 
-  if (!fnspec.arg_specified_p (arg))
-;
-  else if (!fnspec.arg_used_p (arg))
-flags = EAF_UNUSED;
-  else
+  if (fnspec.known_p ())
 {
-  if (fnspec.arg_direct_p (arg))
-   flags |= EAF_DIRECT;
-  if (fnspec.arg_noescape_p (arg))
-   flags |= EAF_NOESCAPE;
-  if (fnspec.arg_readonly_p (arg))
-   flags |= EAF_NOCLOBBER;
+  if (!fnspec.arg_specified_p (arg))
+   ;
+  else if (!fnspec.arg_used_p (arg))
+   flags = EAF_UNUSED;
+  else
+   {
+ if (fnspec.arg_direct_p (arg))
+   flags |= EAF_DIRECT;
+ if (fnspec.arg_noescape_p (arg))
+   flags |= EAF_NOESCAPE;
+ if (fnspec.arg_readonly_p (arg))
+   flags |= EAF_NOCLOBBER;
+   }
+}
+  tree callee = gimple_call_fndecl (stmt);
+  if (callee)
+{
+  cgraph_node *node = cgraph_node::get (callee);
+  modref_summary *summary = node ? get_modref_function_summary (node)
+   : NULL;
+
+  if (summary && summary->arg_flags.length () > arg)
+   {
+ int modref_flags = summary->arg_flags[arg];
+
+ /* We have possibly optimized out load.  Be conservative here.  */
+ if (!node->binds_to_current_def_p ())
+   {
+ if ((modref_flags & EAF_UNUSED) && !(flags & EAF_UNUSED))
+   modref_flags &= ~EAF_UNUSED;
+ if ((modref_flags & EAF_DIRECT) && !(flags & EAF_DIRECT))
+   modref_flags &= ~EAF_DIRECT;
+   }
+ flags |= modref_flags;
+   }
 }
   return flags;
 }
diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 3f46bebed3c..30e76580fb0 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -61,6 +61,15 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-fnsummary.h"
 #include "attr-fnspec.h"
 #include "symtab-clones.h"
+#include "gimple-ssa.h"
+#include "tree-phinodes.h"
+#include "tree-ssa-operands.h"
+#include "ssa-iterators.h"
+#include "stringpool.h"
+#include "tree-ssanames.h"
+
+static int analyze_ssa_name_flags (tree name,
+  vec _flags, int depth);
 
 /* We record fnspec specifiers for call edges since they depends on actual
gimple statements.  */
@@ -186,6 +195,8 @@ modref_summary::useful_p (int ecf_flags)
 {
   if (ecf_flags & (ECF_CONST | ECF_NOVOPS))
 return

Re: Detect EAF flags in ipa-modref

2020-11-10 Thread Jan Hubicka
> > +  tree callee = gimple_call_fndecl (stmt);
> > +  if (callee)
> > +{
> > +  cgraph_node *node = cgraph_node::get (callee);
> > +  modref_summary *summary = node ? get_modref_function_summary (node)
> > +   : NULL;
> > +
> > +  if (summary && summary->arg_flags.length () > arg)
> 
> So could we make modref "transform" push this as fnspec attribute or
> would that not really be an optimization?

It was my original plan to synthesize fnspecs, but I think it is not
a very good idea: we have the summary readily available and we can
represent information that fnspecs can't
(there are no artificial limits on the number of parameters or counts).

I would prefer fnspecs to be used only for in-compiler declarations.
> > +
> > +/* Analyze EAF flags for SSA name NAME.
> > +   KNOWN_FLAGS is a cache for flags we already determined.
> > +   DEPTH is a recursion depth used to make debug output prettier.  */
> > +
> > +static int
> > +analyze_ssa_name_flags (tree name, vec *known_flags, int depth)
> 
> C++ has references which makes the access to known_flags nicer ;)

Yay, will change that :)
> 
> > +{
> > +  imm_use_iterator ui;
> > +  gimple *use_stmt;
> > +  int flags = EAF_DIRECT | EAF_NOCLOBBER | EAF_NOESCAPE | EAF_UNUSED;
> > +
> > +  /* See if value is already computed.  */
> > +  if ((*known_flags)[SSA_NAME_VERSION (name)])
> > +{
> > +  /* Punt on cycles for now, so we do not need dataflow.  */
> > +  if ((*known_flags)[SSA_NAME_VERSION (name)] == 1)
> > +   {
> > + if (dump_file)
> > +   fprintf (dump_file,
> > +"%*sGiving up on a cycle in SSA graph\n", depth * 4, "");
> > + return 0;
> > +   }
> > +  return (*known_flags)[SSA_NAME_VERSION (name)] - 2;
> > +}
> > +  /* Recursion guard.  */
> > +  (*known_flags)[SSA_NAME_VERSION (name)] = 1;
> 
> This also guards against multiple evaluations of the same stmts
> but only in some cases?  Consider
> 
>   _1 = ..;
>   _2 = _1 + _3;
>   _4 = _1 + _5;
>   _6 = _2 + _4;
> 
> where we visit _2 = and _4 = from _1 but from both are going
> to visit _6.

Here we first push _6, then we go for _2, then for _1; we evaluate _1,
evaluate _2, go for _4 and evaluate _4, and then evaluate _6.
It is a DFS and you need a backward edge in the DFS (coming from a PHI).

Cycles seem to matter somewhat for GCC: we do have a lot of functions
that walk linked lists that we could track otherwise.
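
For illustration (hypothetical example, not from the mail): a plain list walk
is exactly such a case - the parameter feeds a PHI node for the loop
iterator, so the DFS over the SSA use-def graph sees a cycle and currently
gives up, although the pointer is only dereferenced and never escapes:

struct node { struct node *next; int val; };

int
list_sum (struct node *n)
{
  int sum = 0;
  /* The SSA name for 'n' inside the loop is defined by a PHI fed both by the
     parameter and by the n->next load; that back edge is what the DFS
     currently punts on.  */
  for (; n; n = n->next)
    sum += n->val;
  return sum;
}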
> 
> Maybe I'm blind but you're not limiting depth?  Guess that asks
> for problems, esp. as you are recursing rather than using a
> worklist or so?
> 
> I see you try to "optimize" the walk by only visiting def->use
> links from parameters but then a RPO walk over all stmts would
> be simpler iteration-wise ...
We usually evaluate just a small part of bigger functions (since we lose
track quite easily after hitting the first memory store).  My plan was to
change this to actual dataflow once we have it well defined
(this means after discussing the EAF flags with you and adding the logic to
track call sites for the true IPA pass, which mildly complicates things - for
every SSA name I track the callsite/arg pair where it is passed to,
either directly or indirectly.  Then this is translated into a call summary
and used by the IPA pass to compute the final flags).

I guess I can add --param ipa-modref-walk-depth for now and handle
dataflow incrementally?
In particular I am not sure if I should just write iterated RPO myself
or use tree-ssa-propagate.h (the second may be overkill).
> 
> > +  if (dump_file)
> > +{
> > +  fprintf (dump_file,
> > +  "%*sAnalyzing flags of ssa name: ", depth * 4, "");
> > +  print_generic_expr (dump_file, name);
> > +  fprintf (dump_file, "\n");
> > +}
> > +
> > +  FOR_EACH_IMM_USE_STMT (use_stmt, ui, name)
> > +{
> > +  if (flags == 0)
> > +   {
> > + BREAK_FROM_IMM_USE_STMT (ui);
> > +   }
> > +  if (is_gimple_debug (use_stmt))
> > +   continue;
> > +  if (dump_file)
> > +   {
> > + fprintf (dump_file, "%*s  Analyzing stmt:", depth * 4, "");
> > + print_gimple_stmt (dump_file, use_stmt, 0);
> > +   }
> > +
> > +  /* Gimple return may load the return value.  */
> > +  if (gimple_code (use_stmt) == GIMPLE_RETURN)
> 
>  if (greturn *ret = dyn_cast  (use_stmt))
> 
> makes the as_a below not needed, similar for the other cases (it
> also makes accesses cheaper, avoiding some checking code).

Looks cleaner indeed.
> 
> > +   {
> > + if (memory_access_to (gimple_return_retval
> > +  (as_a  (use_stmt)), name))
> > +   flags &= ~EAF_UNUSED;
> > +   }
> > +  /* Account for LHS store, arg loads and flags from callee function.  
> > */
> > +  else if (is_gimple_call (use_stmt))
> > +   {
> > + tree callee = gimple_call_fndecl (use_stmt);
> > +
> > + /* Recursion would require bit of propagation; give up for now.  */
> > + if (callee && recursive_call_p (current_function_decl, callee))
> > +   flags = 0;
> > + else
> > + 

Re: Detect EAF flags in ipa-modref

2020-11-16 Thread Jan Hubicka
> > 
> > Richi, I think we can add "safe" parameter to gimple_call_arg_flags and
> > bypass this logic when we use it for warnings only (having body that
> > does not use the value is quite strong hint that it is unused by the
> > function).
> 
> Eh, please not.

OK, I do not care that much as long as we do not have false positives
everywhere :)

Handling EAF_UNUSED and CONST functions is necessary so we do not get
false positives caused by us optimizing code out.  In the case of not
trusting the EAF_UNUSED flag we will not optimize, so I do not really care.

Martin, we collected very many warnings when building with
configure --with-build-config=bootstrap-lto.mk.
This patch fixes some of them, but there are many others; can you take a
look?

For the testcase in the PR I think it is enough to use default_is_empty_type
to disable the warning, but it is not clear to me why the code uses
default_is_empty_record in the first place.
> 
> > 
> > I played with bit more testcases and found that we also want to disable
> > warning for const functions and sometimes EAF_UNUSED flag is dropped
> > becaue of clobber that is not necessary to do.  If function only clobber
> > the target it can be considered unused past inlining.
> > 
> > I am testing this improved patch and plan to commit if there are no
> > complains, but still we need to handle binds_to_current_def.
> > 
> > On the other direction, Martin, I think we may also warn for args
> > that are !EAF_UNUSED and !EAF_NOCLOBBER.  This will catch some cases
> > where user did not add "const" specifier to the declaration but
> > parameter is detected to be readonly.
> > 
> > I also noticed that we do not detect EAF_UNUSED for fully unused
> > parameters (with no SSA names associated with them).  I will fix that
> > incrementally.
> 
> Make sure to not apply it based on that reason to aggregates ;)

Sure, we already have detection of unused params in ipa-prop, so I guess
we want is_gimple_reg (parm) && !default_def to imply EAF_UNUSED.
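
For illustration (hypothetical example): such a parameter simply has no
default-definition SSA name, so the per-SSA-name walk never visits it, while
a check for a gimple register without a default def would mark it
EAF_UNUSED directly:

int
callee (int used, int unused)
{
  /* 'unused' is a gimple register with no default-def SSA name, so the
     per-SSA-name flag analysis never sees it.  */
  return used + 1;
}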

Honza


Re: Detect EAF flags in ipa-modref

2020-11-16 Thread Jan Hubicka
> On Nov 15 2020, Jan Hubicka wrote:
> 
> >> See PR97840.
> > Thanks,
> > this is a false positive where we fail to discover that pointed-to type
> > is empty on non-x86_64 targets.  This is triggered by better alias
> > analysis caused by non-escape discovery.
> >
> > While this is not a full fix (I hope someone with more experience on
> > C++ types and warnings will set up) I think it may be useful to avoid
> > warning on unused parameter.
> >
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> That doesn't fix anything.

It does silence the warning if you remove inline from the function
declaration (as creduce did while minimizing the testcase - the minimized
testcase was undefined, which is why I did not include it in the end).
I have now implemented one by hand.

The reason is that gimple_call_arg_flags clears EAF_UNUSED
on symbols that are not binds_to_current_def_p, because we are worried that
the symbol will be interposed by a different version of the body with the
same semantics that will actually read the arg.
This is a bit of a paranoid check, since we optimize things like *a == *a
early, but with clang *a will trap if a == NULL.
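
A hypothetical example of what is meant here (my illustration, not from the
mail): after early folding the emitted body never reads the argument, yet an
interposed definition built from the same source might still dereference it:

int
always_true (int *a)
{
  /* GCC folds *a == *a to 1 early, so this body never loads from 'a' and
     modref would report EAF_UNUSED; an interposing definition that does not
     fold the comparison would still trap when a == NULL.  */
  return *a == *a;
}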

Richi, I think we can add a "safe" parameter to gimple_call_arg_flags and
bypass this logic when we use it for warnings only (having a body that
does not use the value is a quite strong hint that it is unused by the
function).

I played with a few more testcases and found that we also want to disable
the warning for const functions, and sometimes the EAF_UNUSED flag is
dropped because of a clobber that is not actually necessary.  If a function
only clobbers the target, it can be considered unused past inlining.

I am testing this improved patch and plan to commit if there are no
complaints, but we still need to handle binds_to_current_def.

In the other direction, Martin, I think we may also warn for args
that are !EAF_UNUSED and !EAF_NOCLOBBER.  This will catch some cases
where the user did not add the "const" specifier to the declaration but
the parameter is detected to be readonly.

I also noticed that we do not detect EAF_UNUSED for fully unused
parameters (with no SSA names associated with them).  I will fix that
incrementally.

Honza

PR middle-end/97840
* ipa-modref.c (analyze_ssa_name_flags): Skip clobbers if inlining
is done.
* tree-ssa-uninit.c (maybe_warn_pass_by_reference): Make stmt gcall;
skip const calls and unused arguments.
    (warn_uninitialized_vars): Update prototype.

gcc/testsuite/ChangeLog:

2020-11-16  Jan Hubicka  

* g++.dg/warn/unit-2.C: New test.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 4a43c50aa66..08fcc36a321 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1333,7 +1331,14 @@ analyze_ssa_name_flags (tree name, vec 
_flags, int depth)
  /* Handle *name = exp.  */
  else if (assign
   && memory_access_to (gimple_assign_lhs (assign), name))
-   flags &= ~(EAF_UNUSED | EAF_NOCLOBBER);
+   {
+ /* In general we can not ignore clobbers because they are
+barriers for code motion, however after inlining it is safe to
+do because local optimization passes do not consider clobbers
+from other functions.  Similar logic is in ipa-pure-const.c.  
*/
+ if (!cfun->after_inlining && !gimple_clobber_p (assign))
+   flags &= ~(EAF_UNUSED | EAF_NOCLOBBER);
+   }
  /* ASM statements etc.  */
  else if (!assign)
{
diff --git a/gcc/testsuite/g++.dg/warn/unit-2.C 
b/gcc/testsuite/g++.dg/warn/unit-2.C
new file mode 100644
index 000..30f3ceae191
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/unit-2.C
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -Wmaybe-uninitialized" } */
+struct a {int a;};
+__attribute__ ((noinline))
+void
+nowarn (const struct a *ptr)
+{
+  if (ptr)
+asm volatile ("");
+}
+void
+test()
+{
+  struct a ptr;
+  nowarn (&ptr);
+}
+__attribute__ ((noinline))
+int
+nowarn2 (const struct a *ptr, const struct a ptr2)
+{
+  return ptr != 0 || ptr2.a;
+}
+int mem;
+int
+test2()
+{
+  struct a ptr,ptr2={0};
+  return nowarn2 (&ptr, ptr2);
+}
diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c
index f23514395e0..c94831bfb75 100644
--- a/gcc/tree-ssa-uninit.c
+++ b/gcc/tree-ssa-uninit.c
@@ -443,7 +443,7 @@ maybe_warn_operand (ao_ref , gimple *stmt, tree lhs, 
tree rhs,
access implying read access to those objects.  */
 
 static void
-maybe_warn_pass_by_reference (gimple *stmt, wlimits )
+maybe_warn_pass_by_reference (gcall *stmt, wlimits )
 {
   if (!wlims.wmaybe_uninit)
 return;
@@ -457,6 +457,10 @@ maybe_warn_pass_by_reference (gimple *stmt, wlimits )
   if (!fntype)
 return;
 
+  /* Const function do not read their arguments.  */
+  if (gimple_call_flags (stmt) & ECF_CONST)
+return;
+
   const built_in_

Re: Detect EAF flags in ipa-modref

2020-11-16 Thread Jan Hubicka
> On 11/16/20 1:44 PM, Jan Hubicka wrote:
> > Martin, we collected very many warnings when building with
> > configure --with-build-config=bootstrap-lto.mk
> > This patch fixes some of them, but there are many others, can you take a
> > look?
> 
> Hello.
> 
> I guess you mean Martin Jambor, or me? Please CC :)
Martin Sebor, added to CC (since he did most of the work on tree-ssa-uninit
recently).

I have filed some PRs on the issues now.
Honza
> 
> Martin


Re: Definition of EAF_NOESCAPE and fnspec strings

2020-11-09 Thread Jan Hubicka
> > But it also means that some of our FNSPECs are wrong now.  I wonder if we 
> > can
> > split this porperty to two different flags like EAF_NOEASCAPE and
> > EAF_INDIRECT_NOESCAPE?
> 
> Note that EAF_NOESCAPE allows "escaping" to the return value (see
> handle_rhs_call).  I guess for simplicity we could allow escaping

I see, missed that! 

> of not escaped but USED params to other written to params.  I think
> we also don't handle "escaping" of a parameter indirectly to itself, thus
> 
> int i;
> int *p = 
> int **q = 
> foo ();
> if (*q != i)
>   abort ();
> 
> and
> 
> foo (int ***p)
> {
>   *p = **p;
> } 
> 
> or so with foos param EAF_NOESCAPE (but not EAF_DIRECT).
> 
> Splitting up EAF_NOESCAPE makes it quite difficult to understand.
> Arguably explicit handling of memcpy and friends _does_ pay off
> for points-to analysis since I'd say modelling all possible and useful
> things in fnspec would make it a monster ... you'd basically want
> to have a way to specify additional constraints in the fnspec itself,
> like *1 = *2, but then also distinguish points-to effects from
> may-alias ones.

Yep, I am not arguing for eliminating the special case of memcpy (because we
have the additional info that it only copies pointers from *src to
*dest).

However I find the current definition of EAF_NOESCAPE a bit hard to handle in
modref, since it is naturally quite reliable to track all uses of the SSA
names that correspond to parameters, but it is harder to track where
values read from the pointed-to memory can eventually go.
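
A small hypothetical example of why memcpy is worth the special case: neither
argument escapes as a pointer, but the pointer value stored in *src becomes
reachable through *dest, which no per-argument EAF flag can express:

void
copy_one_pointer (int **dest, int **src)
{
  /* Copies the pointer stored at *src into *dest; points-to analysis has to
     model the assignment *dest = *src explicitly.  */
  __builtin_memcpy (dest, src, sizeof (*dest));
}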

What I do is to walk all uses of the SSA_NAME that corresponds to a parameter
and try to understand each of them.  Each use is one of:
 1) memory load via dereference of name
 2) memory store where name is base of LHS
 3) memory store where name is rhs
 4) name is passed as a value to function
 5) dereferenced value from name is passed to function
 6) used as value in a normal gimple expression
 7) used in return (as RHS or base of memory dereference address)
 8) it is used only in reference but not as base (as array index or so)

Then I merge in (and) the flags I determine as follows:

For 1) clear EAF_UNUSED and recurse to the LHS name.  Based on its flags I
decide on:
  - EAF_DIRECT (if LHS has EAF_UNUSED),
  - EAF_NOCLOBBER (if LHS has EAF_NOCLOBBER)
  - EAF_NOESCAPE (if LHS has EAF_NOESCAPE).

For 2) I clear the EAF_NOCLOBBER and EAF_UNUSED flags.

For 3) I give up (clear all flags) since I would need to track where the
memory is going.

For 4) I determine the flags of the function parameter.

For 5) I need to do the same handling as in 1), where the flag of the
"loaded value" is taken from the called function.

For 6) I determine the flags of the LHS and merge them in.

For 7) I clear NOESCAPE if the rhs is the name itself,
and UNUSED + NOESCAPE if the rhs is a dereference of the name.

For 8) I do nothing.  Here the names are non-pointers that I track
because of an earlier dereference.
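
To make the cases above more concrete, here is a hypothetical example with
the uses annotated (the numbers refer to the list above; this is my
illustration, not code from the mail):

int *global_ptr;
extern void sink (int *);

int *
example (int **p, int *q, int i)
{
  int *t = *p;       /* 1: load via a dereference of 'p'                     */
  *q = i;            /* 2: store where 'q' is the base of the LHS            */
  global_ptr = q;    /* 3: store where 'q' itself is the RHS - punt          */
  int *r = q + i;    /* 6: ordinary expression use; flags of 'r' are merged  */
  sink (r);          /* 4: (for 'r') passed as a value to a function         */
  return t;          /* 7: value loaded through 'p' reaches the return value */
}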



So I think 7) can be relaxed.  The main problem however is that we often see
1) and then 3) or 7) on the LHS, which makes us punt very often.

The fact that the pointer itself does not escape while the pointed-to memory
can still seems very useful, since one does not need to add *ptr to points-to
sets.  But I will try relaxing 7).

If we allow values escaping to other parameters and to the argument itself,
I think I can relax 3) if the base of the store is the default def of a
PARM_DECL.
> 
> I wonder if we should teach the GIMPLE FE to parse 'fn spec'
> so we can write unit tests for the attribute ... or maybe simply
> add this to the __GIMPLE spec string.

That may be nice, and we should also describe carefully that NOESCAPE and
NOCLOBBER also refer to indirect references.  The current description,
"Nonzero if the argument does not escape.",
reads to me as if it is about ptr itself, not about *ptr, and it also does
not speak of escaping to the return value etc.

Honza


Detect EAF flags in ipa-modref

2020-11-09 Thread Jan Hubicka
> > 
> > Yep, i am not arguing for eliminating special case of memcpy (because we
> > have the additional info that it only copies pointers from *src to
> > *dest).
> > 
> > However I find current definition of EAF_NOESCAPE bit hard to handle in
> > modref, since naturally it is quite reliable to track all uses of ssa
> > name that correspond to parameters, but it is harder to track where
> > values read from pointed-to memory can eventually go.
> 
> Yeah - also the fnspec of memcpy _is_ wrong at the moment ...

Yep.
> > 
> > For 6) I determine flags of LHS and merge them in
> 
> guess you could track a SSA lattice "based on parameter N" which
> would need to be a mask (thus only track up to 32 parameters?)

Yes, I can do that if we relax NOESCAPE this way.
> 
> > For 7) I clear NOESCAPE if rhs is name itself
> > and UNUSED + NOESCAPE if rhs is derefernece from name.
> > 
> > For 8) I do nothing.  Here the names are non-pointers that I track
> > because of earlier dereference.
> > 
> > 
> > 
> > So I think 7) can be relaxed.  Main problem is hoever that we often see 1)
> > and then 3) or 7) on LHS that makes us punt very often.
> > 
> > The fact that pointer directly does not escape but pointed to memory can
> > seems still very useful since one does not need to add *ptr to points-to
> > sets. But I will try relaxing 7).
> > 
> > If we allow values escaping to other parameters and itself, I think I
> > can relax 3) if base of the store is default def of PARM_DECL.
> 
> I think the important part is to get things correct.  Maybe it's worth

Indeed :)
> to add write/read flags where the argument _does_ escape in case the
> function itself is otherwise pure/const.  For PTA that doesn't make
> a difference (and fnspec was all about PTA ...) but for alias-analysis
> it does.

I detect them independently (as UNUSED/NOCLOBBER flags, which is not perfect
since we do not have an OUTPUT flag like in fnspecs), but currently they
are unused since we do not track "pure/const except for known
exceptions".  This is not hard to add.
> 
> > > 
> > > I wonder if we should teach the GIMPLE FE to parse 'fn spec'
> > > so we can write unit tests for the attribute ... or maybe simply
> > > add this to the __GIMPLE spec string.
> > 
> > May be nice and also describe carefully that NOESCAPE and NOCLOBBER also
> > reffers to indirect references.  Current description
> > "Nonzero if the argument does not escape."
> > reads to me that it is about ptr itself, not about *ptr and also it does
> > not speak of the escaping to return value etc.
> 
> Well, if 'ptr' escapes then obvoiously all memory reachable from 'ptr'
> escapes - escaping is always transitive.

Yes, but if the values pointed to by ptr escape, ptr itself does not need
to escape.  This is easy to detect (and is the common case for the THIS
pointer), but we have no way to express it via EAF flags.
> 
> And as escaping is in the context of the caller sth escaping to the
> caller itself (via return) can hardly be considered escaping (again
> this was designed for PTA ...).
> 
> I guess it makes sense to be able to separate escaping from the rest.
I think the current definition (escaping via return is not an escape) is
also OK for modref propagation.  We may have an EAF_NORETURN flag saying
that the value never escapes to the return value; that would also be easy
to detect.

This is a kind of minimal patch for the EAF flag discovery.  It works
only in local ipa-modref and gives up on cyclic SSA graphs.  Adding
propagation is easy, and a proper IPA mode needs collecting call site + arg
pairs that affect the answer.

It passes the testsuite except for the sso/t2.c testcase, where it affects a
bit of the first dumped structure R1.
We correctly determine that it is noescape/noclobber from the function dump,
and it seems that this triggers a kind of strange SRA.  I will look into it.

I am running a full bootstrap/regtest.

On tramp3d the effect is not great, but it does something.

PTA query stats:
  pt_solution_includes: 397269 disambiguations, 606922 queries
  pt_solutions_intersect: 138302 disambiguations, 416884 queries

to

PTA query stats:
  pt_solution_includes: 401540 disambiguations, 609093 queries
  pt_solutions_intersect: 138616 disambiguations, 417174 queries

2020-11-09  Jan Hubicka  

* builtins.c (builtin_fnspec): Fix fnspecs of memcpy and friends.
* gimple.c: Include ipa-modref-tree.h and ipa-modref.h.
(gimple_call_arg_flags): Use modref to determine flags.
* ipa-modref.c: Include gimple-ssa.h, tree-phinodes.h,
tree-ssa-operands.h, stringpool.h and tree-ssanames.h.
(modref_summary::useful_p): Summary is also useful if EAF flags are
known

Re: Detect EAF flags in ipa-modref

2020-11-15 Thread Jan Hubicka
> This breaks bootstrap with go.
> 
> ../../gcc/go/gofrontend/go-diagnostics.cc: In function 'std::string 
> expand_message(const char*, va_list)':
> ../../gcc/go/gofrontend/go-diagnostics.cc:110:61: error: '' may be 
> used uninitialized [-Werror=maybe-uninitialized]
>   110 |  "memory allocation failed in vasprintf");
>   | ^

Well, this may simply be a false positive (except that the warning is
clearly mislocated and cryptic).  I was bootstrapping with go just a few
minutes ago, so I wonder what configure flags you use?

Honza
> 
> Andreas.
> 
> -- 
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."


Re: Detect EAF flags in ipa-modref

2020-11-15 Thread Jan Hubicka
> On Fri, Nov 13, 2020 at 12:07 AM Richard Biener  wrote:
> >
> > On Tue, 10 Nov 2020, Jan Hubicka wrote:
> >
> > > Hi,
> > > here is updaed patch.
> > >
> > > Honza
> > >
> > > Bootstrapped/regtested x86_64-linux, OK (after the fnspec fixes)?
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> 
> This caused:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97836

I have pushed the following fix as obvious.  The problem is caused by
the fact that I made ipa-modref set the EAF_UNUSED flag for parameters that
are used as scalars but not used to read memory, since that suits the
use of the flag in tree-ssa-alias and mostly tree-ssa-structalias, with
the exception of the rule about escaping to the return value.

On Monday I will discuss with Richard whether we want to adjust EAF_UNUSED
or invent EAF_NOREAD, but this should prevent the wrong-code issues.

Honza

2020-11-15  Jan Hubicka  

PR ipa/97836
* ipa-modref.c (analyze_ssa_name_flags): Make return to clear
EAF_UNUSED flag.

gcc/testsuite/ChangeLog:

2020-11-15  Jan Hubicka  

* gcc.c-torture/execute/pr97836.c: New test.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 5273c200f00..4a43c50aa66 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1224,10 +1224,12 @@ analyze_ssa_name_flags (tree name, vec 
_flags, int depth)
  print_gimple_stmt (dump_file, use_stmt, 0);
}
 
-  /* Gimple return may load the return value.  */
+  /* Gimple return may load the return value.
+Returning name counts as an use by tree-ssa-structalias.c  */
   if (greturn *ret = dyn_cast <greturn *> (use_stmt))
{
- if (memory_access_to (gimple_return_retval (ret), name))
+ if (memory_access_to (gimple_return_retval (ret), name)
+ || name == gimple_return_retval (ret))
flags &= ~EAF_UNUSED;
}
   /* Account for LHS store, arg loads and flags from callee function.  */
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr97836.c 
b/gcc/testsuite/gcc.c-torture/execute/pr97836.c
new file mode 100644
index 000..4585e1fc69d
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr97836.c
@@ -0,0 +1,17 @@
+int a;
+
+int b(int c) { return 0; }
+
+static int *d(int *e) {
+  if (a) {
+a = a && b(*e);
+  }
+  return e;
+}
+
+int main() {
+  int f;
+  if (d(&f) != &f)
+__builtin_abort();
+  return 0;
+}


Re: Detect EAF flags in ipa-modref

2020-11-15 Thread Jan Hubicka
> Hi Jan,
> 
> >> This breaks bootstrap with go.
> >> 
> >> ../../gcc/go/gofrontend/go-diagnostics.cc: In function 'std::string 
> >> expand_message(const char*, va_list)':
> >> ../../gcc/go/gofrontend/go-diagnostics.cc:110:61: error: '' may 
> >> be used uninitialized [-Werror=maybe-uninitialized]
> >>   110 |  "memory allocation failed in vasprintf");
> >>   | ^
> >
> > Well, this may be simply a false positive (except that the warning is
> > clearly mislocalted and cryptic) I was bootstrapping with go just few
> > minutes ago, so I wonder what configure flags you use?
> 
> I'm seeing the same on both i386-pc-solaris2.11 and
> sparc-sun-solaris2.11, so I don't think there are special configure
> flags involved.

When I configure with ../configure --enable-languages=go my build
eventually finishes fine on current trunk:

/aux/hubicka/trunk-git/build-go/./gcc/gccgo 
-B/aux/hubicka/trunk-git/build-go/./gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ 
-B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem 
/usr/local/x86_64-pc-linux-gnu/include -isystem 
/usr/local/x86_64-pc-linux-gnu/sys-include   -g -O2 -I 
../x86_64-pc-linux-gnu/libgo -static-libstdc++ -static-libgcc  -L 
../x86_64-pc-linux-gnu/libgo -L ../x86_64-pc-linux-gnu/libgo/.libs -o cgo 
../../gotools/../libgo/go/cmd/cgo/ast.go 
../../gotools/../libgo/go/cmd/cgo/doc.go 
../../gotools/../libgo/go/cmd/cgo/gcc.go 
../../gotools/../libgo/go/cmd/cgo/godefs.go 
../../gotools/../libgo/go/cmd/cgo/main.go 
../../gotools/../libgo/go/cmd/cgo/out.go 
../../gotools/../libgo/go/cmd/cgo/util.go zdefaultcc.go 
../x86_64-pc-linux-gnu/libgo/libgotool.a
make[2]: Leaving directory '/aux/hubicka/trunk-git/build-go/gotools'
make[1]: Leaving directory '/aux/hubicka/trunk-git/build-go'

On Debian and SLES machines.  Having preprocessed go-diagnostics.cc and
the .uninit1 dump would probably help.

These are often false positives, but I saw similar warnings caused by
wrong EAF_UNUSED flag discovery which led to init code being optimized
out before inlining, so I would like to look into the dumps and figure
out whether that is uninit acting funny or there is some real problem.

Honza
> 
>   Rainer
> 
> -- 
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University


Re: Detect EAF flags in ipa-modref

2020-11-15 Thread Jan Hubicka
> Hi Jan,
> 
> >> I'm seeing the same on both i386-pc-solaris2.11 and
> >> sparc-sun-solaris2.11, so I don't think there are special configure
> >> flags involved.
> >
> > When I configure with ../configure --enable-languages=go my build
> > eventually finishes fine on curren trunk:
> [...]
> > On Debian SLES machines.  Having preprocessed go-diagnostics.cc and
> > .uninit1 dump would probably help.
> 
> apart from go/diagnostics.cc, I see the warning several times in
> libstdc++, but only for the 32-bit multilib.
> 
> Try building either for i686-pc-linux-gnu or a bi-arch
> x86_64-pc-linux-gnu build.

My build is bi-arch.  I will check the libstdc++ warnings, but we had some
such warnings there previously too (plus we have tons of uninitialized
warnings with LTO bootstrap).

This really may depend on the glibc headers and how string functions get
expanded.  This is why I think having the preprocessed go-diagnostics.cc
should help.

Honza
> 
>   Rainer
> 
> -- 
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University


Re: Detect EAF flags in ipa-modref

2020-11-15 Thread Jan Hubicka
> See PR97840.
Thanks,
this is a false positive where we fail to discover that the pointed-to type
is empty on non-x86_64 targets.  This is triggered by better alias
analysis caused by the non-escape discovery.

While this is not a full fix (I hope someone with more experience on
C++ types and warnings will step up), I think it may be useful to avoid
warning on unused parameters.

Bootstrapped/regtested x86_64-linux, OK?

PR middle-end/97840
* tree-ssa-uninit.c (maybe_warn_pass_by_reference): Update prototype;
silence warning on EAF_UNUSED parameters.
(warn_uninitialized_vars): Update.
diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c
index f23514395e0..1e074793b02 100644
--- a/gcc/tree-ssa-uninit.c
+++ b/gcc/tree-ssa-uninit.c
@@ -443,7 +443,7 @@ maybe_warn_operand (ao_ref , gimple *stmt, tree lhs, 
tree rhs,
access implying read access to those objects.  */
 
 static void
-maybe_warn_pass_by_reference (gimple *stmt, wlimits )
+maybe_warn_pass_by_reference (gcall *stmt, wlimits )
 {
   if (!wlims.wmaybe_uninit)
 return;
@@ -501,6 +501,10 @@ maybe_warn_pass_by_reference (gimple *stmt, wlimits )
  && !TYPE_READONLY (TREE_TYPE (argtype)))
continue;
 
+ /* Ignore args we are not going to read from.  */
+ if (gimple_call_arg_flags (stmt, argno - 1) & EAF_UNUSED)
+   continue;
+
  if (save_always_executed && access->mode == access_read_only)
/* Attribute read_only arguments imply read access.  */
wlims.always_executed = true;
@@ -639,8 +643,8 @@ warn_uninitialized_vars (bool wmaybe_uninit)
  if (gimple_vdef (stmt))
wlims.vdef_cnt++;
 
- if (is_gimple_call (stmt))
-   maybe_warn_pass_by_reference (stmt, wlims);
+ if (gcall *call = dyn_cast <gcall *> (stmt))
+   maybe_warn_pass_by_reference (call, wlims);
  else if (gimple_assign_load_p (stmt)
   && gimple_has_location (stmt))
{


Re: Where did my function go?

2020-10-21 Thread Jan Hubicka
> On Wed, Oct 21, 2020 at 5:21 AM Gary Oblock  wrote:
> >
> > >IPA transforms happens when get_body is called.  With LTO this also
> > >trigger reading the body from disk.  So if you want to see all bodies
> > >and work on them, you can simply call get_body on everything but it will
> > >result in increased memory use since everything will be loaded form disk
> > >and expanded (by inlining) at once instead of doing it on per-function
> > >basis.
> > Jan,
> >
> > Doing
> >
> > FOR_EACH_FUNCTION_WITH_GIMPLE_BODY ( node) node->get_body ();
> >
> > instead of
> >
> > FOR_EACH_FUNCTION_WITH_GIMPLE_BODY ( node) node->get_untransformed_body ();
> >
> > instantaneously breaks everything...
> 
> I think during WPA you cannot do ->get_body (), only
> ->get_untransformed_body ().  But
> we don't know yet where in the IPA process you're experiencing the issue.

Originally get_body was designed to work in WPA as well: the info about
what transforms are to be applied is kept in a vector with per-function
granularity.  But there may be some issues, as this path is untested and
e.g. ipa-sra/ipa-prop do quite difficult transformations these days.
What happens?

Honza
> 
> Richard.
> 
> > Am I missing something?
> >
> > Gary
> > 
> > From: Jan Hubicka 
> > Sent: Tuesday, October 20, 2020 4:34 AM
> > To: Richard Biener 
> > Cc: GCC Development ; Gary Oblock 
> > 
> > Subject: Re: Where did my function go?
> >
> > [EXTERNAL EMAIL NOTICE: This email originated from an external sender. 
> > Please be mindful of safe email handling and proprietary information 
> > protection practices.]
> >
> >
> > > > On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Tue, Oct 20 2020, Richard Biener wrote:
> > > > > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock 
> > > > > >  wrote:
> > > > > >>
> > > > > >> Richard,
> > > > > >>
> > > > > >> I guess that will work for me. However, since it
> > > > > >> was decided to remove an identical function,
> > > > > >> why weren't the calls to it adjusted to reflect it?
> > > > > >> If the call wasn't transformed that means it will
> > > > > >> be mapped at some later time. Is that mapping
> > > > > >> available to look at? Because using that would
> > > > > >> also be a potential solution (assuming call
> > > > > >> graph information exists for the deleted function.)
> > > > > >
> > > > > > I'm not sure how the transitional cgraph looks like
> > > > > > during WPA analysis (which is what we're talking about?),
> > > > > > but definitely the IL is unmodified in that state.
> > > > > >
> > > > > > Maybe Martin has an idea.
> > > > > >
> > > > >
> > > > > Exactly, the cgraph_edges is where the correct call information is
> > > > > stored until the inlining transformation phase calls
> > > > > cgraph_edge::redirect_call_stmt_to_callee is called on it - inlining 
> > > > > is
> > > > > a special pass in this regard that performs this IPA-infrastructure
> > > > > function in addition to actual inlining.
> > > > >
> > > > > In cgraph means the callee itself but also information in
> > > > > e->callee->clone.param_adjustments which might be interesting for any
> > > > > struct-reorg-like optimizations (...and in future possibly in other
> > > > > transformation summaries).
> > > > >
> > > > > The late IPA passes are in very unfortunate spot here since they run
> > > > > before the real-IPA transformation phases but after unreachable node
> > > > > removals and after clone materializations and so can see some but not
> > > > > all of the changes performed by real IPA passes.  The reason for that 
> > > > > is
> > > > > good cache locality when late IPA passes are either not run at all or
> > > > > only look at small portion of the compilation unit.  In such case IPA
> > > > > transformations of a function are followed by all the late passes
> > > > > working on the same function.
> >

Re: State of AutoFDO in GCC

2021-05-09 Thread Jan Hubicka
> 
> > With my tests, AutoFDO could achieve almost half of the effect of
> > instrumentation FDO on real applications such as MySQL 8.0.20 .
> 
> Likely this could be improved with some of the missing changes. Apparently
> discriminator support is worth quite a bit especially on dense C++ code
> bases. Without that, gcc autofdo only works on line numbers, which can be
> very limiting if single lines have a lot of basic blocks.
> 
> Sadly discriminator support is currently only on the old Google branch and
> not in gcc mainline
> 
> Longer term it would probably be best to replace all this with some custom
> specialized binary annotation instead of stretching DWARF beyond its limits.

I think it makes sense to add AutoFDO to the list of things that
would be nice to fix in GCC 12.  I guess first we need to solve the
issues with the tool producing the autofdo gcda files, and once that works,
set up some benchmarking so we know how things compare to FDO and they
get tested.

If you point me to the discriminator patches I can try to figure out how
hard it would be to mainline them.  I am not too sure about custom
annotations though - storing info about regions of the code segment is
always a pain and it is quite nice that DWARF provides support for that.
But I guess we could go step by step.  I first need a working setup ;)

Honza
> 
> -Andi
> 


Re: [PATCH] Try LTO partial linking. (Was: Speed of compiling gimple-match.c)

2021-05-20 Thread Jan Hubicka
> On Thu, May 20, 2021 at 3:16 PM Richard Biener
>  wrote:
> >
> > On Thu, May 20, 2021 at 3:06 PM Martin Liška  wrote:
> > >
> > > On 5/20/21 2:54 PM, Richard Biener wrote:
> > > > So why did you go from applying this per-file to multiple files?
> > >
> > > When I did per-file for {gimple,generic}-match.c I hit the following 
> > > issue with lto.priv symbols:
> > >
> > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: 
> > > error: libbackend.a(generic-match.o): multiple definition of 
> > > 'wi::to_wide(tree_node const*) [clone .part.0] [clone .lto_priv.0]'
> > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: 
> > > libbackend.a(gimple-match.o): previous definition here
> > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: 
> > > error: libbackend.a(generic-match.o): multiple definition of 
> > > 'TYPE_VECTOR_SUBPARTS(tree_node const*) [clone .part.0] [clone 
> > > .lto_priv.0]'
> > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: 
> > > libbackend.a(gimple-match.o): previous definition here
> > > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: 
> > > error: libbackend.a(generic-match.o): multiple definition of 
> > > 'vec::operator[](unsigned int) [clone 
> > > .part.0] [clone .lto_priv.0]'
> > >
> > > Any idea what was I doing wrong?
> >
> > Nothing in particular I think - you're just hitting the issue that LTO
> > produces new symbols and that those
> > can obviously clash.  Giuliano hit the very same issue.  When not
> > doing partial links those internal
> > symbols pose no problem, but with -r -flinker-output=nolto-rel and
> > re-linking the produced objects
> > they obviously do.  ELF has no solution for this though, but I think
> > we could strip those from the
> > partially linked object - if WPA would give us a list of objects the
> > link step could postprocess
> > the object with objcopy or maybe a custom linker script could do the
> > trick as well.
> 
> Oh, and the "best" solution would be to avoid involving the linker
> when doing -r -flinker-output=nolto-rel but instead have the assembler
> produce the single object from the multiple LTRANS assembly snippets
> which could then use local labels instead of symbols for these.

A quick solution is to also modify the partitioner to use the local symbol
names (those mixing in the source file name and random seeds) when doing
incremental linking, to avoid the clashes.

Honza
> 
> > So your workaround is to only ever have a single LTO produced object
> > file participating in the
> > final links ;)
> >
> > Richard.
> >
> > >
> > > Martin


Re: State of AutoFDO in GCC

2021-04-23 Thread Jan Hubicka
> On Fri, Apr 23, 2021 at 12:18 AM Martin Liška  wrote:
> 
> > On 4/23/21 9:00 AM, Richard Biener via Gcc wrote:
> > > On Fri, Apr 23, 2021 at 7:28 AM Xinliang David Li via Gcc
> > >  wrote:
> > >>
> > >> Hi, the create_gcov tool was probably removed with the assumption that
> > it
> > >> was only used with Google GCC branch, but it is actually used with GCC
> > >> trunk as well.
> > >>
> > >> Given that, the tool will be restored in the github repo. It seems to
> > build
> > >> and work fine with the regression test.
> > >>
> > >> The tool may ust work as it is right now, but there is no guarantee it
> > >> won't break in the future unless someone in the GCC community tries to
> > >> maintain it.
> >
> > Hi.
> >
> > The current situation is that AutoFDO doesn't work with pretty simple
> > test-cases
> > we have in testsuite:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71672
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81379
> >
> > These are ~5 years old and nothing has happened.

The basic functionality is to identify hot parts of programs and give a basic
idea of branch probabilities.  The bugs above are about other optimizations
(indirect inlining, hot-cold partitioning and peeling).  The first one ought
to be working, but indirect call profiling still accounts for a relatively
minor portion of the FDO benefits, so auto-FDO should be usable even without
these working if the tooling were fixed.  Expecting hot-cold partitioning and
peeling to work reliably for testcases designed for normal FDO is IMO a bit
unrealistic.

Perf is not limited to x86, so we should get basic usability for all
major architectures.  David, what CPUs are supported with auto-FDO on
the LLVM side?
> 
> This is a major feature in Clang. Deprecating it in GCC will be unfortunate.

I also like the overall idea of auto-FDO, I just never had much time to get
it into shape.  Every stage1 I try to fix something that has been broken
for a while (like ICF last stage1), so one option would be to focus on
AutoFDO this stage1, even though I have quite a few other things
on my TODO list.  I never looked into the tools translating perf data to
a GCC-compatible format and how easy it would be to keep them alive.

David, how does the perf->LLVM path work these days?

Honza
> 
> David
> 
> >
> > Thoughts?
> > Martin
> >
> > >
> > > Having the tool third-party makes keeping the whole chain working more
> > > difficult.
> > >
> > > Richard.
> > >
> > >> Thanks,
> > >>
> > >> David
> > >>
> > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka  wrote:
> > >>
> > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote:
> > >>>>> GCC documentation for AutoFDO points to create_gcov tool that
> > converts
> > >>> perf.data file into gcov format that can be consumed by gcc with
> > >>> -fauto-profile (
> > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html,
> > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
> > >>>>>
> > >>>>> I noticed that the source code for create_gcov has been deleted from
> > >>> https://github.com/google/autofdo on April 7. I asked about that
> > change
> > >>> in that repo and got the following reply:
> > >>>>>
> > >>>>> https://github.com/google/autofdo/pull/107#issuecomment-819108738
> > >>>>>
> > >>>>> "Actually we didn't use create_gcov and havn't updated create_gcov
> > for
> > >>> years, and we also didn't have enough tests to guarantee it works (It
> > was
> > >>> gcc-4.8 when we used and verified create_gcov). If you need it, it is
> > >>> welcomed to update create_gcov and add it to the respository."
> > >>>>>
> > >>>>> Does this mean that AutoFDO is currently dead in gcc?
> > >>>>
> > >>>> Hello.
> > >>>>
> > >>>> Yes. I know that even basic test cases have been broken for years in
> > the
> > >>> GCC.
> > >>>> It's new to me that create_gcov was removed.
> > >>>>
> > >>>> I tend to send patch to GCC that will remove AutoFDO from GCC.
> > >>>> I known Bin spent some time working on AutoFDO, has he came up to
> > >>> something?
> > >>>
> > >>> The GCC side of auto-FDO is not that hard.  We have most of
> > >>> infrastructure in place, but stopping point for me was always
> > difficulty
> > >>> to get gcov-tool working.  If some maintainer steps up, I think I can
> > >>> fix GCC side.
> > >>>
> > >>> I am bit unsure how important feature it is - we have FDO that works
> > >>> quite well for most users but I know there are some users of the LLVM
> > >>> implementation and there is potential to tie this with other hardware
> > >>> events to asist i.e. if conversion (where one wants to know how well
> > CPU
> > >>> predicts the jump rather than just the jump probability) which I always
> > >>> found potentially interesting.
> > >>>
> > >>> Honza
> > >>>>
> > >>>> Martin
> > >>>>
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Eugene
> > >>>>>
> > >>>>
> > >>>
> >
> >


Re: State of AutoFDO in GCC

2021-04-23 Thread Jan Hubicka
> 
> It uses create_llvm_prof tool which is in the same git repo. The data
> parsing part is shared with create_gcov, but the writer is obviously
> different for the two tools.

OK, and what are the main differences between the LLVM and GCC formats?

Honza
> 
> David
> 
> 
> > Honza
> > >
> > > David
> > >
> > > >
> > > > Thoughts?
> > > > Martin
> > > >
> > > > >
> > > > > Having the tool third-party makes keeping the whole chain working
> > more
> > > > > difficult.
> > > > >
> > > > > Richard.
> > > > >
> > > > >> Thanks,
> > > > >>
> > > > >> David
> > > > >>
> > > > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka  wrote:
> > > > >>
> > > > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote:
> > > > >>>>> GCC documentation for AutoFDO points to create_gcov tool that
> > > > converts
> > > > >>> perf.data file into gcov format that can be consumed by gcc with
> > > > >>> -fauto-profile (
> > > > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html,
> > > > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
> > > > >>>>>
> > > > >>>>> I noticed that the source code for create_gcov has been deleted
> > from
> > > > >>> https://github.com/google/autofdo on April 7. I asked about that
> > > > change
> > > > >>> in that repo and got the following reply:
> > > > >>>>>
> > > > >>>>>
> > https://github.com/google/autofdo/pull/107#issuecomment-819108738
> > > > >>>>>
> > > > >>>>> "Actually we didn't use create_gcov and havn't updated
> > create_gcov
> > > > for
> > > > >>> years, and we also didn't have enough tests to guarantee it works
> > (It
> > > > was
> > > > >>> gcc-4.8 when we used and verified create_gcov). If you need it, it
> > is
> > > > >>> welcomed to update create_gcov and add it to the respository."
> > > > >>>>>
> > > > >>>>> Does this mean that AutoFDO is currently dead in gcc?
> > > > >>>>
> > > > >>>> Hello.
> > > > >>>>
> > > > >>>> Yes. I know that even basic test cases have been broken for years
> > in
> > > > the
> > > > >>> GCC.
> > > > >>>> It's new to me that create_gcov was removed.
> > > > >>>>
> > > > >>>> I tend to send patch to GCC that will remove AutoFDO from GCC.
> > > > >>>> I known Bin spent some time working on AutoFDO, has he came up to
> > > > >>> something?
> > > > >>>
> > > > >>> The GCC side of auto-FDO is not that hard.  We have most of
> > > > >>> infrastructure in place, but stopping point for me was always
> > > > difficulty
> > > > >>> to get gcov-tool working.  If some maintainer steps up, I think I
> > can
> > > > >>> fix GCC side.
> > > > >>>
> > > > >>> I am bit unsure how important feature it is - we have FDO that
> > works
> > > > >>> quite well for most users but I know there are some users of the
> > LLVM
> > > > >>> implementation and there is potential to tie this with other
> > hardware
> > > > >>> events to asist i.e. if conversion (where one wants to know how
> > well
> > > > CPU
> > > > >>> predicts the jump rather than just the jump probability) which I
> > always
> > > > >>> found potentially interesting.
> > > > >>>
> > > > >>> Honza
> > > > >>>>
> > > > >>>> Martin
> > > > >>>>
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>>
> > > > >>>>> Eugene
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > >
> > > >
> >


Re: State of AutoFDO in GCC

2021-04-23 Thread Jan Hubicka
> Hi.
> 
> The current situation is that AutoFDO doesn't work with pretty simple 
> test-cases
> we have in testsuite:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71672
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81379
> 
> These are ~5 years old and nothing has happened.
> 
> I'm pretty sure the current autofdo can't emit a .gcda file format that
> I've changed in the recent years.

I believe that the gcda files consumed by auto-FDO do not really share
the normal gcov file format, just the low-level I/O.
See read_profile in auto-profile.c.

Honza
> 
> > 
> > I think if we want to keep the feature it makes sense to provide create_gcov
> > functionality either directly from perf (input data producer) or from gcc
> > (data consumer).  Of course I have no idea about its complexity, license
> > or implementation language ...
> 
> For me, it's just an i386 feature (maybe aarch64 has perf counters too?), 
> supported
> only by vendor (Intel) and I'm not planning working on that.
> I don't like having a feature that is obviously broken and potential GCC 
> users get
> bad experience every time they try to use it.
> 
> Can we at least deprecate the feature for GCC 11? If these is enough interest,
> we can fix it, if not, I would remove it in GCC 13 timeframe.
> 
> Thoughts?
> Martin
> 
> > 
> > Having the tool third-party makes keeping the whole chain working more
> > difficult.
> > 
> > Richard.
> > 
> >> Thanks,
> >>
> >> David
> >>
> >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka  wrote:
> >>
> >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote:
> >>>>> GCC documentation for AutoFDO points to create_gcov tool that converts
> >>> perf.data file into gcov format that can be consumed by gcc with
> >>> -fauto-profile (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html,
> >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
> >>>>>
> >>>>> I noticed that the source code for create_gcov has been deleted from
> >>> https://github.com/google/autofdo on April 7. I asked about that change
> >>> in that repo and got the following reply:
> >>>>>
> >>>>> https://github.com/google/autofdo/pull/107#issuecomment-819108738
> >>>>>
> >>>>> "Actually we didn't use create_gcov and havn't updated create_gcov for
> >>> years, and we also didn't have enough tests to guarantee it works (It was
> >>> gcc-4.8 when we used and verified create_gcov). If you need it, it is
> >>> welcomed to update create_gcov and add it to the respository."
> >>>>>
> >>>>> Does this mean that AutoFDO is currently dead in gcc?
> >>>>
> >>>> Hello.
> >>>>
> >>>> Yes. I know that even basic test cases have been broken for years in the
> >>> GCC.
> >>>> It's new to me that create_gcov was removed.
> >>>>
> >>>> I tend to send patch to GCC that will remove AutoFDO from GCC.
> >>>> I known Bin spent some time working on AutoFDO, has he came up to
> >>> something?
> >>>
> >>> The GCC side of auto-FDO is not that hard.  We have most of
> >>> infrastructure in place, but stopping point for me was always difficulty
> >>> to get gcov-tool working.  If some maintainer steps up, I think I can
> >>> fix GCC side.
> >>>
> >>> I am bit unsure how important feature it is - we have FDO that works
> >>> quite well for most users but I know there are some users of the LLVM
> >>> implementation and there is potential to tie this with other hardware
> >>> events to asist i.e. if conversion (where one wants to know how well CPU
> >>> predicts the jump rather than just the jump probability) which I always
> >>> found potentially interesting.
> >>>
> >>> Honza
> >>>>
> >>>> Martin
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Eugene
> >>>>>
> >>>>
> >>>
> 


Re: State of AutoFDO in GCC

2021-04-23 Thread Jan Hubicka
> On Fri, Apr 23, 2021 at 10:27 AM Xinliang David Li 
> wrote:
> 
> >
> >
> > On Fri, Apr 23, 2021 at 10:16 AM Jan Hubicka  wrote:
> >
> >> >
> >> > It uses create_llvm_prof tool which is in the same git repo. The data
> >> > parsing part is shared with create_gcov, but the writer is obviously
> >> > different for the two tools.
> >>
> >> OK and what are the main differences between llvmand gcc format?
> >>
> >> GCC expects GCOV format, I think while LLVM uses a newly designed binary
> > format.
> >
> >
Sorry, I missed your next reply.  I forgot about the details of GCC's format.

Thanks for the explanation.  How hard would it be to make GCC consume this
format?  Is it reasonably stable and documented somewhere?

Does LLVM's auto-FDO support non-Intel CPUs these days?

Honza
> 
> David
> >
> >
> >> Honza
> >> >
> >> > David
> >> >
> >> >
> >> > > Honza
> >> > > >
> >> > > > David
> >> > > >
> >> > > > >
> >> > > > > Thoughts?
> >> > > > > Martin
> >> > > > >
> >> > > > > >
> >> > > > > > Having the tool third-party makes keeping the whole chain
> >> working
> >> > > more
> >> > > > > > difficult.
> >> > > > > >
> >> > > > > > Richard.
> >> > > > > >
> >> > > > > >> Thanks,
> >> > > > > >>
> >> > > > > >> David
> >> > > > > >>
> >> > > > > >> On Thu, Apr 22, 2021 at 3:29 PM Jan Hubicka 
> >> wrote:
> >> > > > > >>
> >> > > > > >>>> On 4/22/21 9:58 PM, Eugene Rozenfeld via Gcc wrote:
> >> > > > > >>>>> GCC documentation for AutoFDO points to create_gcov tool
> >> that
> >> > > > > converts
> >> > > > > >>> perf.data file into gcov format that can be consumed by gcc
> >> with
> >> > > > > >>> -fauto-profile (
> >> > > > > https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html,
> >> > > > > >>> https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
> >> > > > > >>>>>
> >> > > > > >>>>> I noticed that the source code for create_gcov has been
> >> deleted
> >> > > from
> >> > > > > >>> https://github.com/google/autofdo on April 7. I asked about
> >> that
> >> > > > > change
> >> > > > > >>> in that repo and got the following reply:
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > https://github.com/google/autofdo/pull/107#issuecomment-819108738
> >> > > > > >>>>>
> >> > > > > >>>>> "Actually we didn't use create_gcov and havn't updated
> >> > > create_gcov
> >> > > > > for
> >> > > > > >>> years, and we also didn't have enough tests to guarantee it
> >> works
> >> > > (It
> >> > > > > was
> >> > > > > >>> gcc-4.8 when we used and verified create_gcov). If you need
> >> it, it
> >> > > is
> >> > > > > >>> welcomed to update create_gcov and add it to the respository."
> >> > > > > >>>>>
> >> > > > > >>>>> Does this mean that AutoFDO is currently dead in gcc?
> >> > > > > >>>>
> >> > > > > >>>> Hello.
> >> > > > > >>>>
> >> > > > > >>>> Yes. I know that even basic test cases have been broken for
> >> years
> >> > > in
> >> > > > > the
> >> > > > > >>> GCC.
> >> > > > > >>>> It's new to me that create_gcov was removed.
> >> > > > > >>>>
> >> > > > > >>>> I tend to send patch to GCC that will remove AutoFDO from
> >> GCC.
> >> > > > > >>>> I known Bin spent some time working on AutoFDO, has he came
> >> up to
> >> > > > > >>> something?
> >> > > > > >>>
> >> > > > > >>> The GCC side of auto-FDO is not that hard.  We have most of
> >> > > > > >>> infrastructure in place, but stopping point for me was always
> >> > > > > difficulty
> >> > > > > >>> to get gcov-tool working.  If some maintainer steps up, I
> >> think I
> >> > > can
> >> > > > > >>> fix GCC side.
> >> > > > > >>>
> >> > > > > >>> I am bit unsure how important feature it is - we have FDO that
> >> > > works
> >> > > > > >>> quite well for most users but I know there are some users of
> >> the
> >> > > LLVM
> >> > > > > >>> implementation and there is potential to tie this with other
> >> > > hardware
> >> > > > > >>> events to asist i.e. if conversion (where one wants to know
> >> how
> >> > > well
> >> > > > > CPU
> >> > > > > >>> predicts the jump rather than just the jump probability)
> >> which I
> >> > > always
> >> > > > > >>> found potentially interesting.
> >> > > > > >>>
> >> > > > > >>> Honza
> >> > > > > >>>>
> >> > > > > >>>> Martin
> >> > > > > >>>>
> >> > > > > >>>>>
> >> > > > > >>>>> Thanks,
> >> > > > > >>>>>
> >> > > > > >>>>> Eugene
> >> > > > > >>>>>
> >> > > > > >>>>
> >> > > > > >>>
> >> > > > >
> >> > > > >
> >> > >
> >>
> >

