Re: [Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-06-25 Thread Jan Hubicka via Gcc-bugs
> different issue from the one that is raised in the PR.  (Unless we think that
> -O2 and -O3 should always have the same inlining heuristics henceforward, but
> that seems unlikely.)

Yes, I think the point of -O3 is to let the compiler be more aggressive than
what seems desirable for your average distro build defaults (which need
to balance speed and size).
> 
> At the moment, -O3 is essentially -O2 + some -f options + some --param 
> options.
>  Users who want to pick & choose some of the -f options can do so, and can add
> them to stable build systems.  Normally, obsolete -f options are turned into
> no-ops rather than removed.  But users can't pick & choose the --params, and
> add them to stable build systems, because we reserve the right to remove
> --params without warning.

Moreover those --params are slowly changing their meaning over time.  I
need to retune the inliner whenever early inlining gets smarter.
> 
> So IMO, we should have an -f option that represents “the inlining parameters
> enabled by -O3”, whatever they happen to be for a given release.  It's OK if
> the set is empty.
> 
> For such a change, it doesn't really matter whether the current --params are
> the right ones.  It just matters that the --params are the ones that we
> currently use.  If the --params are changed later, the -f option and -O3 will
> automatically stay in sync.

I am trying to understand how useful this is.  I am basically worried
about two things:
 1) we have other optimization passes that behave differently at -O2 and
-O3 (vectorizer, unrolling etc.) and I think we may want to have
more. We also have -Os and -O1.

So perhaps we want a more systematic solution. We already have
-fvect-cost-model, which is kind of the vectorizer's version of the proposed
inliner option.
 2) the inliner is already quite painful to tune.  Especially since
 one really needs to benchmark packages significantly bigger than
 SPEC, which tend to be a bit hard to set up and benchmark
 meaningfully. I usually do at least Firefox and clang, where the
 first is always quite some work to get working well with the latest
 GCC. With SUSE's LNT we also run "C++ benchmarks" which were
 initially collected as a kind of inliner test with higher
 abstraction penalty (tramp3d etc.).

 For many years I benchmarked primarily -O3 and -O3 + profile
 feedback on x86-64 only, with an occasional look at -O2 and -Os
 behaviour, which were generally more stable.
 I also tested other targets (POWER and aarch64), but just
 sporadically, which is not good.

 After GCC 5 I doubled testing to include both LTO/non-LTO variants.
 Since GCC 10 -O2 started to evolve and needed re-testing too
 (LTO/non-LTO). One metric I know I ought to tune is -O2 -flto with
 FDO, which used to be essentially -O3 before the optimization-level
 --params were introduced; now -O2 + FDO inlining is more
 conservative, which hurts, for example, profile-bootstrapped GCC.

 So naturally I am a bit worried about introducing even more combinations
 that need testing and maintenance.  If we add a user-friendly way to
 tweak this, we also make a promise to keep it sane.


Re: [Bug c++/110137] implement clang -fassume-sane-operator-new

2024-06-04 Thread Jan Hubicka via Gcc-bugs
> Is the option supposed to be only about the standard global scope operator
> new/delete (_Znam etc.) or also user operator new/delete class methods?  If 
> the
> former, then I agree it is a global property (or at least a per shared
> library/binary property, one can arrange stuff with symbol visibility etc.).
> Otherwise it is a property of whatever operator new/delete you call.
> I think DECL_IS_OPERATOR_{NEW,DELETE} is set on both of these, the standard
> ones have
> also DECL_IS_REPLACEABLE_OPERATOR flag on them.
> Anyway, handling this just in IRA doesn't seem to be enough, surely it should
> be also handled during alias analysis etc.

I can add the TBAA and IPA bits (I actually have a WIP patch for that
already).  But there are also __builtin_operator_new and
__builtin_operator_delete, which tag the sanity of new/delete with per-call
sensitivity (since calls originating from the std library are more opaque
than direct calls done by users).  So we need per-call sensitivity
anyway, which can be done by an alternative decl or a flag in gimple_call.

If the user is crazy enough to do fancy tricks in new/delete, I think it may
be controlled by program state or so (so for some code new/delete may be
sane while for other code it does not need to be).

So having this working well with LTO is IMO a reasonable thing to do from
a QOI point of view...


Re: [Bug libstdc++/109442] Dead local copy of std::vector not removed from function

2024-05-14 Thread Jan Hubicka via Gcc-bugs
This patch attempts to add __builtin_operator_new/delete. So far they
are not optimized; that will need to be done by an extra flag on the BUILT_IN_
code.  Also the decl.cc code can be refactored to be less cut-and-paste,
and I guess the has_builtin hack to return the proper value needs to be moved
to the C++ FE.

However the immediate problem I ran into is that the libstdc++ testsuite
fails due to lack of the std::nothrow overrides.  I wonder how to get that
working?
diff --git a/gcc/c-family/c-lex.cc b/gcc/c-family/c-lex.cc
index ff5ce2bf729..602b097059c 100644
--- a/gcc/c-family/c-lex.cc
+++ b/gcc/c-family/c-lex.cc
@@ -533,6 +533,10 @@ c_common_has_builtin (cpp_reader *pfile)
   if (!name)
 return 0;
 
+  if (!strcmp (name, "__builtin_operator_new")
+  || !strcmp (name, "__builtin_operator_delete"))
+return 201802L;
+
   return names_builtin_p (name);
 }
 
diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 3fc8835154d..90b100ca3dc 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -59,6 +59,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "omp-general.h"
 #include "omp-offload.h"  /* For offload_vars.  */
 #include "opts.h"
+#include "print-tree.h"
 #include "langhooks-def.h"  /* For lhd_simulate_record_decl  */
 
 /* Possible cases of bad specifiers type used by bad_specifiers. */
@@ -5048,13 +5049,27 @@ cxx_init_decl_processing (void)
 DECL_IS_MALLOC (opnew) = 1;
 DECL_SET_IS_OPERATOR_NEW (opnew, true);
 DECL_IS_REPLACEABLE_OPERATOR (opnew) = 1;
+tree builtin_opnew = build_cp_library_fn (get_identifier("__builtin_operator_new"),
+ NEW_EXPR, newtype, 0);
+DECL_IS_MALLOC (builtin_opnew) = 1;
+DECL_SET_IS_OPERATOR_NEW (builtin_opnew, true);
+DECL_IS_REPLACEABLE_OPERATOR (builtin_opnew) = 1;
+SET_DECL_ASSEMBLER_NAME (builtin_opnew, DECL_ASSEMBLER_NAME (opnew));
+pushdecl (builtin_opnew);
 opnew = push_cp_library_fn (VEC_NEW_EXPR, newtype, 0);
 DECL_IS_MALLOC (opnew) = 1;
 DECL_SET_IS_OPERATOR_NEW (opnew, true);
 DECL_IS_REPLACEABLE_OPERATOR (opnew) = 1;
+
 tree opdel = push_cp_library_fn (DELETE_EXPR, deltype, ECF_NOTHROW);
 DECL_SET_IS_OPERATOR_DELETE (opdel, true);
 DECL_IS_REPLACEABLE_OPERATOR (opdel) = 1;
+tree builtin_opdel = build_cp_library_fn (get_identifier("__builtin_operator_delete"),
+ DELETE_EXPR, deltype, ECF_NOTHROW);
+DECL_SET_IS_OPERATOR_DELETE (builtin_opdel, true);
+DECL_IS_REPLACEABLE_OPERATOR (builtin_opdel) = 1;
+SET_DECL_ASSEMBLER_NAME (builtin_opdel, DECL_ASSEMBLER_NAME (opdel));
+pushdecl (builtin_opdel);
 opdel = push_cp_library_fn (VEC_DELETE_EXPR, deltype, ECF_NOTHROW);
 DECL_SET_IS_OPERATOR_DELETE (opdel, true);
 DECL_IS_REPLACEABLE_OPERATOR (opdel) = 1;
@@ -5072,6 +5087,12 @@ cxx_init_decl_processing (void)
opdel = push_cp_library_fn (DELETE_EXPR, deltype, ECF_NOTHROW);
DECL_SET_IS_OPERATOR_DELETE (opdel, true);
DECL_IS_REPLACEABLE_OPERATOR (opdel) = 1;
+   builtin_opdel = build_cp_library_fn (get_identifier("__builtin_operator_delete"),
+DELETE_EXPR, deltype, ECF_NOTHROW);
+   DECL_SET_IS_OPERATOR_DELETE (builtin_opdel, true);
+   DECL_IS_REPLACEABLE_OPERATOR (builtin_opdel) = 1;
+   SET_DECL_ASSEMBLER_NAME (builtin_opdel, DECL_ASSEMBLER_NAME (opdel));
+   pushdecl (builtin_opdel);
opdel = push_cp_library_fn (VEC_DELETE_EXPR, deltype, ECF_NOTHROW);
DECL_SET_IS_OPERATOR_DELETE (opdel, true);
DECL_IS_REPLACEABLE_OPERATOR (opdel) = 1;
@@ -5094,6 +5115,13 @@ cxx_init_decl_processing (void)
DECL_IS_MALLOC (opnew) = 1;
DECL_SET_IS_OPERATOR_NEW (opnew, true);
DECL_IS_REPLACEABLE_OPERATOR (opnew) = 1;
+   builtin_opnew = build_cp_library_fn (get_identifier("__builtin_operator_new"),
+VEC_NEW_EXPR, newtype, 0);
+   DECL_IS_MALLOC (builtin_opnew) = 1;
+   DECL_SET_IS_OPERATOR_NEW (builtin_opnew, true);
+   DECL_IS_REPLACEABLE_OPERATOR (builtin_opnew) = 1;
+   SET_DECL_ASSEMBLER_NAME (builtin_opnew, DECL_ASSEMBLER_NAME (opnew));
+   pushdecl (builtin_opnew);
opnew = push_cp_library_fn (VEC_NEW_EXPR, newtype, 0);
DECL_IS_MALLOC (opnew) = 1;
DECL_SET_IS_OPERATOR_NEW (opnew, true);
@@ -5107,6 +5135,12 @@ cxx_init_decl_processing (void)
opdel = push_cp_library_fn (DELETE_EXPR, deltype, ECF_NOTHROW);
DECL_SET_IS_OPERATOR_DELETE (opdel, true);
DECL_IS_REPLACEABLE_OPERATOR (opdel) = 1;
+   builtin_opdel = build_cp_library_fn (get_identifier("__builtin_operator_delete"),
+DELETE_EXPR, deltype, ECF_NOTHROW);
+   DECL_SET_IS_OPERATOR_DELETE (builtin_opdel, true);
+   DECL_IS_REPLACEABLE_OPERATOR (builtin_opdel) = 1;
+   SET_DECL_ASSEMBLER_NAME (builtin_opdel, DECL_ASSEMBLER_NAME (opdel));
+   pushdecl 

Re: [Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-04-09 Thread Jan Hubicka via Gcc-bugs
There is still a problem with loop bounds.  I am testing a patch for that, and
then we should (finally) be safe.


Re: [Bug ipa/114262] Over-inlining when optimizing for size with gnu_inline function

2024-03-07 Thread Jan Hubicka via Gcc-bugs
> Note GCC has not retuned its -Os heuristics for a long time because it has been
> decent enough for most folks and corner cases like this almost never come
> up.
There were quite a few changes to -Os heuristics :)
One of bigger challenges is that we do see more and more C++ code built
with -Os which relies on certain functions to be inlined and optimized
in context, so we had to get more optimistic in a hope that inlined code
will optimize well.

COMDAT functions are more likely inlined because statistics show that
many of them are not really shared between translation units
(see the -param=comdat-sharing-probability parameter). This was necessary to
get reasonable code for Firefox approx 15 years ago.


Re: [Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread Jan Hubicka via Gcc-bugs
Looking at the prototype patch, why do we need to change the splitters too?

My original goal was to use splitters to expand to faster code sequences
while having patterns necessary for both variants.  This makes it
possible to use optimize_insn_for_size/speed and make decisions using BB
profile, since we will not ICE if the hotness of BB changes later.


Re: [Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread Jan Hubicka via Gcc-bugs
> > I guess PTA gets around by tracking points-to set also for non-pointer
> > types and consequently it also gives up on any such addition.
> 
> It does.  But note it does _not_ for POINTER_PLUS where it treats
> the offset operand as non-pointer.
> 
> > I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset. It accepts
> > pointer_plus expression, but does not look through POINTER_PLUS.
> > We can restrict it further, but tracking base pointer is quite useful,
> > so it would be nice to not give up completely.
> 
> It looks like that function might treat that
> 
>  ADDR_EXPR <TARGET_MEM_REF <...>>
> 
> as integer_zerop base.  It does
> 
>   if (TREE_CODE (op) == ADDR_EXPR) 
> {
>   poly_int64 extra_offset = 0; 
>   tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
>  &extra_offset);
>   if (!base)
> {
>   base = get_base_address (TREE_OPERAND (op, 0));
>   if (TREE_CODE (base) != MEM_REF)
> break;
>   offset_known = false;
> }
>   else
> {
>   if (TREE_CODE (base) != MEM_REF)
> break;
> 
> with a variable offset we fall to the TREE_CODE (base) != MEM_REF
> and will have offset_known == true.  Not sure what it does with
> the result though (it's not the address of a decl).
> 
> This function seems to oddly special-case != MEM_REF ... (maybe
> it wants to hande DECL_P () as finishing?

Hmm, the function was definitely not written with TARGET_MEM_REF in mind,
since it was originally used for IPA passes only.
We basically want to handle stuff like
 &decl->foo
or
 &(ptr->foo)
In the second case we want to continue the SSA walk to hopefully work
out the origin of PTR.
ipa-modref then looks whether the base pointer is derived from a function
parameter or points to local or readonly memory to produce its summary.
> 
> Note get_addr_base_and_unit_offset will return NULL for
> a TARGET_MEM_REF <..., offset> but TARGET_MEM_REF
> itself if the base isn't an ADDR_EXPR, irrespective of whether
> the offset within it is constant or not.

Hmm, interesting.  I would expect it to interpret the semantics of the TMR
and return the base.
> 
> Not sure if the above is a problem, but it seems the only caller
> will just call points_to_local_or_readonly_memory_p on the
> ADDR_EXPR where refs_local_or_readonly_memory_p via
> points_to_local_or_readonly_memory_p will eventually do
> 
>   /* See if memory location is clearly invalid.  */
>   if (integer_zerop (t))
> return flag_delete_null_pointer_checks;
> 
> and that might be a problem.  As said, we rely on
> ADDR_EXPR <TARGET_MEM_REF <...>> to be an address computation
> that's not subject to strict interpretation to allow IVOPTs
> doing this kind of optimization w/o introducing some kind of
> INTEGER_LEA <...>.  I know that's a bit awkward but we should
> make sure this is honored by IPA as well.
> 
> I'd say
> 
> diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> index 74c9b4e1d1e..45a770cf940 100644
> --- a/gcc/ipa-fnsummary.cc
> +++ b/gcc/ipa-fnsummary.cc
> @@ -2642,7 +2642,8 @@ points_to_local_or_readonly_memory_p (tree t)
> return true;
>return !ptr_deref_may_alias_global_p (t, false);
>  }
> -  if (TREE_CODE (t) == ADDR_EXPR)
> +  if (TREE_CODE (t) == ADDR_EXPR
> +  && TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF)
>  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
>return false;
>  }
> 
> might eventually work?  Alternatively a bit less aggressive like
> the following.
> 
> diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> index 74c9b4e1d1e..7c79adf6440 100644
> --- a/gcc/ipa-fnsummary.cc
> +++ b/gcc/ipa-fnsummary.cc
> @@ -2642,7 +2642,9 @@ points_to_local_or_readonly_memory_p (tree t)
> return true;
>return !ptr_deref_may_alias_global_p (t, false);
>  }
> -  if (TREE_CODE (t) == ADDR_EXPR)
> +  if (TREE_CODE (t) == ADDR_EXPR
> +  && (TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF
> + || TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) != INTEGER_CST))
>  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
>return false;
>  }

Yes, both of those look reasonable to me; perhaps the less aggressive one
would be better.
> 
> A "nicer" solution might be to add a informational operand
> to TARGET_MEM_REF, representing the base pointer to be used for
> alias/points-to purposes.  But if that's not invariant it might
> keep some otherwise unnecessary definition stmts live.

Yep, I see that forcing an extra IV to track the original semantics would be
trouble here.  I think that after ivopts we should be done with fancier
propagation across loops.

However, to avoid ipa-modref summary degradation, perhaps scheduling the
pass before ivopts would make sense...

Thanks,
Honza


Re: [Bug target/113233] LoongArch: target options from LTO objects not respected during linking

2024-01-04 Thread Jan Hubicka via Gcc-bugs
> Confirm.  But option save/restore has been always implemented:
> 
> .section.gnu.lto_.opts,"",@progbits
> .ascii  "'-fno-openmp' '-fno-openacc' '-fno-pie' '-fcf-protection"
> .ascii  "=none' '-mabi=lp64d' '-march=loongarch64' '-mfpu=64' '-m"
> .ascii  "simd=lasx' '-mcmodel=normal' '-mtune=loongarch64' '-flto"
> .ascii  "'\000"
> 
> So -msimd=lasx is correctly recorded.  Not sure why it does not work.

With LTO we need to mix code compiled with different sets of options.
For this reason we imply, for every function definition, optimization
and target attributes which record the flags.  So it seems the target
attribute is likely broken for this flag.


Re: [Bug middle-end/111088] useless 'xor eax,eax' inserted when a value is not returned and icf

2023-08-21 Thread Jan Hubicka via Gcc-bugs
> But adds a return with a value. And then the inliner inlines foo into foo2 but
> we still have the return with a value around ...
I guess ICF can special-case an unused return value, but why is this not
taken care of by ipa-sra?


Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-08-01 Thread Jan Hubicka via Gcc-bugs
> > If I comment it out as above patch, then O3/PGO can get 16% and 12% 
> > performance
> > improvement compared to O2 on x86.
> >
> > O2  O3  PGO
> > cycles  2,497,674,824   2,104,993,224   2,199,753,593
> > instructions10,457,508,646  9,723,056,131   10,457,216,225
> > branches2,303,029,380   2,250,522,323   2,302,994,942
> > branch-misses   0.00%   0.01%   0.01%
> >
> > The main difference in the compilation output about code around the 
> > miss-prediction
> > branch is:
> >   o In O2: a predicated instruction (cmov here) is selected to eliminate the
> > above branch. cmov is truly better than a branch here.
> >   o In O3/PGO: bitout() is inlined into encode_file(), and a branch
> > instruction is selected. But this branch is obviously *unpredictable* and
> > the compiler doesn't know it. This is why O3/PGO are so bad for this
> > program.
> >
> > Gcc doesn't support __builtin_unpredictable() which has been introduced by 
> > llvm.
> > Then I tried to see if __builtin_expect_with_probability(e,x, 0.5) can 
> > serve the
> > same purpose. The result is negative.
> 
> But does it appear to be predictable with your profiling data?

Also one thing is that __builtin_expect and
__builtin_expect_with_probability only affect the static branch
prediction algorithm, so with profile feedback they are ignored on every
branch executed at least once during the train run.

Setting probability 0.5 is really not exactly the same as a hint that the
branch will be mispredicted, since modern CPUs handle regularly
behaving branches well (such as a branch firing every even iteration of a loop).

So I think having the builtin is not a bad idea.  I was thinking whether it
makes sense to represent it within the profile_probability type and I am
not convinced, since an "unpredictable probability" sounds conceptually
odd and we would need to keep the flag intact over all the probability
updates we do.  For things like loop exits we recompute probabilities
from frequencies after unrolling/vectorization and other things, and we
would need to invent a new API to propagate the flag from the previous
probability (which is not even part of the computation right now).

So I guess the challenge is how to pass this info down through the
optimization pipeline, since we would need to annotate gimple
conds/switches and carry it down to the RTL level.  On gimple we have flags
and on the RTL level notes, so there is space for it, but we would need to
maintain the info through CFG changes.

Auto-FDO may be an interesting way to detect such branches.

Honza
> 
> > I think we could come to a conclusion that there must be something can 
> > improve in
> > Gcc's heuristic strategy about Predicated Instructions and branches, at 
> > least
> > for O3 and PGO.
> >
> > And can we add __builtin_unpredictable() support for Gcc? As usually it's 
> > hard
> > for the compiler to detect unpredictable branches.
> >
> > --
> > Cheers,
> > Changbin Du


Re: [Bug tree-optimization/106293] [13/14 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022

2023-07-28 Thread Jan Hubicka via Gcc-bugs
> This heuristic wants to catch
> 
>   
>   if (foo) abort ();
>   
> 
> and avoid sinking "too far" across a path with "similar enough"
> execution count (I think the original motivation was to fix some
> spilling / register pressure issue).  The loop depth test
> should be !(bb_loop_depth (best_bb) < bb_loop_depth (early_bb))

I am still concerned that loop_depth (bb1) < loop_depth (bb2)
does not really imply that bb1 is not in a different loop nest with
a significantly higher iteration count than bb2...
> so we shouldn't limit sinking to a more outer nest.  As we rule
> out > before this becomes ==.
> 
> It looks tempting to sink to the earliest place with the same
> execution count rather than the latest but the above doesn't
> really achive that (it doesn't look "upwards" but simply fails).
> With a guessed profile it's also going to be hard.

A statistically guessed profile works quite well for things like placement
of spills in IRA (not perfectly of course) and this looks like a kind of
similar thing.  So perhaps it could work reasonably well...
> 
> And it in no way implements register pressure / spilling sensitivity
> (see also Ajits attempts at producing a patch that avoids sinking
> across a call).  All these are ultimatively doomed unless we at least
> consider a group of stmts together.

hmm, life is hard :)
Honza


Re: [Bug target/110758] [14 Regression] 8% hmmer regression on zen1/3 with -Ofast -march=native -flto between g:8377cf1bf41a0a9d (2023-07-05 01:46) and g:3a61ca1b9256535e (2023-07-06 16:56); g:d76d19c

2023-07-21 Thread Jan Hubicka via Gcc-bugs
> I suspect this is most likely the profile updates changes ...
Quite possibly. The goal of this exercise is to figure out whether there are
some bugs in the profile estimate, whether passes somehow prefer a broken
profile, or if it is just bad luck.

Looking at sphinx and fatigue, it seems that LRA really may prefer
the increased profile counts in the peeled vectorized loop, since it does not
understand that putting a spill on the critical path through the
dependency graph of the code is bad for out-of-order execution.

Re: [Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-28 Thread Jan Hubicka via Gcc-bugs
> 
> why disallow caller->indirect_calls?
See testcase in comment #9
> 
> > +   return false;
> > +  for (cgraph_edge *e2 = callee->callees; e2; e2 = e2->next_callee)
> 
> I don't think this flys - it looks quadratic.  Can we compute this
> in the inline summary once instead?

I guess I can place a cache there.  I think this check will become more
global over time, so IMO it fits better here.
> 
> As for indirect calls, can we maybe mark initial direct GIMPLE call
> stmts as "always-inline" and only look at that marking, thus an
> indirect call will never become "always-inline"?  Iff cgraph edges
> prevail during all early inlining we could mark call edges for
> this purpose?

I also think we need call-site-specific info.
Tagging gimple call statements and copying the info to cgraph edges will
probably be needed here.  We want to keep the info from early inlining
to late inlining since we output errors late.
We already have plenty of GF_CALL_ flags, so adding one should be easy?

Honza


Re: [Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-23 Thread Jan Hubicka via Gcc-bugs
Just so it is somewhere, here is a testcase showing that we can't inline leaf
functions into always_inlines unless we do some tracking of which calls
were formerly indirect calls.

We really overloaded always_inline from the original semantics of "drop
inlining heuristics" into "be sure that the result is inlined", while for
the second it does not make sense to take its address.
Clang apparently simply does not error on failed always_inlines, which
makes its life easier.

int n;
typedef void (*fnptr)();
fnptr get_me();
__attribute__ ((always_inline))
inline void test(void)
{
if (n < 10)
  (get_me())();
n++;
return;
}
fnptr get_me()
{
return test;
}
void
foo()
{
test();
}



Re: [Bug libstdc++/110287] _M_check_len is expensive

2023-06-19 Thread Jan Hubicka via Gcc-bugs
> 
> There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
> depends on the Allocator type, A. A user-defined allocator could have
> max_size() == 100.

If the inliner sees a path to the throw functions, it will not determine
_M_check_len as early inlinable.
Perhaps we can __builtin_constant_p it as well and check that
max_size () * sizeof ()
is close to PTRDIFF_MAX.

Thanks for the comments on the patches.  I will try to update the patch.

I was wondering about the allocators. As shown in the mail, optimizing
_M_check_len still leaves two independent throws for insanely large
ops.  Since the allocator is user-replaceable, I guess we cannot add a new
member function for safe_allocate or so.

We can use __builtin_unreachable to set the value range on the return
value.  For that to work during early optimizations we need

 1) extend early VRP to retrofit the value determined by
__builtin_unreachable to the SSA name defined earlier, based on the fact
that the execution cannot legally terminate in between
 2) teach the inliner to ignore conditionals guarding __builtin_unreachable
 3) add support for returned values to propagate the value range from
_M_check_len to _M_realloc_insert,
so it is correctly propagated to the allocator call.

This is not very easy, but can be generally useful elsewhere.


Re: [Bug c++/106943] GCC building clang/llvm with LTO flags causes ICE in clang

2023-05-12 Thread Jan Hubicka via Gcc-bugs
> > Indeed it is quite long time problem with clang not building with lifetime
> > DSE and strict aliasing.  I wonder why this is not fixed on clang side?
> 
> Because the problems were not communicated? I knew that Firefox needed
> -flifetime-dse=1, but it's the first time I hear that any such problems in
> Clang/LLVM were identified.
> 
> I could not find any workaround for lifetime-dse in SUSE spec file for llvm16.
> Are you saying it was known and worked around somehow? Or it is not 
> manifesting
> because LLVM is built without LTO?

I think the openSUSE package opts out of LTO probably for this reason.  I am
sometimes using LLVM as a benchmark of LTO and PGO, so it would be great
to have this enabled in the packages, but I had no time to do that so
far.  LLVM built with LTO and PGO builds quite a lot faster.  I was
filing a bug report for that some time ago and it seems that the bug report
linked above has quite a good analysis of what breaks.


Re: [Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread Jan Hubicka via Gcc-bugs
> 
> Do you mean we should fix modeling of divisions there as well? I don't have
> latency/throughput measurements for those CPUs, nor access so I can run
> experiments myself, unfortunately.
> 
> I guess you mean just making a patch to model division units separately,
> leaving latency/throughput as in current incorrect models, and leave it to
> manufacturers to correct it? Alternatively, for AMD Bobcat and Bulldozer we
> might be able to crowd-source it eventually.
Actually for older cores I think the manufacturers do not care much.  I
still have a working Bulldozer machine and I can do some testing.
I think in the Bulldozer case I was basing the latency/throughput on data in
Agner Fog's manuals.  How do you test it?
Honza


Re: [Bug middle-end/106078] Invalid loop invariant motion with non-call-exceptions

2022-06-25 Thread Jan Hubicka via Gcc-bugs
> > For this one it's PRE hoisting *b across the endless loop (PRE handles
> > calls as possibly not returning but not loops as possibly not 
> > terminating...)
> > So it's a different bug.
> 
> Btw, C++ requiring forward progress makes the testcase undefined.
In my understanding an access to a volatile variable is forward progress:
in a valid C++ program, every thread eventually does one of the
following:

   - terminates
   - makes a call to an I/O library function
   - performs an access through a volatile glvalue
   - performs an atomic operation or a synchronization operation

I think one can also replace the volatile access by atomics: we only need to
know the side effects of that operation.
Honza


Re: [Bug lto/105727] __builtin_constant_p expansion in LTO

2022-05-25 Thread Jan Hubicka via Gcc-bugs
> > My guess is that the
> > BUILD_BUG();
> > line is the sole thing that is wrong, it should be just break;
> > as the memory_is_poisoned_n(addr, size); will handle all the sizes,
> > regardless if they are constants or not.
> 
> Sure, I'm going to suggest such a change.
To me it looked like protection against size becoming large
(or perhaps the author wants to add extra special cases as they are needed)

Honza


Re: [Bug c/105728] New: dead store to static var not optimized out

2022-05-25 Thread Jan Hubicka via Gcc-bugs
> To me, all of these do the same thing and should generate the same code.
> As nobody else can see removeme, and we aren't leaking its address, shouldn't
> the compiler be able to deduce that all accesses to removeme are
> inconsequential and can be removed?
> 
> My gcc 11.3 generates a condition, a store and a return 0 for dummy1, the
> same thing for dummy2, but for dummy3 it understands that it only needs to
> emit a return 0.

GCC detects "write olny" variables and that is what matches for dummy3.
I am not 100% sure it is valid to do the optimization in other two cases
since when multiple threads are considered. In any case we lack tracking
of constants stored to global variables which is something ipa-cp can be
extended to.

Honza


Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread Jan Hubicka via Gcc-bugs
> I would say so.  It saves code size and also uop space unless the two
> can magically fuse to an immediate-to-%xmm move (I doubt that).
I made a simple benchmark:

double a = 10;
int
main ()
{
  long int i;
  double sum = 0, val1, val2, val3, val4;
  for (i = 0; i < 10; i++)
    {
#if 1
#if 1
      asm __volatile__ ("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0"
			: "=x" (val1) : : "r8", "xmm11");
      asm __volatile__ ("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0"
			: "=x" (val2) : : "r8", "xmm11");
      asm __volatile__ ("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0"
			: "=x" (val3) : : "r8", "xmm11");
      asm __volatile__ ("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0"
			: "=x" (val4) : : "r8", "xmm11");
#else
      asm __volatile__ ("movq %1, %%r8; vmovq %%r8, %0"
			: "=x" (val1) : "m" (a) : "r8", "xmm11");
      asm __volatile__ ("movq %1, %%r8; vmovq %%r8, %0"
			: "=x" (val2) : "m" (a) : "r8", "xmm11");
      asm __volatile__ ("movq %1, %%r8; vmovq %%r8, %0"
			: "=x" (val3) : "m" (a) : "r8", "xmm11");
      asm __volatile__ ("movq %1, %%r8; vmovq %%r8, %0"
			: "=x" (val4) : "m" (a) : "r8", "xmm11");
#endif
#else
      asm __volatile__ ("vmovq %1, %0" : "=x" (val1) : "m" (a) : "r8", "xmm11");
      asm __volatile__ ("vmovq %1, %0" : "=x" (val2) : "m" (a) : "r8", "xmm11");
      asm __volatile__ ("vmovq %1, %0" : "=x" (val3) : "m" (a) : "r8", "xmm11");
      asm __volatile__ ("vmovq %1, %0" : "=x" (val4) : "m" (a) : "r8", "xmm11");
#endif
      sum += val1 + val2 + val3 + val4;
    }
  return sum;
}

and indeed the third variant runs in 1.2s while the first two take an equal
2.4s on my Zen 2 laptop.


Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread Jan Hubicka via Gcc-bugs
> > According to znver2_cost
> > 
> > The cost of sse_to_integer is a little bit less than fp_store; maybe
> > increasing the sse_to_integer cost (above fp_store) can help RA to choose
> > memory instead of a GPR.
> 
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack.  As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
> 
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.

I was simply following latencies here, so indeed the reg<->mem bypass is
not really modelled.  I recall doing a few experiments which were kind
of inconclusive.


Re: [Bug tree-optimization/104203] [12 Regressions] huge compile-time regression since r12-6606-g9d6a0f388eb048f8

2022-01-24 Thread Jan Hubicka via Gcc-bugs
> Since the pass issues a bunch other warnings (e.g., -Wstringop-overflow,
> -Wuse-after-free, etc.) the gate doesn't seem right.  But since #pragma GCC
> diagnostic can re-enable warnings disabled by -w (or turn them into errors) 
> any
> gate that considers the global option setting will also interfere with that.

When the gate is executed, the flags are set according to cfun, so you
can just combine all the warning options for warnings issued by the pass
into the gate.


Re: [Bug ipa/104203] [12 Regressions] huge IPA compile-time regression since r12-6606-g9d6a0f388eb048f8

2022-01-24 Thread Jan Hubicka via Gcc-bugs
So I assume that this is due to the new pass_waccess which was added
into early optimizations.  I think this is not really an ipa component
issue but tree-optimization.


Re: [Bug tree-optimization/103195] [12 Regression] tfft2 text grows by 70% with -Ofast since r12-5113-gd70ef65692fced7a

2022-01-18 Thread Jan Hubicka via Gcc-bugs
> So nothing to see?  I guess our unit growth limit doesn't trigger because it's
> a small (benchmark) unit?
Yep, the unit growth limits do not apply to very small units.  The
ipa-cp heuristics still IMO need work and should base the cutoffs on
relative speedups rather than absolute ones.


Re: [Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread Jan Hubicka via Gcc-bugs
> 
> Sure - I just remember (falsely?) that we finally decided to do it :)

I do not recall this, but I may have forgotten :))

> If we don't run IPA inline we don't figure we failed to inline the
> always_inline either ;)  And IPA inline can expose more indirect
> always-inlines we only discover after even more optimization so the
> issue is really moot unless we sorry () (or link-fail).

The problem with the kernel was that it relied on quite complicated
indirect inlining of always_inline functions and did not work without
it.  In the beginning I think we should have introduced two attributes -
always_inline and disregard_inline_limits - just like we have
internally.  always_inline should never have allowed public linkage or
taking the function's address, but it is probably too late to fix that :(

Honza


Re: [Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread Jan Hubicka via Gcc-bugs
> You cannot disable an IPA pass because then we will mishandle
> optimize attributes.  I think you simply want to set
> 
> flag_inline_small_functions = 0
> flag_inline_functions_called_once = 0 

Actually I forgot: we have flag_no_inline, which makes
tree_inlinable_function_p return false for everything except
ALWAYS_INLINE, so we only want to set this one for -Og.



Re: [Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread Jan Hubicka via Gcc-bugs
> --- Comment #6 from Richard Biener  ---
> Honza, -Og was supposed to not do so much work, I intended to disable IPA
> inlining but there's no knob for that.  I wonder where to best put such
> guard?  I set flag_inline_small_functions to zero for -Og but we still
> run inline_small_functions ().  Basically -Og was supposed to only do
> early opts and then what is necessary for correct RTL expansion.  Doing
> IPA inlining defeats this :/
> 
> Can you help?  Is it safe to simply gate the inline_small_functions ()
> call?  Do we want an extra -f[no-]ipa-inline like we have -fearly-inlining?
> 
> Using -fdisable-ipa-inline gets rid of the diagnostic

You cannot disable an IPA pass because then we will mishandle
optimize attributes.  I think you simply want to set

flag_inline_small_functions = 0
flag_inline_functions_called_once = 0 

and we should only inline always_inlines.  inline_small_functions will
still loop and check the inlinability of functions, but if everything is
compiled with -Og it will not find anything inlinable and exit.

Perhaps we may also extend initialize_inline_failed to add
CIF_DEBUG_OPTIMIZE so -Winline says something more useful than
"function not considered".

Honza


Re: [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-11 Thread Jan Hubicka via Gcc-bugs
On Zen 2 and Zen 3 with -flto the speedup seems to be circa 12% for both
-O2 and -Ofast -march=native, which is very nice!
Zen 1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.

Overall it seems a win, but there are a few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-SPEC benchmarks there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
It may be noise since the kernels are tiny, but for example x293
reproduces on both Kaby Lake and Zen with an 80-90% regression that may
be easy to track down (the kernel is included in the testsuite).  The
same regression is not seen on Zen 3, so it may be ISA specific.

Finally, there seem to be relatively large code size savings on the
polyhedron benchmarks today (8% on capacita, 

Thanks a lot!


Re: [Bug gcov-profile/103652] Producing profile with -O2 -flto and trying to consume it with -O3 -flto leads to ICEs on indirect call profiling

2021-12-13 Thread Jan Hubicka via Gcc-bugs
> 
> Well, I'm specifically speaking about:
> error: the control flow of function ‘BZ2_compressBlock’ does not match its
> profile data (counter ‘arcs’) 
> 
> this type of error should not happen even in multi-threaded programs.

There are some cases where I see even those on a clang build - I am not
sure how that happens (whether it is a configury difference, generated
code, or a gcc bug).  It is on my TODO to analyse...

In any case we should never ICE on malformed gcda files. Especially not
by buffer overflow :)
> 
> > I think you can produce testcase easily by making a function with one
> > indirect call for train run and many indirect calls in profile-use run.
> > 
> > I have patch to avoid the buffer overflow - can send it after getting to
> > office.
> 
> Sure, please send it.
Attached.

Honza
diff --git a/gcc/coverage.c b/gcc/coverage.c
index 7f8b532cb52..49c370cb8c8 100644
--- a/gcc/coverage.c
+++ b/gcc/coverage.c
@@ -296,7 +296,7 @@ read_counts_file (void)
 
 gcov_type *
 get_coverage_counts (unsigned counter, unsigned cfg_checksum,
-unsigned lineno_checksum, unsigned int n_counts)
+unsigned lineno_checksum, unsigned int *n_counts)
 {
   counts_entry *entry, elt;
 
@@ -348,12 +348,12 @@ get_coverage_counts (unsigned counter, unsigned cfg_checksum,
   if (entry->cfg_checksum != cfg_checksum
   || (counter != GCOV_COUNTER_V_INDIR
  && counter != GCOV_COUNTER_V_TOPN
- && entry->n_counts != n_counts))
+ && entry->n_counts != *n_counts))
 {
   static int warned = 0;
   bool warning_printed = false;
 
-  if (entry->n_counts != n_counts)
+  if (entry->n_counts != *n_counts)
warning_printed =
  warning_at (DECL_SOURCE_LOCATION (current_function_decl),
  OPT_Wcoverage_mismatch,
@@ -361,7 +361,7 @@ get_coverage_counts (unsigned counter, unsigned cfg_checksum,
  "does not match "
  "its profile data (counter %qs, expected %i and have %i)",
  current_function_decl,
- ctr_names[counter], entry->n_counts, n_counts);
+ ctr_names[counter], entry->n_counts, *n_counts);
   else
warning_printed =
  warning_at (DECL_SOURCE_LOCATION (current_function_decl),
@@ -404,9 +404,25 @@ get_coverage_counts (unsigned counter, unsigned cfg_checksum,
  current_function_decl);
 }
 
+  *n_counts = entry->n_counts;
   return entry->counts;
 }
 
+/* Returns the counters for a particular tag and verifies that counts matches
+   the expectation.  */
+
+gcov_type *
+get_coverage_counts (unsigned counter, unsigned cfg_checksum,
+unsigned lineno_checksum, unsigned int n_counts)
+{
+  unsigned int n_counts2 = n_counts;
+  gcov_type *ret
+ = get_coverage_counts (counter, cfg_checksum,
lineno_checksum, &n_counts2);
+  gcc_assert (!ret || n_counts2 == n_counts);
+  return ret;
+}
+
 /* Allocate NUM counters of type COUNTER. Returns nonzero if the
allocation succeeded.  */
 
diff --git a/gcc/coverage.h b/gcc/coverage.h
index 22646d439fc..7f488811a4e 100644
--- a/gcc/coverage.h
+++ b/gcc/coverage.h
@@ -54,6 +54,10 @@ extern gcov_type *get_coverage_counts (unsigned /*counter*/,
   unsigned /*cfg_checksum*/,
   unsigned /*lineno_checksum*/,
   unsigned /*n_counts*/);
+extern gcov_type *get_coverage_counts (unsigned /*counter*/,
+  unsigned /*cfg_checksum*/,
+  unsigned /*lineno_checksum*/,
+  unsigned */*n_counts*/);
 
 extern tree get_gcov_type (void);
 extern bool coverage_node_map_initialized_p (void);
diff --git a/gcc/profile.c b/gcc/profile.c
index d4103058fcd..0fe0910c296 100644
--- a/gcc/profile.c
+++ b/gcc/profile.c
@@ -898,7 +898,7 @@ compute_value_histograms (histogram_values values, unsigned cfg_checksum,
   histogram_counts[t] = get_coverage_counts (COUNTER_FOR_HIST_TYPE (t),
 cfg_checksum,
 lineno_checksum,
-n_histogram_counters[t]);
+&n_histogram_counters[t]);
   if (histogram_counts[t])
any = 1;
   act_count[t] = histogram_counts[t];
@@ -918,20 +918,47 @@ compute_value_histograms (histogram_values values, unsigned cfg_checksum,
   /* TOP N counter uses variable number of counters.  */
   if (topn_p)
{
- unsigned total_size;
+ gcov_type total_size;
+ bool ignore = false;
  if (act_count[t])
-   total_size = 2 + 2 * act_count[t][1];
+   {
+ total_size = 2 + 2 * act_count[t][1];
+ /* Watch for counter corruption

Re: [Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-22 Thread Jan Hubicka via Gcc-bugs
The patch passed testing on x86_64-linux.


Re: [Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-22 Thread Jan Hubicka via Gcc-bugs
This is a slightly modified patch I am testing.  I added pre-computation
of the number of accesses, enabled the path for const functions (in case
they have a memory operand), initialized alias sets, and clarified the
logic around every_* and global_memory_accesses.

PR tree-optimization/103168
* ipa-modref.c (modref_summary::finalize): Initialize load_accesses.
* ipa-modref.h (struct modref_summary): Add load_accesses.
* tree-ssa-sccvn.c (visit_reference_op_call): Use modref
info to walk the virtual use->def chain to CSE pure
function calls.

* g++.dg/tree-ssa/pr103168.C: New testcase.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 4f9323165ea..595eb6e0d8f 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -725,6 +727,23 @@ modref_summary::finalize (tree fun)
break;
}
 }
+  if (loads->every_base)
+load_accesses = 1;
+  else
+{
+  load_accesses = 0;
+  for (auto base_node : loads->bases)
+   {
+ if (base_node->every_ref)
+   load_accesses++;
+ else
+   for (auto ref_node : base_node->refs)
+ if (ref_node->every_access)
+   load_accesses++;
+ else
+   load_accesses += ref_node->accesses->length ();
+   }
+}
 }
 
 /* Get function summary for FUNC if it exists, return NULL otherwise.  */
diff --git a/gcc/ipa-modref.h b/gcc/ipa-modref.h
index f868eb6de07..a7937d74945 100644
--- a/gcc/ipa-modref.h
+++ b/gcc/ipa-modref.h
@@ -53,6 +53,8 @@ struct GTY(()) modref_summary
 
   /* Flags coputed by finalize method.  */
 
+  /* Total number of accesses in loads tree.  */
+  unsigned int load_accesses;
   /* global_memory_read is not set for functions calling functions
  with !binds_to_current_def which, after interposition, may read global
  memory but do nothing useful with it (except for crashing if some
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr103168.C b/gcc/testsuite/g++.dg/tree-ssa/pr103168.C
new file mode 100644
index 000..82924a3e3ce
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr103168.C
@@ -0,0 +1,24 @@
+// { dg-do compile }
+// { dg-options "-O2 -fdump-tree-fre1-details" }
+
+struct a
+{
+  int a;
+  static __attribute__ ((noinline))
+  int ret (int v) {return v;}
+
+  __attribute__ ((noinline))
+  int inca () {return a++;}
+};
+
+int
+test()
+{
+  struct a av;
+  av.a=1;
+  int val = av.ret (0) + av.inca();
+  av.a=2;
+  return val + av.ret(0) + av.inca();
+}
+
+/* { dg-final { scan-tree-dump-times "Replaced a::ret" 1 "fre1" } } */
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 149674e6a16..719f5184654 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -71,6 +71,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-loop-niter.h"
 #include "builtins.h"
 #include "fold-const-call.h"
+#include "ipa-modref-tree.h"
+#include "ipa-modref.h"
 #include "tree-ssa-sccvn.h"
 
 /* This algorithm is based on the SCC algorithm presented by Keith
@@ -5084,12 +5086,118 @@ visit_reference_op_call (tree lhs, gcall *stmt)
   struct vn_reference_s vr1;
   vn_reference_t vnresult = NULL;
   tree vdef = gimple_vdef (stmt);
+  modref_summary *summary;
 
   /* Non-ssa lhs is handled in copy_reference_ops_from_call.  */
   if (lhs && TREE_CODE (lhs) != SSA_NAME)
 lhs = NULL_TREE;
 
vn_reference_lookup_call (stmt, &vnresult, &vr1);
+
+  /* If the lookup did not succeed for pure functions try to use
+ modref info to find a candidate to CSE to.  */
+  const int accesses_limit = 8;
+  if (!vnresult
+  && !vdef
+  && lhs
+  && gimple_vuse (stmt)
+  && (((summary = get_modref_function_summary (stmt, NULL))
+  && !summary->global_memory_read
+  && summary->load_accesses < accesses_limit)
+ || gimple_call_flags (stmt) & ECF_CONST))
+{
+  /* First search if we can do something useful and build a
+vector of all loads we have to check.  */
+  bool unknown_memory_access = false;
+  auto_vec<ao_ref, accesses_limit> accesses;
+
+  if (summary)
+   {
+ for (auto base_node : summary->loads->bases)
+   if (unknown_memory_access)
+ break;
+   else for (auto ref_node : base_node->refs)
+ if (unknown_memory_access)
+   break;
+ else for (auto access_node : ref_node->accesses)
+   {
+ accesses.quick_grow (accesses.length () + 1);
+ if (!access_node.get_ao_ref (stmt, &accesses.last ()))
+   {
+ /* We could use get_call_arg (...) and initialize
+a ref based on the argument and unknown offset in
+some cases, but we have to get a ao_ref to
+disambiguate against other stmts.  */
+ unknown_memory_access = true;
+ break;
+   }
+ else
+   {
+ 

Re: [Bug driver/100937] configure: Add --enable-default-semantic-interposition

2021-11-22 Thread Jan Hubicka via Gcc-bugs
> (The -fno-semantic-interposition thing is probably the biggest performance gap
> between gcc -fpic and clang -fpic.)
Yep, it is often confusing to users (who do not understand what ELF
interposition is) that clang and gcc disagree on the default flags here.
Recently -Ofast was extended to imply -fno-semantic-interposition, which
will hopefully make more people notice this.

While doing that I added a per-symbol flag about interposition to the
symbol table, so we can also support 

__attribute__ ((semantic_interposition))

and

__attribute__ ((no_semantic_interposition))

if that would be useful for something.


Re: [Bug tree-optimization/103300] New: wrong code at -O3 on x86_64-linux-gnu

2021-11-17 Thread Jan Hubicka via Gcc-bugs
Needs -O2 -floop-unroll-and-jam --param early-inlining-insns=14
to fail, so I guess it may be an issue with unroll-and-jam.


Re: [Bug ipa/103267] Wrong code with ipa-sra

2021-11-16 Thread Jan Hubicka via Gcc-bugs
> @@ -1,4 +1,3 @@
> -static int
>  __attribute__ ((noinline,const))
>  infinite (int p)
>  {
Just for the record, it crashes with or without the static int here for me :)

I ran across it because the code tracking must-accesses in ipa-sra is
IMO conceptually wrong.  I noticed that because ipa-modref solves a
similar problem for kills (both need to verify that a given access will
always happen).  The post-dominance check is not enough to verify that,
because earlier function calls can do things like EH.  I failed to
construct an actual testcase because on interesting stuff like EH we
punt for other reasons (missed fnspec annotations on EH builtins).  I
will play with it more today.


Re: [Bug ipa/103267] Wrong code with ipa-sra

2021-11-16 Thread Jan Hubicka via Gcc-bugs
Aha, but here is a better example (it reproduces the same way).
In the former one I forgot the const attribute, which makes it invalid.
The testcase shows that ipa-sra is missing the ECF_LOOPING_CONST_OR_PURE
check:

static int
__attribute__ ((noinline))
infinite (int p)
{
  if (p)
while (1);
  return p;
}
__attribute__ ((noinline))
static void
test(int p, int *a)
{
  int v = infinite (p);
  if (*a && v)
__builtin_abort ();
}
test2(int *a)
{
  test(0,a);
}
main()
{
  test (1,0);
}


Re: [Bug ipa/103267] Wrong code with ipa-sra

2021-11-16 Thread Jan Hubicka via Gcc-bugs
Works for me even with the 3 warnings.

hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ cat >tt.c
__attribute__ ((noinline,const))
infinite (int p)
{
  if (p)
while (1);
  return p;
}
__attribute__ ((noinline))
static void
test(int p, int *a)
{
  int v = infinite (p);
  if (*a && v)
__builtin_abort ();
}
test2(int *a)
{
  test(0,a);
}
main()
{
  test (1,0);
}
hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ ./xgcc --version
xgcc (GCC) 12.0.0 2024 (experimental)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ ./xgcc -B ./ -O2 tt.c
tt.c:2:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
2 | infinite (int p)
  | ^~~~
tt.c:16:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
   16 | test2(int *a)
  | ^
tt.c:20:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
   20 | main()
  | ^~~~
hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ ./a.out
Segmentation fault



Re: [Bug tree-optimization/103231] New: ICE (nondeterministic) on valid code at -O1 on x86_64-linux-gnu: Segmentation fault

2021-11-14 Thread Jan Hubicka via Gcc-bugs
> [659] % 
> [659] % gcctk -O0 -w small.c
> [660] % 
> [660] % gcctk -O1 -w small.c
> [661] % gcctk -O1 -w small.c
> [662] % gcctk -O1 -w small.c
> gcctk: internal compiler error: Segmentation fault signal terminated program
> cc1
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See  for instructions.
A backtrace here would be useful.  It is a bit strange that you did not
get it from the error message.  One can use -S -wrapper gdb,--args to
make cc1 execute within gdb.
> [663] %


Re: [Bug ipa/103230] ipa-modref-tree.h:550:33: runtime error: load of value 255, which is not a valid value for type 'bool'

2021-11-14 Thread Jan Hubicka via Gcc-bugs
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103230
> 
> --- Comment #2 from Martin Liška  ---
> > How do you build ubsan compiler?
> 
> F="-O0 -g -fsanitize=undefined" ; make -j16 all-host -k CFLAGS="$F"
> CXXFLAGS="$F"  LDFLAGS="$F"
> 
> is the fastest approach.
Thanks, it is similar to what I tried.  I guess there should be no ";",
but it still leads to a misconfigured libiberty for me on kunlun.  I
will look into that.


Re: [Bug ipa/103230] New: ipa-modref-tree.h:550:33: runtime error: load of value 255, which is not a valid value for type 'bool'

2021-11-14 Thread Jan Hubicka via Gcc-bugs
> Happens with UBSAN compiler for:
> 
> $ gcc gcc/testsuite/gcc.c-torture/execute/pr71494.c -O1  -flto
> ...
> /home/marxin/Programming/gcc/gcc/ipa-modref-tree.h:550:33: runtime error: load
> of value 255, which is not a valid value for type 'bool'
> #0 0x18acc38 in modref_tree::merge(modref_tree*,
> vec*, modref_parm_map*, bool)
> /home/marxin/Programming/gcc/gcc/ipa-modref-tree.h:550
> #1 0x188452c in modref_propagate_in_scc

At 4385 I have:
   changed |= cur_summary_lto->stores->merge
(callee_summary_lto->stores, &parm_map, &chain_map, !first);
 

parm_map is the vector; however, there is no read of it here.
There is a bool which is relevant only when parm_index is not unknown,
so I suspect it may be a full copy with an uninitialized bool, which
would be harmless.  We had similar issues with asan before.

How do you build ubsan compiler?
Honza


Re: [Bug ipa/103211] [12 Regression] 416.gamess crashes after r12-5177-g494bdadf28d0fb35

2021-11-12 Thread Jan Hubicka via Gcc-bugs
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103211
> 
> --- Comment #2 from Martin Liška  ---
> Optimized dump differs for couple of functions in the same way:
> 
> diff -u good bad
> --- good2021-11-12 17:42:36.995947103 +0100
> +++ bad 2021-11-12 17:41:56.728194961 +0100
> @@ -38,7 +38,6 @@
> 
>  ;; Function abrt (abrt_, funcdef_no=10, decl_uid=4338, cgraph_uid=11,
> symbol_order=10) (executed once)
> 
> -Removing basic block 5
>  __attribute__((fn spec (". ")))
>  void abrt ()
>  {
> @@ -350,7 +349,6 @@
>  void setfm (integer(kind=4) * ipar)
>  {
> [local count: 1073741824]:
> -  master.0.setfm (0, ipar_2(D)); [tail call]
>return;
> 
>  }
> 
> maybe the fnspec for master.0.setfm is bad?
> 
> __attribute__((fn spec (". R w ")))
> void master.0.setfm (integer(kind=8) __entry, integer(kind=4) * ipar)
> {
It looks more like pure/const discovery.  You should be able to use
-fdump-ipa-all -fdump-tree-all and grep for "function found to be"
either pure or const.

What is the body of master.0.setfm?  Does it look like it does nothing?

"R" in the fnspec means that arg 0 is only read directly and not dereferenced.
"w" means that arg 1 does not escape.

Honza


Re: [Bug tree-optimization/103175] [12 Regression] internal compiler error: in handle_call_arg, at tree-ssa-structalias.c:4139

2021-11-11 Thread Jan Hubicka via Gcc-bugs
The sanity check verifies that a function accessing a parameter
indirectly also reads the parameter (otherwise the indirect reference
cannot happen).  This patch moves the check earlier and removes some
overactive flag cleaning on the function call boundary which introduced
the nonsensical situation.  I got a bit paranoid here about how the
return value relates to the escaping solution.

As discussed on the ML, the matmul failure is simply the fact that the
missed optimization the testcase verifies is still missed.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 72006251f29..a97021c6c60 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1681,6 +1681,13 @@ modref_lattice::merge (int f)
 {
   if (f & EAF_UNUSED)
 return false;
+  /* Check that flags seems sane: if function does not read the parameter
+ it can not access it indirectly.  */
+  gcc_checking_assert (!(f & EAF_NO_DIRECT_READ)
+  || ((f & EAF_NO_INDIRECT_READ)
+  && (f & EAF_NO_INDIRECT_CLOBBER)
+  && (f & EAF_NO_INDIRECT_ESCAPE)
+  && (f & EAF_NOT_RETURNED_INDIRECTLY)));
   if ((flags & f) != flags)
 {
   flags &= f;
@@ -1874,27 +1881,13 @@ modref_eaf_analysis::merge_call_lhs_flags (gcall *call, int arg,
argument if needed.  */
 
 static int
-callee_to_caller_flags (int call_flags, bool ignore_stores,
-   modref_lattice )
+callee_to_caller_flags (int call_flags, bool ignore_stores)
 {
   /* call_flags is about callee returning a value
  that is not the same as caller returning it.  */
   call_flags |= EAF_NOT_RETURNED_DIRECTLY
| EAF_NOT_RETURNED_INDIRECTLY;
-  /* TODO: We miss return value propagation.
- Be conservative and if value escapes to memory
- also mark it as escaping.  */
-  if (!ignore_stores && !(call_flags & EAF_UNUSED))
-{
-  if (!(call_flags & EAF_NO_DIRECT_ESCAPE))
-   lattice.merge (~(EAF_NOT_RETURNED_DIRECTLY
-| EAF_NOT_RETURNED_INDIRECTLY
-| EAF_UNUSED));
-  if (!(call_flags & EAF_NO_INDIRECT_ESCAPE))
-   lattice.merge (~(EAF_NOT_RETURNED_INDIRECTLY
-| EAF_UNUSED));
-}
-  else
+  if (ignore_stores)
 call_flags |= ignore_stores_eaf_flags;
   return call_flags;
 }
@@ -2033,15 +2026,9 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
  if (!(call_flags & (EAF_NOT_RETURNED_DIRECTLY
  | EAF_UNUSED)))
m_lattice[index].merge (~(EAF_NO_DIRECT_ESCAPE
- | EAF_NO_INDIRECT_ESCAPE
- | EAF_UNUSED));
- if (!(call_flags & (EAF_NOT_RETURNED_INDIRECTLY
- | EAF_UNUSED)))
-   m_lattice[index].merge (~(EAF_NO_INDIRECT_ESCAPE
  | EAF_UNUSED));
  call_flags = callee_to_caller_flags
-  (call_flags, false,
-   m_lattice[index]);
+  (call_flags, false);
}
  m_lattice[index].merge (call_flags);
}
@@ -2057,8 +2044,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
  !(call_flags & EAF_NOT_RETURNED_DIRECTLY),
  !(call_flags & EAF_NOT_RETURNED_INDIRECTLY));
  call_flags = callee_to_caller_flags
-  (call_flags, ignore_stores,
-   m_lattice[index]);
+  (call_flags, ignore_stores);
  if (!(ecf_flags & (ECF_CONST | ECF_NOVOPS)))
m_lattice[index].merge (call_flags);
}
@@ -2082,8 +2068,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
if (!(ecf_flags & (ECF_CONST | ECF_NOVOPS)))
  {
call_flags = callee_to_caller_flags
-(call_flags, ignore_stores,
- m_lattice[index]);
+(call_flags, ignore_stores);
if (!record_ipa)
  m_lattice[index].merge (call_flags);
else
@@ -2105,8 +2090,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
else
  {
call_flags = callee_to_caller_flags
-(call_flags, ignore_stores,
- m_lattice[index]);
+(call_flags, ignore_stores);
if (!record_ipa)
  

Re: [Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast on Zen since r12-4526-gd8edfadfc7a9795b65177a50ce44fd348858e844

2021-11-08 Thread Jan Hubicka via Gcc-bugs
Note that it still seems to me that the crossed_loop_header handling is
overly conservative.  We have:

@@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
   bool seen_latch = false;
   int loops_crossed = 0;
   bool crossed_latch = false;
+  bool crossed_loop_header = false;
   // Use ->dest here instead of ->src to ignore the first block.  The
   // first block is allowed to be in a different loop, since it'll be
   // redirected.  See similar comment in profitable_path_p: "we don't
@@ -2804,6 +2805,14 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
  ++loops_crossed;
}
 
+  // ?? Avoid threading through loop headers that remain in the
+  // loop, as such threadings tend to create sub-loops which
+  // _might_ be OK ??.
+  if (e->dest->loop_father->header == e->dest
+ && !flow_loop_nested_p (exit->dest->loop_father,
+ e->dest->loop_father))
+   crossed_loop_header = true;
+
   if (flag_checking && !m_backedge_threads)
gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
 }
@@ -2829,6 +2838,21 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
   cancel_thread (, "Path crosses loops");
   return true;
 }
+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+  && !flow_loop_nested_p (exit->src->loop_father,
+ entry->dest->loop_father))
+{
+  cancel_thread (, "Path rotates loop");
+  return true;
+}
+  if (crossed_loop_header)
+{
+  cancel_thread (, "Path crosses loop header but does not exit it");
+  return true;
+}
   return false;
 }
 
If I read it correctly, for a path that enters the loop and later leaves
it (where threading is desirable since we skip the whole loop) the logic
above will still return true (after finishing the whole walk which seems
like a waste).

This may trigger more often at -Os since we limit loop header copying.

And indeed, fixing the profile updating would be nice.  Why is the
updating code not reused across the different threaders?  (I wrote
several thread-updating functions for various threaders introduced &
removed in the past, and I wonder why we need to keep reinventing it.)


Re: [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

2021-11-07 Thread Jan Hubicka via Gcc-bugs
> 
> This PR is still open, at least for slowdown in the threader with LTO.  The
> issue is ranger wide, so it may also cause slowdowns  on non-LTO builds for
> WRF, though I haven't checked.
I just wanted to record the fact somewhere, since I was looking up the
revision range mostly to figure out if there was a modref change that
may have caused this.

Non-LTO builds seem fine.  I suppose LTO is needed to make big enough
CFGs.  Thanks for looking into it.

Honza


Re: [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

2021-11-04 Thread Jan Hubicka via Gcc-bugs
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> Aldy Hernandez  changed:
> 
>What|Removed |Added
> 
>  Depends on||103058
> 
> --- Comment #18 from Aldy Hernandez  ---
> 251.wrf_r is no longer building.  Seems to be the same issue in PR103058.
> 
> during GIMPLE pass: alias
> module_fr_fire_phys.fppized.f90: In function 'init_fuel_cats':
> module_fr_fire_phys.fppized.f90:136:25: internal compiler error: in
> gimple_call_static_chain_flags, at gimple.c:1669
>   136 | subroutine init_fuel_cats
>   | ^
> 0x6957b5 gimple_call_static_chain_flags(gcall const*)
> /home/aldyh/src/clean/gcc/gimple.c:1669

I have committed a workaround for this.
However, here it looks like a frontend issue - I do not think Fortran
should produce nested functions with external linkage.  At least there
seems to be no good reason for doing so, since they cannot be called
cross-module.

Honza


Re: [Bug d/103040] [12 Regression] gdc.dg/torture/pr101273.d FAILs

2021-11-02 Thread Jan Hubicka via Gcc-bugs
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103040
> 
> --- Comment #15 from Iain Buclaw  ---
> Got it. The difference between D and C++ is a matter of early inlining.
> 
> The C++ example Jakub posted fails in the same way that D does if you compile
> with: -O1 -fno-inline
Great, I will take a look now (I was travelling, which is why I did not
start earlier).

Honza


Re: [Bug d/103040] [12 Regression] gdc.dg/torture/pr101273.d FAILs

2021-11-02 Thread Jan Hubicka via Gcc-bugs
> See above comments from Iain, even if that pre-initialization is removed it is
> still miscompiled.  And, the testcase fails not because of the padding bits 
> not
> being zero, but because the address of self stored into one of the fields 
> isn't
> there or modref thinks it can't be changed or set to that.  But for
> corresponding C++ it handles it ok.
Perhaps it is TREE_ADDRESSABLE on the type, which is being used to test
whether the return slot pointer may escape.


Re: [Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast between ce4d1f632ff3f680550d3b186b60176022f41190 and 6fca1761a16c68740f875fc487b98b6bde8e9be7

2021-10-29 Thread Jan Hubicka via Gcc-bugs
> Not seen on Haswell (but w/o PGO).  Is this PGO specific?  There's another
> large jump visible end of 2019.
It is between 2019-11-15 and 18, but the revisions do not exist in git -
perhaps they refer to the old git mirror.  Martin will know better.

In that range there are many of Richard's vectorizer changes and my
patch fixing the calculation of ref time in the inliner, which may be
the culprits.
Honza


Re: [Bug ipa/102982] [12 Regression] Dead Code Elimination Regression at -O3 (trunk vs 11.2.0)

2021-10-28 Thread Jan Hubicka via Gcc-bugs
> 
> fixup_cfg already removes write-only stores so that seems fit for that
> purpose.
> 
> Btw,
> 
> static int x = 1;
> 
> int main()
> {
>   x = 1;
> }
> 
> should ideally be handled as well as maybe the more common(?)
> 
> static int x[128];
> 
> int main()
> {
>   memset (x, 0, 128*4);
> }
> 
> so we'd like to store a (constant) RHS for the stores in the summaries?
> (I suppose we cannot selectively stream in a single stmt ;))

Yep, what we want is a way to attach summaries to particular
IPA_REF_load/store/addr references just like we annotate call sites...
It would be nice to extend symbol-summary.h for that.  Martin, would you
be interested in looking into it?

Honza