[Bug middle-end/106078] Invalid loop invariant motion with non-call-exceptions

2022-06-25 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106078

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
> > For this one it's PRE hoisting *b across the endless loop (PRE handles
> > calls as possibly not returning but not loops as possibly not 
> > terminating...)
> > So it's a different bug.
> 
> Btw, C++ requiring forward progress makes the testcase undefined.
In my understanding access to volatile variable is a forward progres:
In a valid C++ program, every thread eventually does one of the
following:

   -terminate
   -makes a call to an I/O library function
   -performs an access through a volatile glvalue
   -performs an atomic operation or a synchronization operation 

I think one can also replace volatile access by atomics: we only need to
know the side effects of that operation.
Honza

[Bug tree-optimization/105739] [10 Regression] Miscompilation of Linux kernel update.c

2022-06-20 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105739

--- Comment #17 from hubicka at kam dot mff.cuni.cz ---
> Fixed (Honza, hope you don't mind the backports I've done, did that so that it
> is on time for 10.4).
Thanks. I don't mind: I was planning to do them this week anyway and
also extra VOLATILE check shoud not be very risky.

[Bug tree-optimization/105973] Wrong branch prediction for if (COND) { if(x) noreturn1(); else noreturn2(); }

2022-06-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105973

--- Comment #7 from hubicka at kam dot mff.cuni.cz ---
> I can try implementing that.
That would be nice.  I think if path predictor logic hits the same
predictor both ways, it can simply predict that basic block with the
same predictor (i.e. recurse on that block).   It will need to keep
track what basic blocks were already predicted to handle cycles.

Honza

[Bug tree-optimization/105973] Wrong branch prediction for if (COND) { if(x) noreturn1(); else noreturn2(); }

2022-06-15 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105973

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> possibly the noreturn predictor doesn't trigger here
The problem here is that we handle throw1 and throw2 as two independent
reason for code path to be unlikely.  This makes us to look for
conditional controling the code path and we predict the inner if both
ways as unlikely so they cancel out.

I suppose we could be smarter here and teach path predictor to check for
this situation and if both directions of a branch are considered bad for
same reason continue propagating up.  This shoudl not be hard to
implement.

[Bug c/105977] [12/13 Regression] ICE in gimple_call_static_chain_flags, at gimple.cc:1636

2022-06-15 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105977

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
Inline transform function does fixup_cfg and other mandatory things so
disabling it will likely cause other kind of ICEs as well.  I wonder if
there is more robust way to do so (one can make these part of
all_ipa_tranforms and not part of a pass but it looks bit ugly).

[Bug lto/105877] GNU strip breaks -flto -g .o files

2022-06-09 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105877

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> "strip -g" removed .gnu.debuglto_.debug_info sections. Should LTO remove the
> references of stripped debug info?  Or should "strip -g" keep LTO debug info?

I suppose we should try to immitate what happens without -flto, so
probably check for the presence of debuglto sections and avoid producing
debug info when they have been stripped?
I am not sure how hard would be to implement this especially in
situations where part of object files were stripped and others not?

[Bug tree-optimization/105739] [10 Regression] Miscompilation of Linux kernel update.c

2022-06-04 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105739

--- Comment #7 from hubicka at kam dot mff.cuni.cz ---
> Honza, any estimate how long this could take?  I'd prefer to wait with 10.4 
> for
> it  if it isn't going to take too long.
I am at a conference this week with talk at Wednesday.  I will try to
debug this during the event.

Honza

[Bug lto/105727] __builtin_constant_p expansion in LTO

2022-05-26 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105727

--- Comment #15 from hubicka at kam dot mff.cuni.cz ---
> No, see c#10.
I know it will work if BUILD_BUG call is removed. However the only
reason I can see why original author put it there is that he/she wanted
to write special case checkers for constants that appears in practice
and wanted unhandled constants to turn to compiler errors rather than
quietly going the slow path...

[Bug lto/105727] __builtin_constant_p expansion in LTO

2022-05-25 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105727

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> > My guess is that the
> > BUILD_BUG();
> > line is the sole thing that is wrong, it should be just break;
> > as the memory_is_poisoned_n(addr, size); will handle all the sizes,
> > regardless if they are constants or not.
> 
> Sure, I'm going to suggest such a change.
To me it looked like a protection that size is not going to be large
(or perhaps author wants to add extra special cases as they are needed)

Honza

[Bug c/105728] dead store to static var not optimized out

2022-05-25 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105728

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
> To me, all of these do the same thing and should generate the same code.
> As nobody else can see removeme, and we aren't leaking its address, shouldn't
> the compiler be able to deduce that all accesses to removeme are
> inconsequential and can be removed?
> 
> My gcc 11.3 generates a condidion and a store and a return 0 for dummy1, the
> same thing for dummy2, but for dummy3 it understands that it only needs to 
> emit
> a return 0.

GCC detects "write olny" variables and that is what matches for dummy3.
I am not 100% sure it is valid to do the optimization in other two cases
since when multiple threads are considered. In any case we lack tracking
of constants stored to global variables which is something ipa-cp can be
extended to.

Honza

[Bug fortran/103662] [12 Regression] TBAA problem in Fortran FE triggering in gfortran.dg/unlimited_polymorphic_3.f03

2022-04-26 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103662

--- Comment #24 from hubicka at kam dot mff.cuni.cz ---
Thanks a lot!
Honza

[Bug fortran/103662] [12 Regression] TBAA problem in Fortran FE triggering in gfortran.dg/unlimited_polymorphic_3.f03

2022-04-20 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103662

--- Comment #19 from hubicka at kam dot mff.cuni.cz ---
> I agree with Jakub that this is good progression so we should probably get 
> this
> to a shape that is committable and commit it.
> 
> Btw - thanks for working on the Fortran frontend issues!
Indeed! This bug was haunting me for a while :)

[Bug ipa/103818] [12 Regression] ICE: in insert, at ipa-modref-tree.c:591 since r12-3202-gf5ff3a8ed4ca9173

2022-04-12 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103818

--- Comment #6 from hubicka at kam dot mff.cuni.cz ---
> For 128-bit math, you can e.g. use poly_offset_int, which generally looks like
> a better type for these offsets and sizes (especially if they are counted in
> bits not bytes).
> Just to perform the <= comparison on the wider poly_offset_int, perhaps:
> --- gcc/ipa-modref-tree.cc.jj   2022-04-06 16:44:44.0 +0200
> +++ gcc/ipa-modref-tree.cc  2022-04-11 17:50:02.937784764 +0200
> @@ -380,9 +380,13 @@ modref_access_node::update2 (poly_int64
>  new_max_size = max_size2;
>else
>  {
> -  new_max_size = max_size2 + offset2 - offset1;
> -  if (known_le (new_max_size, max_size1))
> +  poly_offset_int n = max_size2;
> +  n += offset2;
> +  n -= offset1;
> +  if (known_le (n, max_size1))
> new_max_size = max_size1;
> +  else
> +   new_max_size = max_size2 + offset2 - offset1;
>  }
> 
>update (parm_offset1, offset1,
> (though, not sure how can you narrow that back to poly_int64).
> Though, the big question is what should one do with these overflows or
> underflows that aren't representable, update2 can't fail right now and the
> above still ICEs.
> Is there some way how to indicate it is an access to an unknown offset?
I sent patch for this (and plan to commit it today).  One can set new_max_size
to -1 which means unknown/unlimited rnag on overflow.  Underflow should
be impossible, since we always keep offsets/sizes nonnegative.

Honza

[Bug ipa/104303] [12 regression] gnatmake is miscompiled by IPA/modref

2022-04-07 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104303

--- Comment #10 from hubicka at kam dot mff.cuni.cz ---
Hi,
> 
> I'm testing this, adding escaped_local_p parameters throughout the API and
> adjusting consumers, throwing away ref_may_access_global_memory_p.
> 
> Following passes adjusted from false to true:
> 
> - DCE in mark_stmt_if_obviously_necessary but that's just compile-time
>   optimization I think
> - trans-mem.cc I think got it wrong, so now fixed (but trans-mem is dead)
> - the modref query in modref_may_conflict
> 
> all others keep passing false.

Thanks!  I was adding the function when introducing global memory logic
to modref. This was targeted to handle fatigue regression at -O2 but it
also helps in quite few extra cases.

Honza

[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

2022-03-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #50 from hubicka at kam dot mff.cuni.cz ---
> It helps quite a bit, the worst case is now
> 
>  tree VRP   :   5.14 (  7%)   0.02 (  3%)   5.15 (  
> 7%)
>2
> 9M (  3%)
>  backwards jump threading   :   4.05 (  6%)   0.00 (  0%)   4.06 (  
> 6%)
>  222
> 0k (  0%)
> 
> overall the patch reduces compile time from 766s to 749 (parallel compile,
> serial LTO, release checking).  So IMHO definitely worth it if you are happy
> with it.
This looks really promising.  Does it also solve the situation with
--param param_inline_functions_called_once_insns=100

I will benchmark how copmile time now behaves with respect increasing
this bound.

Honza

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #21 from hubicka at kam dot mff.cuni.cz ---
> I would say so.  It saves code size and also uop space unless the two
> can magically fuse to a immediate to %xmm move (I doubt that).
I made simple benchmark

double a=10;
int
main()
{
long int i;
double sum,val1,val2,val3,val4;
 for (i=0;i<10;i++)
 {
#if 1
#if 1
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq  
%%r8, %0": "=x"(val1): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq  
%%r8, %0": "=x"(val2): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq  
%%r8, %0": "=x"(val3): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq  
%%r8, %0": "=x"(val4): :"r8","xmm11");
#else
asm __volatile__("movq %1, %%r8;vmovq   %%r8, %0":
"=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq   %%r8, %0":
"=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq   %%r8, %0":
"=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq   %%r8, %0":
"=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
asm __volatile__("vmovq   %1, %0": "=x"(val1):"m"(a)
:"r8","xmm11");
asm __volatile__("vmovq   %1, %0": "=x"(val2):"m"(a)
:"r8","xmm11");
asm __volatile__("vmovq   %1, %0": "=x"(val3):"m"(a)
:"r8","xmm11");
asm __volatile__("vmovq   %1, %0": "=x"(val4):"m"(a)
:"r8","xmm11");
#endif
sum+=val1+val2+val3+val4;
 }
 return sum;

and indeed the third variant runs 1.2s while the first two takes equal
time 2.4s on my zen2 laptop.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> 
> Yep, we also have code like
> 
> -   movabsq $0x3ff03db8fde2ef4e, %r8
> ...
> -   vmovq   %r8, %xmm11

It is loading random constant to xmm11.  Since reg<->xmm moves are
relatively cheap it looks OK to me that we generate this.  Is it faster
to load constant from the memory?
> movq.LC11(%rip), %rax
> vmovq   %rax, %xmm14
This is odd indeed and even more odd that we both movabs and memory load... 
i386 FE plays some games with allowing some constants in SSE
instructions (to allow simplification and combining) and split them out
to memory later.  It may be consequence of this.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > According to znver2_cost
> > 
> > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > instead of GPR.
> 
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack.  As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
> 
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.

I was simply following latencies here, so indeed reg<->mem bypass is not
really modelled.  I recall doing few experiments which was kind of
inconclusive.

[Bug tree-optimization/104203] [12 Regressions] huge compile-time regression since r12-6606-g9d6a0f388eb048f8

2022-01-24 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104203

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> > bool
> Since the pass issues a bunch other warnings (e.g., -Wstringop-overflow,
> -Wuse-after-free, etc.) the gate doesn't seem right.  But since #pragma GCC
> diagnostic can re-enable warnings disabled by -w (or turn them into errors) 
> any
> gate that considers the global option setting will also interfere with that.

What the gate is executed the flags are set according to cfun, so you
can just combine all warning options for warnings issued by the pass
into the gate.

[Bug ipa/104203] [12 Regressions] huge IPA compile-time regression since r12-6606-g9d6a0f388eb048f8

2022-01-24 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104203

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
So I assume that this is due to new pass_waccess which was added into
early optimizations.  I think this is not really ipa component but
tree-optimize.

[Bug ipa/104187] Call site specific attribute to control inliner

2022-01-24 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104187

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
> > Confirmed.  For C++ an attribute on the call stmt might work even better.
> 
> I think even C, it might be better too. [[]] syntax is there for GNU C now 
> too.
> But those only apply to the full statement and not just a subset of one ...
I agree it is quite desirable thing to have

Implementaiton on the inliner side is quite easy: all we need to do is
to detect the attribute and put it into the inline summary and then
handle it same way as always_inline.

However what is the way to add the attribute to front-ends and get it
attached to the gimple call statement?

[Bug tree-optimization/103195] [12 Regression] tfft2 text grows by 70% with -Ofast since r12-5113-gd70ef65692fced7a

2022-01-18 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103195

--- Comment #6 from hubicka at kam dot mff.cuni.cz ---
> So nothing to see?  I guess our unit growth limit doesn't trigger because it's
> a small (benchmark) unit?
Yep, unit growths do not apply for very small units.  ipa-cp heuristics
still IMO needs work and be based on relative speedups rather then
absolute for the cutoffs.

[Bug tree-optimization/103423] [12 Regression] 19% cpu2006 wrf compile time regression with -flto since r12-2353-g8da8ed435e9f01b3

2022-01-18 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103423

--- Comment #6 from hubicka at kam dot mff.cuni.cz ---
> Fixed, the links now show better than ever numbers.
It is only fixed by not inlining enough (since I added 
--param max-inline-functions-called-once).  Without LTO we still have
quite important slowdown.
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=332.270.8=304.270.8=340.270.8=272.270.8=392.270.8=411.270.8=291.270.8;

Do we want to open separate PR for that?

[Bug ipa/95558] [9/10/11/12 Regression] Invalid IPA optimizations based on weak definition

2022-01-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95558

--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> Result pure looping 0
> Function found to be pure: foo/4
This is good - we are supposed to find it to be pure and walk all
aliases and update noninterposable ones
> Declaration updated to be pure: foo/4
I have to take look why this happens though.

[Bug fortran/103662] [12 Regression] TBAA problem in Fortran FE triggering in gfortran.dg/unlimited_polymorphic_3.f03

2022-01-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103662

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> I'm inclined to make this P1 even though it is gfortran only.  As a last 
> resort
> it should work to make the receiver side a ref-all pointer.
Yes, I also think this is important bug (like all TBAA related wrong
codes).  Getting alias set 0 to the receiver is probably doable even
with my Fortran frontend knowledge, but I would like to understand why
the types are not matching which I don't.

[Bug ipa/95558] [9/10/11/12 Regression] Invalid IPA optimizations based on weak definition

2022-01-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95558

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> > Do weak aliases fall under some implicit ODR here?
> 
> The whole definition of "weak" is that it entitles you to make a definition
> that will be exempt from ODR, where a non-weak definition, if any, replaces 
> it.

I fixed recently bug in pure-const discovery which made us to add
attributes to weaks while it should not, so perhaps this bug is gone.
For non-inline functions we should handle those as interposable.
Concerning warning, I we make difference between may and must warnings.
I think this is still useful do be diagnosed for user to consider
(perhaps there is a reason why all overwritters has to be weak), so I
think we should porduce may form of warning.

I will take a look.

[Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103989

--- Comment #14 from hubicka at kam dot mff.cuni.cz ---
> 
> Sure - I just remember (falsely?) that we finally decided to do it :)

I do not recall this, but I may have forgotten :))

> If we don't run IPA inline we don't figure we failed to inline the
> always_inline either ;)  And IPA inline can expose more indirect
> alywas-inlines we only discover after even more optimization so the
> issue is really moot unless we sorry () (or link-fail).

Problem with kernel was that it relied on quite complicated indirect
inliing of always inlined and did not work without it.  At beggining I
think we should have introduced two attributes - always_inline and
disregard_inline_limits just like we have internally. Always_inline
should have never allowed public linkage or taking its address, but
it is probbly late to fix that :(

Honza

[Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103989

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> Yeah, and since we inline all always inline and also flatten during
> early inline the IPA inliner should really do nothing.

OK, can_inline_edge_p will do that but we will still walk the calls
which is bit of wasted effort.  Will look into that incrementally.
> 
> > It may be nice to also avoid re-analyzing functions completely to save
> > some compile time, but that may be bit tricky if we decide to do things
> > like cross-module always_inline.  I will look into that too, but perhaps
> > that can wait for next stage1.
> 
> I think we decided to have all always inline early and drop bodies now,
> didn't you patch it that way this stage1?
I think that gets into trouble i.e. with kernel calling always_inlines
indirectly. It is a mess...
> 
> IIRC the CCP was necessary for some odd reason I don't remember
> right now ;)

I would bet it was builtin_constat_p and inlining, so perhaps if we
completely ban late inlining ccp can go.
> 
> > Looking into what passes are in the pipeline I also noticed that
> > we could also probably skip late modref from -Og optimization pipeline.
> 
> Yes, I noticed it was there just now ...

I will make patch to drop it for trunk.  If we disable all optimization
the repeated pure-const seems pointless as well?

[Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103989

--- Comment #10 from hubicka at kam dot mff.cuni.cz ---
> And I'm intentionally not doing this because -Og should still remove
> abstraction during early inlining (for functions marked 'inline'), we
> just don't want to spend the extra compile time doing IPA inlining
> or cleaning up after IPA inlining.

Indeed it seemed bit too extreme to disable inlining completely at -Og :)
So you want early inliner to behave normally according to flags
while IPA inliner to skip all calls where either caller or
callee is -Og and callee is not always_inline?
This can be done in can_inline_edge_p. I will make patch for that.

It may be nice to also avoid re-analyzing functions completely to save
some compile time, but that may be bit tricky if we decide to do things
like cross-module always_inline.  I will look into that too, but perhaps
that can wait for next stage1.
> 
> but of course IPA inline size estimates are a bit off since we are not
> going to do any optimization on the inlined body.

We still do late ccp and other things, but indeed inline estimates
anticipate FRE to happen which it doesn't.

Looking into what passes are in the pipeline I also noticed that
we could also probably skip late modref from -Og optimization pipeline.
(I think David added it htere originally since we do pure-const).
I am thinking about retiring pure-const from pure-const discovery next
stage1 since modref should be monotonosly stronger doing that (and
if we add a stripped down mode of modref it should not be more expensive
than pure-const)

[Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103989

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> You can not disable an IPA pass becasuse then we will mishandle
> optimize attributes.  I think you simply want to set
> 
> flag_inline_small_functions = 0
> flag_inline_functions_called_once = 0 

Actually I forgot, we have flag_no_inline which makes
tree_inlinable_function_p to return false for everything except for
ALWAYS_INLINE and so we only want to set this one for Og.

[Bug tree-optimization/103989] [12 regression] std::optional and bogus -Wmaybe-unitialized at -Og since r12-1992-g6feb628a706e86eb

2022-01-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103989

--- Comment #7 from hubicka at kam dot mff.cuni.cz ---
> --- Comment #6 from Richard Biener  ---
> Honza, -Og was supposed to not do so much work, I intended to disable IPA
> inlining but there's no knob for that.  I wonder where to best put such
> guard?  I set flag_inline_small_functions to zero for -Og but we still
> run inline_small_functions ().  Basically -Og was supposed to only do
> early opts and then what is necessary for correct RTL expansion.  Doing
> IPA inlining defeats this :/
> 
> Can you help?  Is it safe to simply gate the inline_small_functions ()
> call?  Do we want an extra -f[no-]ipa-inline like we have -fearly-inlining?
> 
> Using -fdisable-ipa-inline gets rid of the diagnostic

You can not disable an IPA pass becasuse then we will mishandle
optimize attributes.  I think you simply want to set

flag_inline_small_functions = 0
flag_inline_functions_called_once = 0 

and we should only inline always_inlines. inline_small_functions will
still loop and check inlinability of functions but if everything is
compiled with -Og it will not find anything inlinable and exit.

Perhaps we may also extend initialize_inline_failed to add
CIF_DEBUG_OPTIMIZE so -Winline does say something more useufl then
"function not considered"

Honza

[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2022-01-11 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #42 from hubicka at kam dot mff.cuni.cz ---
on zen2 and 3 with -flto the speedup seems to be cca 12% for both -O2
and -Ofast -march=native which is both very nice!
Zen1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%

Overall it seems a win, but there are few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-spec benchmarks spec there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
it may be noise since kernels are tiny, but for example x293 reproduces
both on kabylake and zen by about 80-90% regression that may be easy to
track (the kernel is included in the testsuite). Same regression is not
seen on zen3, so may be an ISA specific or so.

FInally there seems relatively large code size savings on polyhedron
benchmarks today (8% on capacita, 

Thanks a lot!

[Bug ipa/103830] [12 Regression] null pointer access optimized away by removing function call at -Og

2022-01-04 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103830

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
> I think the recent modref change made the function const.
> 
> And no, we shouldn't DSE any volatile store and generally we don't.  It's
> probably some side-effect of modref that we do.  Using -fno-ipa-pure-const
> "fixes" this bug with -Og:
> 
>  local analysis of void MyClass::call()/1
>NULL memory access; terminating BB
> Function is locally const.
> callgraph:
> 
> so it's caused by the recent change to mitigate path-isolation damage to
> modref.

The change indeed assumes that with -fdelete-null-pointer-checks the
access to NULL is invalid no matter if it is volatile or normal.  I
would expect code having exception handlers at address 0 to be always
built with -fno-delete-null-pointer-checks.

If we want to preserve user defined volaitle NULL, is there way to stick
another flag on the memory accesses synthetised by isolate-paths to mark
them as OK to be optimized this way?

[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

2022-01-03 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #33 from hubicka at kam dot mff.cuni.cz ---
With the inliner tweaks (which I hope to get bit more aggressive this
week) we "solved" the wrf compile time with LTO by simply not building
the gigantic functions.  However we still have significant regressions
without LTO. 

Accroding to
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch
With spec2006 on kabylake with _O2 we have:
Test Name   gcc-6   gcc-7   gcc-8   gcc-9   gcc-10  gcc-11
gcc-trunk
SPECFP  256.464 3.59%   15.50%  19.29%  29.53%  33.44%  43.50%
SPECint 119.368 3.17%   13.60%  17.23%  14.17%  26.58%  33.58%
and spec2007
SPECFP  638.337 5.39%   21.20%  25.40%  50.18% 45.00%   58.72%
SPECint 217.977 4.03%   11.47%  16.17% 13.29%   22.52%  27.28%

Growing SPECFP -O2 build time by over 10% in one release is quite a lot
though it happened previously with gcc7->gcc8 and gcc9->gcc10 (if I remember
correctly gcc10 was caused by Fortran revisiting libgfortran API and
increasing binaries significantly).

The bigger increase in SPECfp compared to SPECint seems to be attributed
to wrf which regresses 122% compared to gcc6. The build time graph is
here
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=116.548.8=161.548.8=68.548.8=206.548.8=373.548.8=430.548.8=30.548.8;

So I think this is still about the most important compile time issue we
have.  I plan to look at the profile updating issues in threader and
hopefully get bit more faimilar with the code to possibly help here.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-23 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> > 
> > It could be done, but I was under impression that the sequence to load 1.0f
> > into topmost elements nullifies the benefit of operation to divide two
> 
> Sure, so perhaps we should somewhat increase the vectorization cost of 
> V2SFmode
> division so that we would use it only if it is part of longer sequences?

I wonder how the hardware implements it.  If divps is of similar latency
as divss then I guess it is essentially always win to load 1.0 to the
upper part, since it is slow operation.  On the other hand if divps is
about 4 times divss, then this may be harmful.

Agner Fog seems to be listing divss and divps with same latencies.
For zen it is 10 cycles which should be enough to do the setup.

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> recip pass happens after vectorization 
> I don't know/understand why though.
Yep, I suppose we want to either special case this in vectorizer or make
it earlier...  I also wonder why the code is vectorized for pairs of
values and third one is computed separately and why we don't use vectors
of length 4...

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
Created attachment 52042
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52042=edit
b.slp1

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> -E and remove not needed code.
> 
> > The
> > declaratoins are quite convoluted, but the function is well isolated and
> > easy to inspect from full one...
> 
> Do we speak about:
> https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
> ?
Yes.
> 
> It should be possible creating a synthetical test that does the same (and 
> lives
> in a loop, right?).

Well, I tried that for a while and got bit lost (either code got
vectorized by both gcc and clang or by neither).  There are more issues
where we have over 50% regression wrt clang build at gfx code, so I
think I will first try to reproduce those locally and perf them to see
if there is more pattern here.

The releavant code is:

uint32_t mozilla::gfx::{anonymous}::SpecularLightingSoftware::LightPixel
(struct SpecularLightingSoftware * const this, const struct Point3D & aNormal,
const struct Point3D & aVectorToLight, uint32_t aColor)
{

   [local count: 118111600]:
  _48 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.z;
  _49 = _48 + 1.0e+0;
  _50 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.y;
  _51 = _50 + 0.0;
  _52 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.x;
  _53 = _52 + 0.0;
  _80 = _53 * _53;
  _82 = _51 * _51;
  _83 = _80 + _82;
  _85 = _49 * _49;
  _86 = _83 + _85;
  if (_86 u>= 0.0)
goto ; [99.95%]
  else
goto ; [0.05%]

   [local count: 118052545]:
  _87 = .SQRT (_86);
  goto ; [100.00%]

   [local count: 59055]:
  _29 = __builtin_sqrtf (_86);

   [local count: 118111600]:
  # _30 = PHI <_29(4), _87(3)>
  _88 = _53 / _30;
  _89 = _51 / _30;
  _90 = _49 / _30;
  _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
  _39 = _41 * _88;
  _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
  _33 = _37 * _89;
  _27 = _33 + _39;
  _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
  _46 = _45 * _90;
  _47 = _27 + _46;
  if (_47 >= 0.0)
goto ; [59.00%]
  else
goto ; [41.00%]


With -Ofast it gets bit more streamlined:


   [local count: 118111600]:
  _48 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.z;
  _49 = _48 + 1.0e+0;
  _50 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.y;
  _51 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.x;
  powmult_78 = _51 * _51;
  powmult_80 = _50 * _50;
  _81 = powmult_78 + powmult_80;
  powmult_83 = _49 * _49;
  _84 = _81 + powmult_83;
  _85 = __builtin_sqrtf (_84);
  _86 = _51 / _85;
  _87 = _50 / _85;
  _88 = _49 / _85;
  _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
  _39 = _41 * _86;
  _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
  _33 = _37 * _87;
  _27 = _33 + _39;
  _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
  _46 = _45 * _88;
  _47 = _27 + _46;
  if (_47 >= 0.0)
goto ; [59.00%]
  else
goto ; [41.00%]

But I do not quite see in the slp dump why this is not considered for
vectorization.

I attach the dump.
Honza

[Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not

2021-12-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> Can you please attach a reduced test-case?
Do you know how to produce one with a reasonable effort?  The
declaratoins are quite convoluted, but the function is well isolated and
easy to inspect from full one...

[Bug ipa/103766] [12 Regression] Initialization of variable passed via static chain is lost.

2021-12-19 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103766

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> Even trying to find a Fortran testsuite friendly version is hard because the
> issue can only happen with print, I tried even doing internal write to a 
> string
> it is passing without any changes.
Yep, any reads that happen in the code visible to compiler will prevent
the bug from happening.   It is a hard case - but as I said I think
there is quite low chance of it going back.

[Bug ipa/103766] [12 Regression] Initialization of variable passed via static chain is lost.

2021-12-19 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103766

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> > I would welcome a testuite friendly version of the fortran testcase
> 
> Both Andrew and I failed to make a C reproducer - what about just taking the
> -fdump-tree-gimple, as input would that work ?
Me too.  The testcase depends on fnspec annotations produced by Fortran
so we get both noescape and noread to trigger the wrong code.  I guess
we may try to feed fnspec to gimple frontend but I am not even sure it
is possible, since fnspec is internal attribute

I suppose we can also live without a testcase - I hope I will not do
again such a stupid oversight.

[Bug ipa/103734] IPA-CP opportunity for imagick in SPECCPU 2017

2021-12-15 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
I think ipa-cp heuristics still needs some work.  It is nice that we got
it to do something, but I just checked and with LTO+PGO build of clang
it produces cca 30 clones that are not "for all known contexts", so it
seems that it is still quite strict on what it considers to clone at
least with FDO.

On the other hand we have PR103195 where tfft2 grows by 70% becuase
function is cloned with no great benefits.

I think one problem is that it is based on absolute time improvements.
This has problems because bigger functions run longer and more likely
to see bigger absolute improvement, but we are more interested in
relative improvement (i.e. duplicating very large fucntion to get 0.001%
speedup is less useful than duplicating small function that get 10%
speedup even if absolute number is same).

With profile feedback absolute numbers are OK since they should
translate to actual speedups of the whole program.  But then longer
trained programs will see bigger numbers than shorter trained numbers so
perhaps this should be relative to overall running time of the program?

[Bug fortran/103662] TBAA problem in Fortran FE triggering in gfortran.dg/unlimited_polymorphic_3.f03

2021-12-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103662

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> Can you explain in simple words why adding
> 
>   if (ptr1%k .ne. 42) print *
> 
> before the line
> 
>   if (ptr1%k .ne. 42) STOP 2
> 
> makes the test succeed, but adding it after that one the test fails?
> Or adding
> 
>   print *, ptr1%k
> 
> near the end of subroutine foo?
I am not sure why it depends whether it is before or after the stop.
Perhaps we thread the conditionals and optimize out print in some cases?
However the reason why print matters is because print call confuses the
simplistic escape analysis done in modref.  It now understand the fact
that calling an unknown function where all arguments are constant is
safe and can not make function local parameters escaped.  However it
punt on calling anything with parameter.  Before the patch the same
testcase fails if you replace STOP 2 by anything that terminate program
w/o calling function. For example by division by zero.

I have patch to use PTA escape analysis in modref but it hits other
problem with fortran FE and openMP - GOMP_parallel makes parameters
escape even if Fortran FE produces fnspec delcaring them nonescape...

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #26 from hubicka at kam dot mff.cuni.cz ---
> It's with LTO, I'll see if non-LTO has the same benefit.  In terms of 
> code-size
> it looks like it accounts for a 20% increase for binary size, but the hot
> function shrinks approx 6x. But of course this increase covers the entire
> binaries and there are other hot functions like GetVirtualPixelsFromNexus that
> also shrink significantly when specialized.
Thanks.  I will check this.  We mey need tweaking cost model to first
specialize the functions that reduce more (by doing multiple passes with
decreasing threshold)

It is really nice to see some real improvmeents from ipa-cp. For years
the pass did almost nothing on the benchmarks I tested (spec, Firefox,
clang) but it has changed.  I also see measurable speedups on clang
binary with ipa-cp on.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #24 from hubicka at kam dot mff.cuni.cz ---
> Awesome! thanks!
> 
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.

Interesting - this is Martin Jambor's area but I do not think we was
aware of this.  I wonder how bug growth is really needed - 80% is quite
a lot especially if imagemagick is already big binary.  Is this about
LTO or non-LTO build?

[Bug gcov-profile/103652] Producing profile with -O2 -flto and trying to consume it with -O3 -flto leads to ICEs on indirect call profiling

2021-12-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103652

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> 
> Well, I'm specifically speaking about:
> error: the control flow of function ‘BZ2_compressBlock’ does not match its
> profile data (counter ‘arcs’) 
> 
> this type of errors should not happen even in a multi-threaded programs.

There are some cases where I see even those on clang build - I am not
sure how that happens (if it is configury difference or generated code
or gcc bug) It is on my TODO to analyse...

In any case we should never ICE on malformed gcda files. Especially not
by buffer overflow :)
> 
> > I think you can produce testcase easily by making a function with one
> > indirect call for train run and many indirect calls in profile-use run.
> > 
> > I have patch to avoid the buffer overflow - can send it after getting to
> > office.
> 
> Sure, please send it.
Attached.

Honza

[Bug gcov-profile/103652] Producing profile with -O2 -flto and trying to consume it with -O3 -flto leads to ICEs on indirect call profiling

2021-12-13 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103652

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103652
> 
> --- Comment #1 from Martin Liška  ---
> (In reply to Jan Hubicka from comment #0)
> > Building clang in the funny way (training with -O2 -flto -fprofile-generate)
> > and use with -O3 -flto -fprofile-generate I get ICE here:
> 
> Do you mean, -O3 -flto -fprofile-use, right?
> 
> I would have expected some -Werror=coverage-mismatch errors. Can you please
> create a smaller reproducer where you see the error?
You need to disable those to build multithreaded programs like clang.  I
think you can produce testcase easily by making a function with one
indirect call for train run and many indirect calls in profile-use run.

I have patch to avoid the buffer overflow - can send it after getting to
office.

[Bug ipa/103601] [12 Regression] ICE in insert_kill, at ipa-modref-tree.c:84 since r12-5244-g64f3e71c302b4a13

2021-12-10 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103601

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
Thanks Roger and Andrew!  It was on my TODO for weekend and I am very
happy you beat me :)

[Bug ipa/103636] [12 Regression] Clang build fails with -flto -fno-strict-aliaisng -flifetime-dse=1 -fprofile-generate

2021-12-10 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103636

--- Comment #7 from hubicka at kam dot mff.cuni.cz ---
I use
cmake -G "Unix Makefiles" /home/jh/llvm-project/llvm
-DCLANG_TABLEGEN=/home/jh/llvm-project/llvm/out/stage1/bin/clang-tblgen
-DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=/home/jh/trunk-install/bin/g++
-DCMAKE_C_COMPILER=/home/jh/trunk-install/bin/gcc -DLL
VM_BINUTILS_INCDIR=/home/jh/binutils-install/include/ -DLLVM_BUILD_RUNTIME=No
-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;lld" -DLL
VM_TABLEGEN=/home/jh/llvm-project/llvm/out/stage1/bin/llvm-tblgen
-DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_C_FLAGS_RELEASE="-O2 -fno-s trict-aliasing 
-flifetime-dse=1 -flto=auto -fno-semantic-interposition -fprofile-generate
-DNDEBUG" -DCMAKE_CXX_FLAGS_RELEASE="- O2  -flto=auto  -fno-strict-aliasing
-flifetime-dse=1 -fno-semantic-interposition -fprofile-generate -DNDEBUG"

I already checked that with that hunk in can_inline_edge_p commented out
the clang build and training run works.
Sadly clang segfaults with either strict aliasing or lifetime-dse
enabled.

[Bug tree-optimization/103592] fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize

2021-12-06 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103592

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
> [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
note that fatigue2 is polyhedron, not spec...

[Bug tree-optimization/103409] [12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since r12-5228-gb7a23949b0dcc4205fcc2be6b84b91441faa384d

2021-12-01 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> I've fixed the threading slowdown.  Can someone verify and close this PR if 
> all
> the slowdown has been accounted for?  If not, then someone needs to explore 
> any
> slowdown unrelated to the threader.
The plots linked from the PR are live, so they should come back to
original speed (so far they did not).

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=226.548.8
and
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8

[Bug ipa/103441] [12 Regression] ICE in cgraph_node::verify_node() building libgo on powerpc64le-linux-gnu (--with-cpu=power9)

2021-11-26 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103441

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
> #0  gimple_set_bb (stmt=0x3fffb01a2be0, bb=0x0) at ../../gcc/gimple.c:1772
> #1  0x107209b0 in gsi_remove (i=0x3fffd7c8,
> remove_permanently=) at ../../gcc/gimple-iterator.c:569
> #2  0x10bddd3c in remove_bb (bb=0x3fffb0de2698) at
> ../../gcc/tree-cfg.c:2338
> #3  0x104aa8a0 in delete_basic_block (bb=) at
> ../../gcc/cfghooks.c:616
> #4  0x10c087f4 in cleanup_control_flow_pre () at
> ../../gcc/tree-cfgcleanup.c:979
> #5  0x10c0b090 in cleanup_tree_cfg_noloop (ssa_update_flags=0) at
> ../../gcc/tree-cfgcleanup.c:1073
> #6  cleanup_tree_cfg (ssa_update_flags=) at
> ../../gcc/tree-cfgcleanup.c:1183

It is not safe to do cleanup_tree_cfg in IPA transform pass since it
does not maintain cgraph edges (which are needed during inliing).
There is delete_unreachable_blocks_update_callgraph.

[Bug ipa/103432] [12 regression] libjxl-0.5 is miscompiled, works fine with -fno-ipa-modref

2021-11-26 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103432

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
Caused by stupid thinko (also present in gcc11). I compute right
min_flags but then use wrong value (without dereference applied).
I am testing the following.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index c2edc0d28a6..9e537b04196 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -4201,7 +4201,7 @@ update_escape_summary_1 (cgraph_edge *e,
  if (ee->direct && !em->direct)
min_flags = deref_flags (min_flags, ignore_stores);
  struct escape_entry entry = {em->parm_index, ee->arg,
-  ee->min_flags,
+  min_flags,
   ee->direct & em->direct};
  sum->esc.safe_push (entry);
}

[Bug tree-optimization/103423] [12 Regression] 19% cpu2006 wrf compile time regression with -flto since r12-2353-g8da8ed435e9f01b3

2021-11-26 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103423

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
> Oh, you are right, then it started with r12-2353-g8da8ed435e9f01b3.
OK so mine, (as I sort of suspected :)
If it is easy for you to get -ftime-report of before and after
build, it would be great to attach it.  I will try to reproduce this
with -fno-ipa-modref on current tree.

[Bug tree-optimization/103409] [12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since r12-3903-g0288527f47cec669

2021-11-25 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #6 from hubicka at kam dot mff.cuni.cz ---
> Started with r12-3903-g0288527f47cec669.
This is September change (for which we have PR102943) however the
regression range was g:1ae8edf5f73ca5c3 (or g:264f061997c0a534 on second
plot) and g:3e09331f6aeaf595 which is the latest regression visible on
the graphs appearing betwen Nov 12 and Nov 15.

The September regression is there too, but it is tracket as PR102943

[Bug tree-optimization/103423] [12 Regression] 19% cpu2006 wrf compile time regression with -flto since r12-3903-g0288527f47cec669

2021-11-25 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103423

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
Martin,
My original report here was on regression at July 17 2021 (range
g:0b7a11874d4eb428 and g:704e8a825c78b9a8)
which seems unrelated to g:r12-3903-g0288527f47cec669 
which is in Sep 21 2021

I think we are mixing up the cpu2006 and cpu2017 wrf's that seems to
regress on different times.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103423
> 
> Martin Liška  changed:
> 
>What|Removed |Added
> 
>See Also||https://gcc.gnu.org/bugzill
>||a/show_bug.cgi?id=103409
> 
> -- 
> You are receiving this mail because:
> You reported the bug.

[Bug tree-optimization/103409] [12 Regression] 18% WRF compile-time regression with -O2 -flto between g:264f061997c0a534 and g:3e09331f6aeaf595

2021-11-25 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> The two main changes during that time period was jump threading and modref.
> modref seems might be more likely with wrf being fortran code and even using
> nested functions and such.

Yep, I think both are possible.  There was also change enabling ipa-sra
on fortran.

There was yet another regression in wrf earlier that I think was related
to imroving flags propagation.  I think those are not by modref itself,
but triggers some other pass.  I wil try to look up the regression range
(it was before ranger got in).

[Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168

--- Comment #15 from hubicka at kam dot mff.cuni.cz ---
The patch passed testing on x86_64-linux.

[Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168

--- Comment #14 from hubicka at kam dot mff.cuni.cz ---
This is bit modified patch I am testing.  I added pre-computation of the
number of accesses, enabled the path for const functions (in case they
have memory operand), initialized alias sets and clarified the logic
around every_* and global_memory_accesses

PR tree-optimization/103168
(modref_summary::finalize): Initialize load_accesses.
* ipa-modref.h (struct modref_summary): Add load_accesses.
* tree-ssa-sccvn.c (visit_reference_op_call): Use modref
info to walk the virtual use->def chain to CSE pure
function calls.

* g++.dg/tree-ssa/pr103168.C: New testcase.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 4f9323165ea..595eb6e0d8f 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -725,6 +727,23 @@ modref_summary::finalize (tree fun)
break;
}
 }
+  if (loads->every_base)
+load_accesses = 1;
+  else
+{
+  load_accesses = 0;
+  for (auto base_node : loads->bases)
+   {
+ if (base_node->every_ref)
+   load_accesses++;
+ else
+   for (auto ref_node : base_node->refs)
+ if (ref_node->every_access)
+   load_accesses++;
+ else
+   load_accesses += ref_node->accesses->length ();
+   }
+}
 }

 /* Get function summary for FUNC if it exists, return NULL otherwise.  */
diff --git a/gcc/ipa-modref.h b/gcc/ipa-modref.h
index f868eb6de07..a7937d74945 100644
--- a/gcc/ipa-modref.h
+++ b/gcc/ipa-modref.h
@@ -53,6 +53,8 @@ struct GTY(()) modref_summary

   /* Flags coputed by finalize method.  */

+  /* Total number of accesses in loads tree.  */
+  unsigned int load_accesses;
   /* global_memory_read is not set for functions calling functions
  with !binds_to_current_def which, after interposition, may read global
  memory but do nothing useful with it (except for crashing if some
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr103168.C
b/gcc/testsuite/g++.dg/tree-ssa/pr103168.C
new file mode 100644
index 000..82924a3e3ce
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr103168.C
@@ -0,0 +1,24 @@
+// { dg-do compile }
+// { dg-options "-O2 -fdump-tree-fre1-details" }
+
+struct a
+{
+  int a;
+  static __attribute__ ((noinline))
+  int ret (int v) {return v;}
+
+  __attribute__ ((noinline))
+  int inca () {return a++;}
+};
+
+int
+test()
+{
+  struct a av;
+  av.a=1;
+  int val = av.ret (0) + av.inca();
+  av.a=2;
+  return val + av.ret(0) + av.inca();
+}
+
+/* { dg-final { scan-tree-dump-times "Replaced a::ret" 1 "fre1" } } */
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 149674e6a16..719f5184654 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -71,6 +71,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-loop-niter.h"
 #include "builtins.h"
 #include "fold-const-call.h"
+#include "ipa-modref-tree.h"
+#include "ipa-modref.h"
 #include "tree-ssa-sccvn.h"

 /* This algorithm is based on the SCC algorithm presented by Keith
@@ -5084,12 +5086,118 @@ visit_reference_op_call (tree lhs, gcall *stmt)
   struct vn_reference_s vr1;
   vn_reference_t vnresult = NULL;
   tree vdef = gimple_vdef (stmt);
+  modref_summary *summary;

   /* Non-ssa lhs is handled in copy_reference_ops_from_call.  */
   if (lhs && TREE_CODE (lhs) != SSA_NAME)
 lhs = NULL_TREE;

   vn_reference_lookup_call (stmt, , );
+
+  /* If the lookup did not succeed for pure functions try to use
+ modref info to find a candidate to CSE to.  */
+  const int accesses_limit = 8;
+  if (!vnresult
+  && !vdef
+  && lhs
+  && gimple_vuse (stmt)
+  && (((summary = get_modref_function_summary (stmt, NULL))
+  && !summary->global_memory_read
+  && summary->load_accesses < accesses_limit)
+ || gimple_call_flags (stmt) & ECF_CONST))
+{
+  /* First search if we can do someting useful and build a
+vector of all loads we have to check.  */
+  bool unknown_memory_access = false;
+  auto_vec accesses;
+
+  if (summary)
+   {
+ for (auto base_node : summary->loads->bases)
+   if (unknown_memory_access)
+ break;
+   else for (auto ref_node : base_node->refs)
+ if (unknown_memory_access)
+   break;
+ else for (auto access_node : ref_node->accesses)
+   {
+ accesses.quick_grow (accesses.length () + 1);
+ if (!access_node.get_ao_ref (stmt,  ()))
+   {
+ /* We could use get_call_arg (...) and initialize
+a ref based on the argument and unknown offset in
+some cases, but we have to get a ao_ref to
+disambiguate against other stmts.  */
+ unknown_memory_access = true;
+ 

[Bug driver/100937] configure: Add --enable-default-semantic-interposition

2021-11-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100937

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> (The -fno-semantic-interposition thing is probably the biggest performance gap
> between gcc -fpic and clang -fpic.)
Yep, it is often confusing to users (who do not understand what ELF
interposition is) that clang and gcc disagree on default flags here.
Recently -Ofast was extended to imply -fno-semantic-interposition that
will hopefully make more people notice this.

While doing that I have added per-symbol flag about interposition to the
symbol table, so we can also support 

__atttribute__ ((semantic_interposition))

and

__attribute__((no_semantic_interpoition))

if that would be useful for something.

[Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> unsigned p;
> unsigned __attribute__((noinline)) test (void)
> {
>   return p;
> }
> 
> modref analyzing 'test' (ipa=0) (pure)
>  - Analyzing load: p
>- Recording base_set=0 ref_set=0
>  - modref done with result: tracked.

Modref does not record accesses to global vars as known baes and just as
accesses relative to MODREF_UNKONWN_PARM (since ipa-reference does this)
So only useful info it collect from loads/stores is the alias set and
here it is 0 (why?). 
Then it drops the summary as useless since function is detected as pure
earlier and that holds the same info.

[Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-22 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> so indeed that's an issue.  So it's a bug fixed, not an optimization
> regression.

I know, but the bug was fixed in unnecesarily generous way preventing a
lot of valid tranforms (esnetially disabling CSE just because DSE is not
safe)

[Bug tree-optimization/103223] [12 regression] Access attribute dropped when ipa-sra is applied

2021-11-21 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103223

--- Comment #11 from hubicka at kam dot mff.cuni.cz ---
> Xeon(R) Platinum 8358 (IceLake) (64C 128T 512G):
> BenchMarks  Copies  RunTime1RunTime2Rate1   Rate2   
> Compare
> 548.exchange2_r 128 479 913 700 367 -47.57%
> 
> Xeon(R) Gold 6252 (CascadeLake) (48C 96T 192G)
> BenchMarks  Copies  RunTime1RunTime2Rate1   Rate2   
> Compare
> 548.exchange2_r 96  643 1240391 203 -48.08%

I filled in PR103227 to track this problem.  There seems to be two
issues visible on exchange2.  First is that ipa-sra changes order of
functions which in which inliner visits them and this makes difference
in inlining decisions. Second is that ipa-sra makes some constant
propagation info to be lost.  With Martin we look into this.

[Bug ipa/103227] [12 Regression] 58% exchange2 regression with -Ofast -march=native on zen3 since r12-5223-gecdf414bd89e6ba251f6b3f494407139b4dbae0e

2021-11-20 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227
> ... fixing this problem properly.
> I just loked into thi again and we already have code that preserves
> propagates bits on pointer parmeters (since these do not have value
> ranges). Same way we need to preserve the known partial aggregates and
> hook it up into sccvn's vn_reference_lookup_2 and _3.

Also note that with this implemented I think we should be able to remove
the ipa-cp transformation code :)

[Bug ipa/103227] [12 Regression] 58% exchange2 regression with -Ofast -march=native on zen3 since r12-5223-gecdf414bd89e6ba251f6b3f494407139b4dbae0e

2021-11-20 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227
> 
> --- Comment #7 from Martin Jambor  ---
> (In reply to hubicka from comment #5)
> > > I like the idea of transformation phases better than putting
> > > everything into tree-inline (and by extension ipa-param-manipulation)
> > > but perhaps we have to do aggregate constant replacements there too?
> > 
> > So the situation is that we inline call A->B (where both A and B are
> > clones of the main function) and while we place uses of the constant
> > parmater in A we miss replacement in B because transform is not run on
> > it.
> 
> No, we miss it everywhere, even in A (see that the code above is from
> BB 2) or probably also without any cloning whatsoever. This happens
> when IPA-SRA does its thing on the same parameter on which IPA-CP
> decided to propagate aggregate constants.
> 
> In the IPA analysis stage (which creates the virtual clones), IPA-CP
> runs before IPA-SRA.  But in the transformation phase, it is
> apparently the other way round - well, not exactly, IPA-SRA does not
> formally have a transformation phase, it happens as part of
> tree_function_versioning, but the effect is the same.

OK, so the problem is that we don't do ipa-sra changes "in place" as a
well behaved transform pass but it i merged into versioning code while
ipa-cp is the other way around.  So one fix would be alo to make ipa-cp
understand that changes to signature happened and update its summary
just like we do for modref?

We will need it also for ...
> 
> > 
> > I think proper solution here (discussed also few years ago) is to keep
> > the optimization summaries and teach value numbering to look up the
> > constant from the summary.
> > 
> 
> Yes, but this is another (but different) problem that we probably also
> should try to solve now.

... fixing this problem properly.
I just loked into thi again and we already have code that preserves
propagates bits on pointer parmeters (since these do not have value
ranges). Same way we need to preserve the known partial aggregates and
hook it up into sccvn's vn_reference_lookup_2 and _3.

Honza

[Bug ipa/103227] [12 Regression] 58% exchange2 regression with -Ofast -march=native on zen3 since r12-5223-gecdf414bd89e6ba251f6b3f494407139b4dbae0e

2021-11-19 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
> I like the idea of transformation phases better than putting
> everything into tree-inline (and by extension ipa-param-manipulation)
> but perhaps we have to do aggregate constant replacements there too?

So the situation is that we inline call A->B (where both A and B are
clones of the main function) and while we place uses of the constant
parmater in A we miss replacement in B because transform is not run on
it.

I think proper solution here (discussed also few years ago) is to keep
the optimization summaries and teach value numbering to look up the
constant from the summary.

We also have other situations where the existing transform pass fails to
pattern match and this lets us to feed other info, like value ranges to
the optimizer.

We have open PR somwhere for this problem, right?

[Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-18 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168

--- Comment #6 from hubicka at kam dot mff.cuni.cz ---
>   bool unknown_memory_access = false;
>   if (summary = get_modref_function_summary (stmt, NULL))
> {
>   /* First search if we can do someting useful.
>  Like for dse it is easy to precompute in the summary now
>  and will be happy to implement that.  */
>   for (auto base_node : summary->loads->bases)
> if (base_node->all_bases || unknown_memory_access)
>   {
> unknown_memory_access = true;
> break;
>   }
> else
>   for (auto ref_node : base_node->refs)
> if (ref_node->all_refs)
>   {
> unknown_memory_access = true;
> break;
>   }
> 
> /* Do the walking.  */
> if (!unknown_memory_access)
>   for (auto base_node : summary->loads->bases)
> for (auto ref_node : base_node->refs)
>   if (ref_node->all_refs)
> unknown_memory_access = true;
>   else
> for (auto access_node : ref_node->accesses)
>   if (access_node.get_ao_ref (stmt, )
> {
>   ref.base_alias_set = base_node->base;
>   ref.ref_alias_set = ref_node->base;
  ^^^ ref

[Bug tree-optimization/103168] Value numbering for PRE of pure functions can be improved

2021-11-18 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103168
> 
> --- Comment #4 from Richard Biener  ---
> (In reply to Jan Hubicka from comment #3)
> > This is simple (and fairly common) case we don't optimize
> > struct a {
> > int a;
> > static __attribute__ ((noinline))
> > int ret (int v) {return v;}
> > 
> > __attribute__ ((noinline))
> > int inca () {return a++;}
> > };
> > int
> > test()
> > {
> > struct a av;
> > av.a=1;
> > int val = av.ret (0) + av.inca();
> > av.a=2;
> > return val + av.ret(0) + av.inca();
> > }
> > 
> > what ret is const however becaue it is in COMDAT group we only detect it as
> > pure which makes GVN to give up on proving that its value did not change
> > over av.a=2 store.  We could easily work this out from modref summary (which
> > says that function makes no memory load)
> 
> This case is a bit different since it just exposes we do not perform any
> sort of alias walking for calls in VN.  In fact even with modref we'd need
> to perform multiple walks with all stored "pieces" ao_refs.  At least that
> should be doable though.

Yep, it was my original intention to point out this :)
> 
> If you can provide a cut place to walk & create those ao_refs I could
> see to cook up the relevant VN bits.  But for next stage1 of course.
The following should work (pretty much same loop is in dse_optimize_call
but for stores instead of loads.)
  bool unknown_memory_access = false;
  if (summary = get_modref_function_summary (stmt, NULL))
{
  /* First search if we can do someting useful.
 Like for dse it is easy to precompute in the summary now
 and will be happy to implement that.  */
  for (auto base_node : summary->loads->bases)
if (base_node->all_bases || unknown_memory_access)
  {
unknown_memory_access = true;
break;
  }
else
  for (auto ref_node : base_node->refs)
if (ref_node->all_refs)
  {
unknown_memory_access = true;
break;
  }

/* Do the walking.  */
if (!unknown_memory_access)
  for (auto base_node : summary->loads->bases)
for (auto ref_node : base_node->refs)
  if (ref_node->all_refs)
unknown_memory_access = true;
  else
for (auto access_node : ref_node->accesses)
  if (access_node.get_ao_ref (stmt, )
{
  ref.base_alias_set = base_node->base;
  ref.ref_alias_set = ref_node->base;
  ... do refwalking
}
  else if (get_call_arg (stmt))
... we do not know offset but can still
walk using ptr_deref_may_alias_ref_p_1
  else
unknown_memory_access = true;
... unlikely to happen (for example when number
of args in call stmt mismatches the actual function body)
 }
  if (unknown_memory_access)
... even here walking makes sense to skip stores to
that passes !ref_maybe_used_by_call_p
  ... do walk for function arguments passed by value
> 
> -- 
> You are receiving this mail because:
> You reported the bug.

[Bug ipa/103277] [12 Regression] ICE in branch_prob with -O1 -fbranch-probabilities -fno-ipa-pure-const since r12-5236-g5aa91072e24c1e16

2021-11-18 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103277

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> Btw. started with r12-5236-g5aa91072e24c1e16.
Yep, I know - it is modref based DSE that lets us to enable that call as
dead.  So the bug is technically mine if Richi decides to pass it to me
:)

We need to keep track if cleanup_cfg is needed by testing whether
removed call had multpile exit edges, right?

[Bug tree-optimization/103300] [12 Regression] wrong code at -O3 on x86_64-linux-gnu

2021-11-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103300

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
Needs -O2  -floop-unroll-and-jam   --param early-inlining-insns=14
to fail, so I guess it may be issue with unrol-and-jam.

[Bug ipa/103246] [12 Regression] 416.gamess miscompare with -O2 -g -flto=auto since r12-5223-gecdf414bd89e6ba251f6b3f494407139b4dbae0e

2021-11-17 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103246

--- Comment #14 from hubicka at kam dot mff.cuni.cz ---
> Thanks! Great you found it so quickly.
It is bit stupid code since everything is duplicated twice (for LTO and
non-LTO). I have to refactor it: we could have common base of the two
summaries and handle most of things in unified way. The problem is that
gengtype does not like it, so it will probably need manual GGC
annotations.

It stil leaves us with a wrong code, right?

[Bug ipa/101941] [12 Regression] Linux kernel build failure due to retaining fnsplit fragment with __attribute__((__error__))

2021-11-16 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101941

--- Comment #19 from hubicka at kam dot mff.cuni.cz ---
> > * special case function splitting such that a BB that contains a function
> > call which has either warning or error attribute on it; not to split out to
> > a different function.
> 
> Something like this attachment.

I think we should simply drop those flags after splitting since that
preserves the intended semantics?

Honza

[Bug tree-optimization/103266] [12 regression] llvm-13 miscompilation: __builtin_assume_aligned causes over-aggressive dce

2021-11-16 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103266

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
> I think 'X' means simply not dereferenced or escaping since this was all
> PTA based.  'S' would still eventually allow escaping.  But yes, PTA
> simply takes '1' literally.  So the patch below is IMHO too pessimizing.
> Can you please fixup modref instead?

If X is not meant to be "completely unused" (that is bit useless for
anotating bulitins I think since most of them shoul dbe sane) but all
the other flags together, we should update docs and eaf_flags production
which would fix the issue too.

[Bug ipa/103267] Wrong code with ipa-sra

2021-11-16 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103267

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> @@ -1,4 +1,3 @@
> -static int
>  __attribute__ ((noinline,const))
>  infinite (int p)
>  {
Just for a record, it crahes with or without static int here for me :)

I run across it because the code tracking must access in ipa-sra is IMO
conceptually wrong.  I noticed that because ipa-modref solves similar
problem for kills (both need to verify that given access will always
happen).  The post-dominance check is not enough to verify that because
earlier function calls can do things like EH.  I failed to construct an
actual testcase because on interesting stuff like EH we punt for other
reasons (missed fnspec annotations on EH builtins).  I will play with it
more today.

[Bug ipa/103267] Wrong code with ipa-sra

2021-11-16 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103267

--- Comment #6 from hubicka at kam dot mff.cuni.cz ---
Aha, but here is better example (reproduces same way).
In the former one I forgot const attribute which makes it invalid.
The testcase tests that ipa-sra is missing ECF_LOOPING_CONST_OR_PURE
check

static int
__attribute__ ((noinline))
infinite (int p)
{
  if (p)
while (1);
  return p;
}
__attribute__ ((noinline))
static void
test(int p, int *a)
{
  int v = infinite (p);
  if (*a && v)
__builtin_abort ();
}
test2(int *a)
{
  test(0,a);
}
main()
{
  test (1,0);
}

[Bug ipa/103267] Wrong code with ipa-sra

2021-11-16 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103267

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
Works for me even with the 3 warnings.

hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ cat >tt.c
__attribute__ ((noinline,const))
infinite (int p)
{
  if (p)
while (1);
  return p;
}
__attribute__ ((noinline))
static void
test(int p, int *a)
{
  int v = infinite (p);
  if (*a && v)
__builtin_abort ();
}
test2(int *a)
{
  test(0,a);
}
main()
{
  test (1,0);
}
hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ ./xgcc --version
xgcc (GCC) 12.0.0 2024 (experimental)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ ./xgcc -B ./ -O2 tt.c
tt.c:2:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
2 | infinite (int p)
  | ^~~~
tt.c:16:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
   16 | test2(int *a)
  | ^
tt.c:20:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
   20 | main()
  | ^~~~
hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ ./a.out
Segmentation fault

[Bug testsuite/103264] [12 regression] gcc.dg/tree-prof/merge_block.c fails after r12-5236

2021-11-15 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103264

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
What breaks in the testcase is updating profile after complete loop
unroling.  I suspect the unrolling is enabled by the extra DSE.

[Bug tree-optimization/103223] [12 regression] Access attribute dropped when ipa-sra is applied

2021-11-15 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103223

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103223
> 
> --- Comment #5 from Martin Sebor  ---
> (In reply to Martin Jambor from comment #4)
> > (In reply to Jan Hubicka from comment #0)
> ...
> > > Martin, I wonder if if you would be OK with simply dropping the access 
> > > when
> > > function signature changes (which I can prepare patch for) or do you want 
> > > to
> > > dive into updating it?
> > 
> > I would be OK with it but I don't think people who invested the energy
> > into these new security warnings would.
> 
> I replied earlier on gcc-patches: I've always intended the access attribute to
> eventually benefit optimization so please feel free (and encouraged :) to use
> it
> for that purpose.  The idea behind it was not just to catch bugs but also to
> enable optimizations based on the expectation that those bugs will have been
> fixed.  (This has to be done carefully since the attribute is also implicitly
> added in contexts where relying on it wouldn't correct for optimization; the
> attirbute API makes it possible to distinguish these cases.)

Is there some summary of what information the access attribute holds
in addition to what we have in fnspec (which is used for the
optimization)?
> 
> By dropping the attribute in IPA passes we would not only give up on detecting
> the bugs the IPA transformations expose but also on the optimization
> opportunities they might open up.

Yep, I hope it is not too hard to write code updating the attribute
after transformation.  We drop the attribute only when we actually do
some change on function parameters.

ipa-modref solves similar problem of updating the modref summaries after
changes.  What I do is to simply compute map translating old parameter
indexes to new via:

  clone_info *info = clone_info::get (node);

  FOR_EACH_VEC_SAFE_ELT (info->param_adjustments->m_adj_params, i, p)
{
  int idx = info->param_adjustments->get_original_index (i);
  if (idx > (int)max)
max = idx;
}

  auto_vec  map;

  map.reserve (max + 1);
  for (i = 0; i <= max; i++)
map.quick_push (-1);
  FOR_EACH_VEC_SAFE_ELT (info->param_adjustments->m_adj_params, i, p)
{
  int idx = info->param_adjustments->get_original_index (i);
  if (idx >= 0)
map[idx] = i;
}

and then I simply rewrite old parameter indexes to new and drop parts of
info that uses parameters that was either removed or replaced by
scalars/components.

Similarly we can do for access attribute.

[Bug tree-optimization/103231] ICE (nondeterministic) on valid code at -O1 on x86_64-linux-gnu: Segmentation fault

2021-11-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103231

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
> [659] % 
> [659] % gcctk -O0 -w small.c
> [660] % 
> [660] % gcctk -O1 -w small.c
> [661] % gcctk -O1 -w small.c
> [662] % gcctk -O1 -w small.c
> gcctk: internal compiler error: Segmentation fault signal terminated program
> cc1
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See  for instructions.
Backtrace here would be useful.  It is bit strange that you did not get
it from error message.  One can use -S -wrapper gdb,--args to make the
cc1 executed within gdb.
> [663] %

[Bug ipa/103230] ipa-modref-tree.h:550:33: runtime error: load of value 255, which is not a valid value for type 'bool'

2021-11-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103230

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103230
> 
> --- Comment #2 from Martin Liška  ---
> > How do you build ubsan compiler?
> 
> F="-O0 -g -fsanitize=undefined" ; make -j16 all-host -k CFLAGS="$F"
> CXXFLAGS="$F"  LDFLAGS="$F"
> 
> is the fastest approach.
Thanks, it is similar to what I tried.  I guess there should be no ";"
but yet it leds to misconfigured libiberty for me on kunlun.  I will
look into that.

[Bug ipa/103230] ipa-modref-tree.h:550:33: runtime error: load of value 255, which is not a valid value for type 'bool'

2021-11-14 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103230

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
> Happens with UBSAN compiler for:
> 
> $ gcc gcc/testsuite/gcc.c-torture/execute/pr71494.c -O1  -flto
> ...
> /home/marxin/Programming/gcc/gcc/ipa-modref-tree.h:550:33: runtime error: load
> of value 255, which is not a valid value for type 'bool'
> #0 0x18acc38 in modref_tree::merge(modref_tree*,
> vec*, modref_parm_map*, bool)
> /home/marxin/Programming/gcc/gcc/ipa-modref-tree.h:550
> #1 0x188452c in modref_propagate_in_scc

At 4385 I have:
   changed |= cur_summary_lto->stores->merge
(callee_summary_lto->stores, _map, _map, !first);

parm-map is the vector, however there is no read of it.
There is bool which is relevant only when parm_index is not unknown, so
I suspect it may a full copy with uninitialized bool which would be
harmless. We had similar issues with asan before.

How do you build ubsan compiler?
Honza

[Bug ipa/103211] [12 Regression] 416.gamess crashes after r12-5177-g494bdadf28d0fb35

2021-11-12 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103211

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103211
> 
> --- Comment #2 from Martin Liška  ---
> Optimized dump differs for couple of functions in the same way:
> 
> diff -u good bad
> --- good2021-11-12 17:42:36.995947103 +0100
> +++ bad 2021-11-12 17:41:56.728194961 +0100
> @@ -38,7 +38,6 @@
> 
>  ;; Function abrt (abrt_, funcdef_no=10, decl_uid=4338, cgraph_uid=11,
> symbol_order=10) (executed once)
> 
> -Removing basic block 5
>  __attribute__((fn spec (". ")))
>  void abrt ()
>  {
> @@ -350,7 +349,6 @@
>  void setfm (integer(kind=4) * ipar)
>  {
> [local count: 1073741824]:
> -  master.0.setfm (0, ipar_2(D)); [tail call]
>return;
> 
>  }
> 
> maybe the fnspec for master.0.setfm is bad?
> 
> __attribute__((fn spec (". R w ")))
> void master.0.setfm (integer(kind=8) __entry, integer(kind=4) * ipar)
> {
It looks more like pure/const discovery. You should be able to use
-fdump-ipa-all -fdump-tree-all and grep "function found to be" 
either pure or const.

What is body of master.0.setfm. Does it look like it does nothing?

"R" in fnspec means that arg 0 is only read directly and not derefernced.
"w" means that it arg 1 is not escaping.

Honza

[Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast on Zen since r12-4526-gd8edfadfc7a9795b65177a50ce44fd348858e844

2021-11-12 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102997

--- Comment #31 from hubicka at kam dot mff.cuni.cz ---
> It likely was the loop header copying missing on cold loops then.
Yep. It is good we worked that out.

[Bug tree-optimization/103195] [12 Regression] tfft2 text grows by 70% with -Ofast since r12-5113-gd70ef65692fced7a

2021-11-12 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103195

--- Comment #3 from hubicka at kam dot mff.cuni.cz ---
> > threader stuff would be my bet, but we need to bisect this (tfft2 is also
> > quite small)
> 
> Bad bet ;) It's caused by r12-5113-gd70ef65692fced7a.

Hehe, that was my guess yeterday.  Thanks for confirming.  So now to
figure out why better alias info makes tfft2 grow.  I guess more
vectorization?

Thanks for bisecting this

[Bug tree-optimization/103175] [12 Regression] internal compiler error: in handle_call_arg, at tree-ssa-structalias.c:4139

2021-11-11 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103175

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
The sanity check verifies that functions acessing parameter indirectly
also reads the parameter (otherwise the indirect reference can not
happen).  This patch moves the check earlier and removes some overactive
flag cleaning on function call boundary which introduces the non-sential
situation.  I got bit paranoid here on how return value relates to
escaping solution.

as discussed on ML, matmul failure is simply the fact that the testcase
verified missed optimization is still misssed.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 72006251f29..a97021c6c60 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1681,6 +1681,13 @@ modref_lattice::merge (int f)
 {
   if (f & EAF_UNUSED)
 return false;
+  /* Check that flags seems sane: if function does not read the parameter
+ it can not access it indirectly.  */
+  gcc_checking_assert (!(f & EAF_NO_DIRECT_READ)
+  || ((f & EAF_NO_INDIRECT_READ)
+  && (f & EAF_NO_INDIRECT_CLOBBER)
+  && (f & EAF_NO_INDIRECT_ESCAPE)
+  && (f & EAF_NOT_RETURNED_INDIRECTLY)));
   if ((flags & f) != flags)
 {
   flags &= f;
@@ -1874,27 +1881,13 @@ modref_eaf_analysis::merge_call_lhs_flags (gcall *call,
int arg,
argument if needed.  */

 static int
-callee_to_caller_flags (int call_flags, bool ignore_stores,
-   modref_lattice )
+callee_to_caller_flags (int call_flags, bool ignore_stores)
 {
   /* call_flags is about callee returning a value
  that is not the same as caller returning it.  */
   call_flags |= EAF_NOT_RETURNED_DIRECTLY
| EAF_NOT_RETURNED_INDIRECTLY;
-  /* TODO: We miss return value propagation.
- Be conservative and if value escapes to memory
- also mark it as escaping.  */
-  if (!ignore_stores && !(call_flags & EAF_UNUSED))
-{
-  if (!(call_flags & EAF_NO_DIRECT_ESCAPE))
-   lattice.merge (~(EAF_NOT_RETURNED_DIRECTLY
-| EAF_NOT_RETURNED_INDIRECTLY
-| EAF_UNUSED));
-  if (!(call_flags & EAF_NO_INDIRECT_ESCAPE))
-   lattice.merge (~(EAF_NOT_RETURNED_INDIRECTLY
-| EAF_UNUSED));
-}
-  else
+  if (ignore_stores)
 call_flags |= ignore_stores_eaf_flags;
   return call_flags;
 }
@@ -2033,15 +2026,9 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
  if (!(call_flags & (EAF_NOT_RETURNED_DIRECTLY
  | EAF_UNUSED)))
m_lattice[index].merge (~(EAF_NO_DIRECT_ESCAPE
- | EAF_NO_INDIRECT_ESCAPE
- | EAF_UNUSED));
- if (!(call_flags & (EAF_NOT_RETURNED_INDIRECTLY
- | EAF_UNUSED)))
-   m_lattice[index].merge (~(EAF_NO_INDIRECT_ESCAPE
  | EAF_UNUSED));
  call_flags = callee_to_caller_flags
-  (call_flags, false,
-   m_lattice[index]);
+  (call_flags, false);
}
  m_lattice[index].merge (call_flags);
}
@@ -2057,8 +2044,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
  !(call_flags & EAF_NOT_RETURNED_DIRECTLY),
  !(call_flags & EAF_NOT_RETURNED_INDIRECTLY));
  call_flags = callee_to_caller_flags
-  (call_flags, ignore_stores,
-   m_lattice[index]);
+  (call_flags, ignore_stores);
  if (!(ecf_flags & (ECF_CONST | ECF_NOVOPS)))
m_lattice[index].merge (call_flags);
}
@@ -2082,8 +2068,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
if (!(ecf_flags & (ECF_CONST | ECF_NOVOPS)))
  {
call_flags = callee_to_caller_flags
-(call_flags, ignore_stores,
- m_lattice[index]);
+(call_flags, ignore_stores);
if (!record_ipa)
  m_lattice[index].merge (call_flags);
else
@@ -2105,8 +2090,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name)
else
  {
call_flags = callee_to_caller_flags
-(call_flags, ignore_stores,
- m_lattice[index]);
+

[Bug ipa/103164] -fipa-pta degrades aliasing oracle for tramp3d

2021-11-10 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103164

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
Yep, it only shows that we want to run ipa-pta and local oracle in
parallel since ipa-pta can not be realistically assumed to subsume all
of local PTA  (for example due to being necessarily lossy or very slow
on larger programs)

[Bug ipa/97403] Ancestor jump function should be generalized

2021-11-10 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97403

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97403
> 
> --- Comment #3 from Martin Jambor  ---
> (In reply to Jan Hubicka from comment #2)
> > Martin,
> > I think we can close this (possibly adding the testcase)
> 
> Depends.  ANCESTOR got generalized a bit but the propagation at IPA-CP level
> still does not take place.  That is not a problem with jump functions though,
> for that we need:
> 
> 1) When propagating, look into "aggregate" constants when encountering a pass
> through function and and there is something useful there (there is in this
> case).  But we also need to somehow represent a reference to a simple constant
> (which is not a CONST_DECL) in the lattices.
> 
> 2) In the tree-replacement/transformation phase, find a mechanism to replace
> applicable dereferences of the parameter with the constant.
> 
> Ultimately, a future return function should also get you a constant
> (post-ipa-cp). 
> 
> We can file another bug with the same testcase though.
I guess we can keep this PR as a collection of stuff that could be
improved on aggregates...

Honza

[Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast on Zen since r12-4526-gd8edfadfc7a9795b65177a50ce44fd348858e844

2021-11-08 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102997

--- Comment #23 from hubicka at kam dot mff.cuni.cz ---
> We verify that by simply looking at the loop depth relation of
> the entry and exit of the path.

Which seem wrong for the path leaving loop and entering another...
> 
> > It seems to me that it should be also OK to thread if we thread all
> > entry edges to the loop (in particular if it is only one) which should
> > be easy to check by looking into other edges entering the path being
> > threaded and checking that they are from loop and not from outside
> > (since the outside edges will keep the old path alive and we end up with
> > duplicated entry point).
> > 
> > I may miss something obvious, but I would say that all other threads
> > involving loops should be fine and better informed than peeling since
> > threader knows that the path in question is special.
> 
> The issue is that the threader does not update loop info and it's
> important to keep things like ->simdlen associated to the correct
> loop and without distorting the loop in a way that invalidates such
> info (pre loop optimization).  The forward threader contained some
> code to keep loops up-to-date but as usual the backwards threader
> is just broken with respect to this.
> 
> So yes, any threading that enters and leaves a loop is OK, but
> a threading that rotates or peels a loop [not completely] is not
> trivially so.

I see. Did not think of that.
Perhaps something to solve incrementally next stage1

Honza

[Bug tree-optimization/103117] uncprop produces harder to analyze but not better code

2021-11-08 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103117

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> > I don't know - this way we have separate dumps etc.  I think mistake was
> > scheduling pure-const and later modref too late.
> 
> Maybe.  If you move them please put a comment before uncprop that it
> should be last before RTL expansion.
Will do, thanks!
Honza

[Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast on Zen since r12-4526-gd8edfadfc7a9795b65177a50ce44fd348858e844

2021-11-08 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102997

--- Comment #21 from hubicka at kam dot mff.cuni.cz ---
> to also allow to thread through a loop path not crossing the latch but
> at least for the issue of "breaking loops" the loops_crossed stuff shouldn't
> be necessary.  It might still prevent us from doing such transforms when
> it involves peeling.

I am not sure what is the set of tranforms we want to forbid in general.
I think main situation we want to prevent is turning single-entry loop
to irreducible loop by forwarding loop entry edges differently.

For that it should be safe to trhead if the path enters a loop and later
leaves. For that we need to keep track of loops we entered in stack and
pop them when we leave it and check that stack is empty at the end.  (or
do stackless bookeeping but it may happen that we start in one loop,
leave it and then enter different loop and stop in it which is stil
unsafe so just comparing loop nest of entry and exit block is not right
thing to do)

It seems to me that it should be also OK to thread if we thread all
entry edges to the loop (in particular if it is only one) which should
be easy to check by looking into other edges entering the path being
threaded and checking that they are from loop and not from outside
(since the outside edges will keep the old path alive and we end up with
duplicated entry point).

I may miss something obvious, but I would say that all other threads
involving loops should be fine and better informed than peeling since
threader knows that the path in question is special.

[Bug tree-optimization/103117] uncprop produces harder to analyze but not better code

2021-11-08 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103117

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> I suppose modref could (for pointer returns) use ranger to query its range
> and see if it ever is non-NULL?  I'm not sure if we reliably propagate
> null pointer constants everywhere.

I think we simply want to reorder the passes here...
> 
> Btw, uncprop is supposed to run right before RTL expansion - it is in fact
> an out-of-SSA optimization, so even removing it as separate pass and
> directly calling it from rewrite_out_of_ssa after eliminate_useless_phis
> might be an improvement.

I don't know - this way we have separate dumps etc.  I think mistake was
scheduling pure-const and later modref too late.

Thanks,
Honza

[Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast on Zen since r12-4526-gd8edfadfc7a9795b65177a50ce44fd348858e844

2021-11-08 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102997

--- Comment #19 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102997
> 
> --- Comment #18 from Aldy Hernandez  ---
> 
> > If I read it correctly, for a path that enters the loop and later leaves
> > it (where threading is desirable since we skip the whole loop) the logic
> > above will still return true (after finishing the whole walk which seems
> > like a waste).
> 
> Perhaps Richi can comment here, since this is his code?

The testcase would be

void test ()
{
int i;
if (test())
  i=0;
else
  i=100;
for (;i<100 && bar ();i++)
do_work ();
}

Here I think we can thread else path to return w/o any issues with loop
nests even with -Os.

> 
> > 
> > This may trigger more often at -Os since we limit loop header copying.
> 
> Interesting.  We have PR102906 which is a code size increase at -Os on ARM for
> the same commit.  I wonder if it's related.
> 
> > 
> > And indeed, fixing profile updating would be nice.  Why the updating
> > code is not reused across different threaders?  (I wrote several thread
> > updating functions for varioius threaders introduced & remoed in the
> > past and I wonder why we need to keep reinventing it)
> 
> This all predates me.  I don't know.  For the record, both threaders use
> completely different BB copiers and update mechanisms.  It could be they're
> sufficiently different that reusing it is unfeasible ??.

Updating after trheading is not too hard however there is a painful
point that profile may disagree with fact you proved. In general you do
 1) you compute count of path your are threading (i.e. all edges you
 redirect to destination)
 2) all BBs along the path needs to have this count subtracted since
 they are not going to be executed on it anymore
 3) all conditionas on the path needs probabilities updated: you know
 the count entering the conditional and with same count you exit it via
 a particular edge.
 However here one needs to take into account that especially for guessed
 profiles (but for feedback too if code was duplicated earlier) the
 branch probability may be in a way that the leaving edge count is less
 than the count you want to subtract.

Honza

[Bug middle-end/102997] [12 Regression] 45% 454.calculix regression with LTO+PGO -march=native -Ofast on Zen since r12-4526-gd8edfadfc7a9795b65177a50ce44fd348858e844

2021-11-08 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102997

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
Note that it still seems to me that the crossed_loop_header handling is
overly conservative.  We have:

@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths
(vec )
   bool seen_latch = false;
   int loops_crossed = 0;
   bool crossed_latch = false;
+  bool crossed_loop_header = false;
   // Use ->dest here instead of ->src to ignore the first block.  The
   // first block is allowed to be in a different loop, since it'll be
   // redirected.  See similar comment in profitable_path_p: "we don't
@@ -2804,6 +2805,14 @@ jt_path_registry::cancel_invalid_paths
(vec )
  ++loops_crossed;
}

+  // ?? Avoid threading through loop headers that remain in the
+  // loop, as such threadings tend to create sub-loops which
+  // _might_ be OK ??.
+  if (e->dest->loop_father->header == e->dest
+ && !flow_loop_nested_p (exit->dest->loop_father,
+ e->dest->loop_father))
+   crossed_loop_header = true;
+
   if (flag_checking && !m_backedge_threads)
gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
 }
@@ -2829,6 +2838,21 @@ jt_path_registry::cancel_invalid_paths
(vec )
   cancel_thread (, "Path crosses loops");
   return true;
 }
+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+  && !flow_loop_nested_p (exit->src->loop_father,
+ entry->dest->loop_father))
+{
+  cancel_thread (, "Path rotates loop");
+  return true;
+}
+  if (crossed_loop_header)
+{
+  cancel_thread (, "Path crosses loop header but does not exit it");
+  return true;
+}
   return false;
 }

If I read it correctly, for a path that enters the loop and later leaves
it (where threading is desirable since we skip the whole loop) the logic
above will still return true (after finishing the whole walk which seems
like a waste).

This may trigger more often at -Os since we limit loop header copying.

And indeed, fixing profile updating would be nice.  Why the updating
code is not reused across different threaders?  (I wrote several thread
updating functions for varioius threaders introduced & remoed in the
past and I wonder why we need to keep reinventing it)

[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

2021-11-07 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #27 from hubicka at kam dot mff.cuni.cz ---
> 
> This PR is still open, at least for slowdown in the threader with LTO.  The
> issue is ranger wide, so it may also cause slowdowns  on non-LTO builds for
> WRF, though I haven't checked.
I just wanted to record the fact somewhere since I was looking up the
revision range mostly to figure out if there was modref change that may
cause this.

Non-lto builds seems fine.  I suppose LTo is needed ot make bug enough
CFGs.  Thanks for looking into it.

Honza

[Bug fortran/103058] [12 Regression] ICE in gimple_call_static_chain_flags, at gimple.c:1669 when building 527.cam4_r

2021-11-05 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103058

--- Comment #10 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103058
> 
> --- Comment #9 from Martin Liška  ---
> And WPA cgraph dump tells:
> 
> quick_sort_1.1/213 (quick_sort_1) @0x774c2550
>   Type: function definition analyzed
>   Visibility: prevailing_def_ironly
>   References: 
>   Referring: 
>   Read from file: quicksort.fppized.o
>   Availability: local
>   Unit id: 2
>   Function flags: count:19391457 (estimated locally) local
>   Called by: __quicksort_MOD_quick_sort/212 (1073741824 (estimated
> locally),1.00 per call) quick_sort_1.1/213 (4287451 (estimated locally),0.22
> per call) quick_sort_1.1/213 (4287451 (estimated locally),0.22 per call) 
>   Calls: quick_sort_1.1/213 (4287451 (estimated locally),0.22 per call)
> quick_sort_1.1/213 (4287451 (estimated locally),0.22 per call)
> interchange_sort.0/211 (6399181 (estimated locally),0.33 per call) 
> 
> So the DECL_EXTERNAL should be false, right?
This is due to partitining which exports the symbol between WPA and
ltrans. Still at ltrans time binds_to_currrent_def should
return true because we have prevailing_def_ironly resolution info.

I will write some verifier on this.
Honza

[Bug ipa/103073] [12 Regression] ICE in insert_access, at ipa-modref-tree.h:578 since r12-4401-gfecd145359fc981b

2021-11-05 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103073

--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > diff --git a/gcc/ipa-modref-tree.h b/gcc/ipa-modref-tree.h
> > index 9976e489697..1b51323175b 100644
> > --- a/gcc/ipa-modref-tree.h
> > +++ b/gcc/ipa-modref-tree.h
> > @@ -813,6 +818,20 @@ struct GTY((user)) modref_tree
> > 
> >  bool changed = false;
> > 
> > +/* We may end up with max_size being less than size for accesses past 
> > the
> > +   end of array.  Those are undefined and safe to ignore.  */
> > +if (a.range_info_useful_p ()
> > +   && ((known_size_p (a.size) && known_size_p (a.max_size)
> > +&& known_lt (a.max_size, a.size))
> 
> What about maybe_lt?  Well, you should know the ICEing place and
> whether it's sensitive to may or must ;)

The range merging really went funny way because max_size == 0
and I hope in that case we will always have known_lt.
But indeed I may need to proofread the merging logic for maybes (I even
do not know how to produce testcases with non-trivial polyints here).

Maybe_lt is IMO not safe since it may make us to ignore stores that are
valid on runtime (given that the variable length vector type is small
enough to fit in max_size).
> 
> > +   || (known_size_p (a.max_size)
> > +   && known_le (a.max_size, 0
> 
> The known_size_p (a.max_size) && known_le (a.max_size, 0) should never
> be true (there's only the -1 special value denoting 'unknown').

OK, I will add a sanity check since it seemed from the code earlier that
negative values may happen (which would be indeed sloppy if it was
happening).
> 
> Yeah, apart from the remark above.

Thanks. I will update the patch.
Honza
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.
> You are on the CC list for the bug.

[Bug ipa/103073] [12 Regression] ICE in insert_access, at ipa-modref-tree.h:578 since r12-4401-gfecd145359fc981b

2021-11-05 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103073

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103073
> 
> --- Comment #10 from Martin Liška  ---
> > This bootstraps/regtests and fixes the testcase.  Does it look sane to
> > you?
> 
> Note this ended in bugzilla and not in gcc-patches ML.

Well, I am not proposing it for gcc-patches yet because from Richard
comment on this PR I have feeling that we are not quite in mutual
understanding about what really happens here yet.

[Bug ipa/103073] [12 Regression] ICE in insert_access, at ipa-modref-tree.h:578 since r12-4401-gfecd145359fc981b

2021-11-05 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103073

--- Comment #8 from hubicka at kam dot mff.cuni.cz ---
> Well, the usual thing to do is to check max_size_known_p () and
> if maybe_ne (max_size, size) then use [offset, max_size] for 
> disambiguation.  I think for modref you can do the same - if max size
> is known then use [offset, max_size], otherwise you have to punt.  You
> shouldn't need 'size' at all, 'size' is when you are looking for
> must-defs.

While disambiguating ref with decl we also check if size is greater than
size of decl and in that case we disambiguate.  So tracking sizes helps
little bit even if not checking for kills.

I plan to do also kills using modrefs. This helps to propagate clobber
inter-procedurally. One simply needs one extra flag tracking if store
must be executed before function returns (I have patch for this).

Hoever still I am convinced I can simply ignore the range here since
from VRP we know it will be undefined if ever executed as follows:

diff --git a/gcc/ipa-modref-tree.h b/gcc/ipa-modref-tree.h
index 9976e489697..1b51323175b 100644
--- a/gcc/ipa-modref-tree.h
+++ b/gcc/ipa-modref-tree.h
@@ -813,6 +818,20 @@ struct GTY((user)) modref_tree

 bool changed = false;

+/* We may end up with max_size being less than size for accesses past the
+   end of array.  Those are undefined and safe to ignore.  */
+if (a.range_info_useful_p ()
+   && ((known_size_p (a.size) && known_size_p (a.max_size)
+&& known_lt (a.max_size, a.size))
+   || (known_size_p (a.max_size)
+   && known_le (a.max_size, 0
+  {
+   if (dump_file)
+ fprintf (dump_file,
+  "   - Paradoxical ragne. Ignoring\n");
+   return false;
+  }
+
 /* No useful information tracked; collapse everything.  */
 if (!base && !ref && !a.useful_p ())
   {

Similarly we could detect this as undefined effect and turn to
trap/unreachable somewhere if we care.

This bootstraps/regtests and fixes the testcase.  Does it look sane to
you?

Honza

[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

2021-11-04 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #19 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> Aldy Hernandez  changed:
> 
>What|Removed |Added
> 
>  Depends on||103058
> 
> --- Comment #18 from Aldy Hernandez  ---
> 251.wrf_r is no longer building.  Seems to be the same issue in PR103058.
> 
> during GIMPLE pass: alias
> module_fr_fire_phys.fppized.f90: In function 'init_fuel_cats':
> module_fr_fire_phys.fppized.f90:136:25: internal compiler error: in
> gimple_call_static_chain_flags, at gimple.c:1669
>   136 | subroutine init_fuel_cats
>   | ^
> 0x6957b5 gimple_call_static_chain_flags(gcall const*)
> /home/aldyh/src/clean/gcc/gimple.c:1669

I have commited workaround for this.
However here it looks like a frontend issue - I do not think Fortran
should produce nested functions with external linkage. At least there
seems to be no good reason for doing so since they can not be called
cross-module.

Honza

[Bug ipa/103058] [12 Regression] ICE in gimple_call_static_chain_flags, at gimple.c:1669 when building 527.cam4_r

2021-11-04 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103058

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
Hi,
I am testing the following to unbreak fortran.
However the real bug is that binds_to_current_def should work on whole
WPA and be independent of partitioning.  I remember I had patch fixing
that. Will look it up.

Honza

diff --git a/gcc/gimple.c b/gcc/gimple.c
index 76768c19c8e..7a578f5113e 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -1666,7 +1666,18 @@ gimple_call_static_chain_flags (const gcall *stmt)
  int modref_flags = summary->static_chain_flags;

  /* We have possibly optimized out load.  Be conservative here.  */
- gcc_checking_assert (node->binds_to_current_def_p ());
+ if (!node->binds_to_current_def_p ())
+   {
+ if ((modref_flags & EAF_UNUSED) && !(flags & EAF_UNUSED))
+   {
+ modref_flags &= ~EAF_UNUSED;
+ modref_flags |= EAF_NOESCAPE;
+   }
+ if ((modref_flags & EAF_NOREAD) && !(flags & EAF_NOREAD))
+   modref_flags &= ~EAF_NOREAD;
+ if ((modref_flags & EAF_DIRECT) && !(flags & EAF_DIRECT))
+   modref_flags &= ~EAF_DIRECT;
+   }
  if (dbg_cnt (ipa_mod_ref_pta))
flags |= modref_flags;
}

  1   2   >