Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

Changbin Du via Gcc-bugs Tue, 01 Aug 2023 05:33:33 -0700

On Tue, Aug 01, 2023 at 10:44:02AM +0200, Jan Hubicka wrote:
> > > If I comment it out as in the patch above, then O3 and PGO get 16% and 12%
> > > performance improvements, respectively, compared to O2 on x86.
> > >
> > >                         O2              O3              PGO
> > > cycles                  2,497,674,824   2,104,993,224   2,199,753,593
> > > instructions            10,457,508,646  9,723,056,131   10,457,216,225
> > > branches                2,303,029,380   2,250,522,323   2,302,994,942
> > > branch-misses           0.00%           0.01%           0.01%
> > >
> > > The main difference in the compiler output for the code around the
> > > mispredicted branch is:
> > >   o In O2: a predicated instruction (cmov here) is selected to eliminate
> > >     the above branch. cmov is truly better than a branch here.
> > >   o In O3/PGO: bitout() is inlined into encode_file(), and a branch
> > >     instruction is selected. But this branch is obviously *unpredictable*
> > >     and the compiler doesn't know it. This is why O3/PGO are so bad for
> > >     this program.
> > >
> > > GCC doesn't support __builtin_unpredictable(), which was introduced by
> > > LLVM. So I tried to see whether __builtin_expect_with_probability(e, x, 0.5)
> > > can serve the same purpose. The result is negative.
> > 
> > But does it appear to be predictable with your profiling data?
> 
> Also, one thing is that __builtin_expect and
> __builtin_expect_with_probability only affect the static branch
> prediction algorithm, so with profile feedback they are ignored on every
> branch executed at least once during the training run.
> 
> Setting probability 0.5 is really not the same as a hint that the
> branch will be mispredicted, since modern CPUs handle regularly
> behaving branches well (such as a branch firing every even iteration of a loop).
>
Yeah. Setting probability 0.5 is just an experimental attempt. I don't know
how the heuristic works internally.
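Honza's point that a 0.5 probability is not the same as "unpredictable" can be illustrated with a small sketch. The helpers below are hypothetical, not code from the benchmark: both bit patterns make the branch in count_taken() fire about half the time, so the same __builtin_expect_with_probability(..., 0.5) hint (GCC 9 and later) describes both. Yet a modern CPU predicts the alternating pattern almost perfectly, while the PRNG-driven one defeats it. Comparing the two under something like `perf stat -e branch-misses` should show the difference, although the compiler may also if-convert a loop this simple, in which case neither version branches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical xorshift32 PRNG; the state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* 1,0,1,0,...: taken 50% of the time, but trivially predictable. */
static void fill_alternating(unsigned char *bits, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bits[i] = (unsigned char)(i % 2 == 0);
}

/* PRNG-driven bits: also roughly 50% ones, but with no pattern a branch
   predictor can learn, so a branch on them is frequently mispredicted. */
static void fill_random(unsigned char *bits, size_t n, uint32_t seed)
{
    uint32_t s = seed ? seed : 1u;  /* avoid the all-zero xorshift state */
    for (size_t i = 0; i < n; i++)
        bits[i] = (unsigned char)((xorshift32(&s) >> 16) & 1u);
}

/* The 0.5 hint is accurate for BOTH inputs above: it describes how often
   the branch is taken, not how predictable it is. */
static size_t count_taken(const unsigned char *bits, size_t n)
{
    size_t taken = 0;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect_with_probability(bits[i] != 0, 1, 0.5))
            taken++;
    }
    return taken;
}
```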


> So I think having the builtin is not a bad idea.  I was thinking about whether it
> makes sense to represent it within the profile_probability type, and I am
> not convinced, since "unpredictable probability" sounds conceptually
> odd and we would need to keep the flag intact over all probability
> updates we do.  For things like loop exits we recompute probabilities
> from frequencies after unrolling/vectorization and other things, and we
> would need to invent a new API to propagate the flag from the previous
> probability (which is not even part of the computation right now).
> 
> So I guess the challenge is how to pass this info down through the
> optimization pipeline, since we would need to annotate gimple
> conds/switches and carry that down to the RTL level.  On gimple we have
> flags and at the RTL level notes, so there is space for it, but we would
> need to maintain the info through CFG changes.
> 
> Auto-FDO may be an interesting way to detect such branches.
> 
So I suppose PGO could too. But in my test, PGO selects a branch instruction
just as O3 does, and the data shows that cmov works better than a branch here.
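Since cmov wins here, one possible source-level workaround is to write the select without a conditional jump. This is a minimal sketch assuming the hot branch is a plain two-way select; the real bitout() code is not quoted in this thread, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Branchy select: the compiler chooses between a conditional jump and a
   conditional move (cmov); at -O3 or with PGO it may pick the jump. */
static uint32_t select_branchy(int cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* Mask-based select: there is no conditional jump in the source, so it is
   much less likely to be compiled into one, whatever the heuristics say. */
static uint32_t select_mask(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = -(uint32_t)(cond != 0);  /* all ones if cond, else 0 */
    return (a & mask) | (b & ~mask);
}
```

The mask form relies on unsigned negation of 1 yielding all ones, which C guarantees. The trade-off is that both operands are always evaluated, which is only cheap when they are already-computed values.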

> Honza
> > 
> > > I think we can conclude that there must be something to improve in
> > > GCC's heuristics for choosing between predicated instructions and
> > > branches, at least for O3 and PGO.
> > >
> > > And can we add __builtin_unpredictable() support to GCC? It is usually
> > > hard for the compiler to detect unpredictable branches on its own.
> > >
> > > --
> > > Cheers,
> > > Changbin Du
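Until GCC provides such a builtin, portable code can use Clang's __builtin_unpredictable only where it exists. A minimal sketch, assuming a hypothetical UNPREDICTABLE() macro name and relying on __has_builtin (available in Clang and in GCC 10 and later); on compilers without the builtin the condition is passed through unchanged, so it carries no hint at all.

```c
#include <stdint.h>

/* Hypothetical macro: mark a condition as unpredictable where the
   compiler supports __builtin_unpredictable (e.g. Clang), otherwise
   leave the condition unchanged. */
#if defined __has_builtin
#  if __has_builtin(__builtin_unpredictable)
#    define UNPREDICTABLE(x) __builtin_unpredictable(x)
#  endif
#endif
#ifndef UNPREDICTABLE
#  define UNPREDICTABLE(x) (x)
#endif

/* Example use on a data-dependent select (hypothetical names). */
static uint32_t pick(int bit, uint32_t a, uint32_t b)
{
    if (UNPREDICTABLE(bit != 0))
        return a;
    return b;
}
```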

-- 
Cheers,
Changbin Du
