> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: Thursday, September 11, 2025 12:56 PM
> To: Tamar Christina <[email protected]>
> Cc: Robin Dapp <[email protected]>; GCC Patches <gcc-
> [email protected]>; [email protected]
> Subject: Re: New optabs and IFN required for early break [bikeshed]
> 
> On Thu, 11 Sep 2025, Tamar Christina wrote:
> 
> > This email is to discuss the optabs and IFNs that I will need to finish off 
> > the
> > early break optimizations in GCC 16 and to minimize respins on the final 
> > patches.
> >
> > Optabs only, These will be handled through expand rewriting them if the 
> > targets
> > supports them:
> >
> > -- cbranch
> >
> > At the moment we model vector compare and branch using cbranch on vector
> > arguments. e.g.
> >
> > a = b `cmp` c
> > if (a <cmp2> 0)
> >   ...
> >
> > where <cmp2> is != or == reflecting whether we want ANY or ALL.
> >
> > The plan is to deprecate this in favor of an explicit vector compare, which
> > allows us to elide the != 0 and == 0.  This allows us to be able to generate
> > more efficient code in the backends.  For instance for Adv. SIMD we can now 
> > use
> > an SVE compare + branch for Adv. SIMD.
> >
> > For this feature 6 new optabs are needed, and cbranch will be made
> unsupported
> > for vector arguments and I will update all backends.
> >
> > -- vec_cbranch_any, vec_cbranch_all
> >
> > These are unpredicated version of the compares.  This means this version 
> > also
> > does not have an ELSE argument as all lanes are relevant.
> >
> > cond_vec_cbranch_any, cond_vec_cbranch_all
> > cond_len_vec_cbranch_any, cond_len_vec_cbranch_all
> >
> > These are the predicated equivalent of the above.  These take the 
> > predicate/len
> > of the operation and also takes an ELSE argument for what happens when the
> > predicate is inactive.  I think the default is zero.  But I'm unsure how to
> > handle when a target wants something other than zero since this is an expand
> time
> > rewriting and so I can't reject the expansion.
> 
> You are rewriting from a GIMPLE_COND, right?  If so the vectorizer
> should have verified an appropriate expansion exists (I suggest to
> separate a function that selects the optab you can call from both
> expansion and vectorization for this)

But how do I communicate this preference? I guess we'd need some kind of
target hook? But cond_* variants are an optimization on top of vec_cbranch I 
didn't
think we'd need to check for the cond_ variant in the vectorizer. As we have a
default fallback.

> 
> > -- vec_cmp
> >
> > At the moment vector compares are always unpredicated. Masking or LENs are
> > applied to the result of the unpredicated compare.
> >
> > E.g.
> >
> > a = b `cmp` c
> > d = a & loop_mask
> >
> > and we rely on combine to push the mask into the compare later on.  however
> with
> > the introduction of cond_vec_cbranch etc we rely on CSE being able to CSE
> duplicate
> > compares.  CSE however runs before combine and so we need a mechanism to
> expand
> > the compares as predicated from the start.
> >
> > This also allows us to simplify some backend patterns.  The proposed optabs 
> > are
> >
> > -- cond_vec_cmp, cond_len_vec_cmp
> >
> > which are similar to vec_cmp$a$b such that it takes a condition code as
> argument
> > and a predicate or len.  This is an optimization and so it's an optional 
> > optab
> > for targets.  A target would need to implement both vec_cmp and cond_vec_cmp
> > to get the optimization.
> 
> So like the branch case this needs an else value, just in case this
> wasn't obvious.

Yep.

> 
> > --- Predicate operations
> >
> > The following ones will get new optabs and IFN and will be generated by the
> > vectorizer.  At the moment I don't know how the RVV code deals with similar
> > operations.  I will assume that for their case they work on 
> > masks/predicates as
> > well as that is the result of the vector compares.  Do tell me if I need a 
> > LEN
> > version of these, but I would appreciate links to instructions so I know 
> > how to
> > expect them to be used.
> >
> > These operations are used to allow us to not need the scalar epilogue.
> >
> > -- Predicate partitioning
> >
> > When an early break is vectorized
> >
> > if (a `cmp` b)
> >
> > Then if we want to avoid the scalar the scalar epilogue we would need to 
> > handle
> > such cases as
> >
> > return i
> > return a[i]
> >
> > etc.
> >
> > The first one is return i.  My patches change it such that we now use a 
> > scalar
> > IV for the forced live IVs.  Which means return i becomes
> 
> So that's possibly an additional scalar IV, right?

Yes that's right.

> 
> > mask1 = all_before_first (mask)
> > scalar_iv + count_active (mask1);
> 
> So 'mask' is the mask for the if (a cmp b), aka it indicates whether
> we exited for a lane.  So this is ffs (mask) - 1?

Yeah correct.

> 
> > all_before_first and all_after_first map to the SVE instructions BRKB and 
> > BRKA
> > essentially all_before_first makes all lanes as active in the predicate 
> > until
> > but not including the first active element.
> >
> > So a mask of {0,0,0,1} becomes {1,1,1,0} such that the scalar_iv + 3 which 
> > gives
> > us the final value of i.  In SVE the count_active becomes CNTP and x + CNTP
> > gets optimized to INCP, so we don't need to model INCP as optabs.
> >
> > The mask1 will also be used if the loop had a store, so we can only do the
> > partials stores outstanding when we exit the loop.  i.e. store movement 
> > would
> > now also duplicate the stores in the exit block and mask them by mask1.
> 
> So the original scalar IV had an evolution { init, +, step }, the
> "vector" IV will have { init, +, VF * step } then, right?

Yeah, and IVopts is actually pretty good in mixing this new scalar IV with the 
IV
the vectorizer generates as new loop control.  So we actually in most cases only
end up having one scalar IV in the end, just with different addressing.

>  I wonder
> whether in case we have an actual vector IV for the IV (because of
> other uses), the epilog could use i == vector-i[0], so extract
> element zero from a vector IV to compute the scalar i?

We'd still have a normal scalar IV there wouldn't we?  I assume you're
talking about the case where e.g. you have a[i] = i ?

My code atm generates both the vector and scalar IV and the vector IV becomes
unused.   I've done this for the reason you outlined above. i.e. that if you 
actually
need the vector IV that the reduction would be potentially better.

This was also practical because the vectorizer didn't like me generating scalar 
code
only from a node in an SLP tree.  I tried to not do this, but it resulted in 
quite invasive
changes.  So now I just generate the scalar IV as an additional output, which 
was cleaner
and worked nicely, but also gave me the option I described above.

> 
> > return a[i]
> >
> > is handled by using extract_first_active, in SVE these map to the COMPACT or
> > SPLICE instruction depending on the size of the lanes. The usage would be
> 
> So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
> integer element types).  Those could be used for this but they have
> a vector result (and element zero would be the first active).
> 

This is a good point.  COMPACT is actually the same, it returns a vector with
all the active elements shuffled down.  Which I think is the same as vcompress.

I could instead make it return a vector and do an explicit vec_select. 

> But don't you possibly want the last inactive as well, dependent on
> whether this is a peeled/not peeled exit?  We could shift the mask
> by one for either case.

Yeah, but I think peeled can be handled with all_including_first
So for {0,0,1,1} it returns {1,1,1,0} and then we can use the existing
EXTRACT_LAST.

This would be BRKB in SVE.  So this should already be covered in the proposal.

I do realize the name I used before isn't very clear. So lets rename, the two
optabs/ifn we need is all_before_first, all_including_first.

> 
> vcompresspd is not available on AVX2 or SSE4.1, using a vector-vector
> permute to get the element 'i' % nunits to lane zero would be another
> possibility, also for non-float or double sized elements we need sth
> like this.
> 
> I do wonder whether we want to have the compress/expand as actual
> optabs when we use them.  Having an extract_first (without _active,
> following extract_last_optab) is probably OK to abstract this to
> some extent.  extract_last doesn't specify what happens if no
> bit is set in the mask, fold_extract_last seems to be the same
> but with an else value - I wonder whether we should canonicalize
> those and thus have an else value for extract_first.

I assume you then mean cond_extract_first? Since we'd definitely
need the mask though to quantify what first is.

> 
> > extract_first_active (vect_a, mask)
> >
> > Note that for this IFN it is undefined what would happen if mask is all 
> > false.
> > This is required to match the SVE semantics of the operations but it's
> > guaranteed the vectorizer does only calls the IFN when at least one lane is
> > active.
> >
> > I do not believe I need a LEN version here either? But If If I'm wrong It 
> > would
> > be useful to have a small example.
> 
> I think you need a len variant unless the mask producer had len
> applied with an else value of 0 (IIRC RVV always preferse 'undefined'
> as else value).  OTOH the "first" element - if one is set and we never
> require 'else' - should work with or without loop masking (with len or
> mask).

Yeah, it would have, since the mask would always come from a compare
which have len applied to it.  But I'm wondering what the RVV equivalent
here is.

> 
> That said, I do wonder why we have both extract_last and
> fold_extract_last.

I.. also wondered this.  I remember asking this before :')

Tamar

> 
> Richard.
> 
> >
> > Thanks,
> > Tamar
> >
> 
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to