On Thu, 11 Sep 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Thursday, September 11, 2025 12:56 PM
> > To: Tamar Christina <[email protected]>
> > Cc: Robin Dapp <[email protected]>; GCC Patches <gcc-
> > [email protected]>; [email protected]
> > Subject: Re: New optabs and IFN required for early break [bikeshed]
> > 
> > On Thu, 11 Sep 2025, Tamar Christina wrote:
> > 
> > > This email is to discuss the optabs and IFNs that I will need to finish 
> > > off the
> > > early break optimizations in GCC 16 and to minimize respins on the final 
> > > patches.
> > >
> > > Optabs only.  These will be handled through expand, rewriting them if the
> > > target supports them:
> > >
> > > -- cbranch
> > >
> > > At the moment we model vector compare and branch using cbranch on vector
> > > arguments, e.g.
> > >
> > > a = b `cmp` c
> > > if (a <cmp2> 0)
> > >   ...
> > >
> > > where <cmp2> is != or == reflecting whether we want ANY or ALL.
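> > >
> > > For reference, a minimal scalar source that ends up in this shape after
> > > vectorization (hypothetical example, the ANY case):
> > >
> > > int f (int *b, int *c, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     if (b[i] == c[i])
> > >       return i;
> > >   return -1;
> > > }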
> > >
> > > The plan is to deprecate this in favor of an explicit vector compare, which
> > > allows us to elide the != 0 and == 0 and to generate more efficient code in
> > > the backends.  For instance, for Adv. SIMD we can now use an SVE compare +
> > > branch.
> > >
> > > For this feature 6 new optabs are needed; cbranch will be made unsupported
> > > for vector arguments and I will update all backends.
> > >
> > > -- vec_cbranch_any, vec_cbranch_all
> > >
> > > These are the unpredicated versions of the compares.  This means they also
> > > do not have an ELSE argument, as all lanes are relevant.
> > >
> > > -- cond_vec_cbranch_any, cond_vec_cbranch_all
> > > -- cond_len_vec_cbranch_any, cond_len_vec_cbranch_all
> > >
> > > These are the predicated equivalents of the above.  They take the
> > > predicate/len of the operation and also an ELSE argument for what happens
> > > when the predicate is inactive.  I think the default is zero, but I'm
> > > unsure how to handle a target that wants something other than zero, since
> > > this is an expand-time rewriting and so I can't reject the expansion.
> > 
> > You are rewriting from a GIMPLE_COND, right?  If so the vectorizer
> > should have verified an appropriate expansion exists (I suggest splitting
> > out a function that selects the optab, which you can call from both
> > expansion and vectorization for this).
> 
> But how do I communicate this preference?  I guess we'd need some kind of
> target hook?  Since the cond_* variants are an optimization on top of
> vec_cbranch I didn't think we'd need to check for the cond_ variant in the
> vectorizer, as we have a default fallback.

Look at how we communicate the else value elsewhere.  As it's not
going to be explicit on the GIMPLE_COND I'd hope it will re-analyze
the same (so use a common function).

> > 
> > > -- vec_cmp
> > >
> > > At the moment vector compares are always unpredicated. Masking or LENs are
> > > applied to the result of the unpredicated compare.
> > >
> > > E.g.
> > >
> > > a = b `cmp` c
> > > d = a & loop_mask
> > >
> > > and we rely on combine to push the mask into the compare later on.
> > > However, with the introduction of cond_vec_cbranch etc. we rely on CSE
> > > being able to CSE duplicate compares.  CSE, however, runs before combine,
> > > so we need a mechanism to expand the compares as predicated from the
> > > start.
> > >
> > > This also allows us to simplify some backend patterns.  The proposed 
> > > optabs are
> > >
> > > -- cond_vec_cmp, cond_len_vec_cmp
> > >
> > > which are similar to vec_cmp$a$b in that they take a condition code as an
> > > argument plus a predicate or len.  This is an optimization and so it's an
> > > optional optab for targets.  A target would need to implement both vec_cmp
> > > and cond_vec_cmp to get the optimization.
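> > >
> > > As a rough sketch (IFN name and operand order purely illustrative), instead
> > > of
> > >
> > > a = b `cmp` c
> > > d = a & loop_mask
> > >
> > > we would expand the predicated compare directly as
> > >
> > > d = .COND_VEC_CMP (loop_mask, `cmp`, b, c)
> > >
> > > so that CSE sees identical predicated compares before combine ever runs.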
> > 
> > So like the branch case this needs an else value, just in case this
> > wasn't obvious.
> 
> Yep.
> 
> > 
> > > --- Predicate operations
> > >
> > > The following ones will get new optabs and IFNs and will be generated by
> > > the vectorizer.  At the moment I don't know how the RVV code deals with
> > > similar operations.  I will assume that in their case they work on
> > > masks/predicates as well, since that is the result of the vector compares.
> > > Do tell me if I need a LEN version of these, but I would appreciate links
> > > to instructions so I know how they are expected to be used.
> > >
> > > These operations allow us to avoid needing the scalar epilogue.
> > >
> > > -- Predicate partitioning
> > >
> > > When an early break is vectorized
> > >
> > > if (a `cmp` b)
> > >
> > > then if we want to avoid the scalar epilogue we would need to handle cases
> > > such as
> > >
> > > return i
> > > return a[i]
> > >
> > > etc.
> > >
> > > The first one is return i.  My patches change it such that we now use a
> > > scalar IV for the forced-live IVs, which means return i becomes
> > 
> > So that's possibly an additional scalar IV, right?
> 
> Yes that's right.
> 
> > 
> > > mask1 = all_before_first (mask)
> > > scalar_iv + count_active (mask1);
> > 
> > So 'mask' is the mask for the if (a cmp b), aka it indicates whether
> > we exited for a lane.  So this is ffs (mask) - 1?
> 
> Yeah correct.
> 
> > 
> > > all_before_first and all_after_first map to the SVE instructions BRKB and
> > > BRKA.  Essentially, all_before_first marks all lanes in the predicate as
> > > active up to but not including the first active element.
> > >
> > > So a mask of {0,0,0,1} becomes {1,1,1,0}, such that scalar_iv + 3 gives us
> > > the final value of i.  In SVE the count_active becomes CNTP, and x + CNTP
> > > gets optimized to INCP, so we don't need to model INCP as an optab.
> > >
> > > mask1 will also be used if the loop had a store, so we only do the partial
> > > stores still outstanding when we exit the loop.  I.e. store movement would
> > > now also duplicate the stores in the exit block and mask them by mask1.
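> > >
> > > For illustration (made-up example), think of
> > >
> > > for (int i = 0; i < n; i++)
> > >   {
> > >     if (x[i] == y[i])
> > >       break;
> > >     out[i] = x[i];
> > >   }
> > >
> > > where in the exiting vector iteration only the lanes before the first
> > > active element of mask may store, i.e. the duplicated store in the exit
> > > block gets masked by mask1.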
> > 
> > So the original scalar IV had an evolution { init, +, step }, the
> > "vector" IV will have { init, +, VF * step } then, right?
> 
> Yeah, and IVopts is actually pretty good at mixing this new scalar IV with
> the IV the vectorizer generates as the new loop control.  So in most cases we
> actually only end up with one scalar IV in the end, just with different
> addressing.
> 
> >  I wonder
> > whether in case we have an actual vector IV for the IV (because of
> > other uses), the epilog could use i == vector-i[0], so extract
> > element zero from a vector IV to compute the scalar i?
> 
> We'd still have a normal scalar IV there wouldn't we?  I assume you're
> talking about the case where e.g. you have a[i] = i ?

No, I meant for the return i; case, if we have a vector_i[] then
the scalar IV i on exit with the evolution above is vector_i[0], no?
If you need the 'i' for the lane that exited you then add ffs (mask) - 1
to that.
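
Concretely, with made-up numbers: for VF 4 and step 1, after three full
vector iterations vector_i would be { 12, 13, 14, 15 }, so vector_i[0] is 12;
if the exit mask is { 0, 0, 1, 0 } then ffs (mask) - 1 == 2 and thus i == 14.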

And yes, we have a scalar counting IV, but currently that counts
vector iterations up from zero to N.  I'm not sure IVOPTs can easily
relate the two?

> My code atm generates both the vector and the scalar IV, and the vector IV
> becomes unused.  I've done this for the reason you outlined above, i.e. that
> if you actually need the vector IV the reduction would potentially be better.
> 
> This was also practical because the vectorizer didn't like me generating
> scalar code only from a node in an SLP tree.  I tried not to do this, but it
> resulted in quite invasive changes.  So now I just generate the scalar IV as
> an additional output, which was cleaner and worked nicely, but also gave me
> the option I described above.

I see.  It's already a bit awkward how we handle fold-left reduction scalar
IVs, so I guess that's expected.  I would expect putting a scalar def as
"vector def" works fine, but of course not if you need both the scalar and
the vector one.

> > 
> > > return a[i]
> > >
> > > is handled by using extract_first_active; in SVE this maps to the COMPACT
> > > or SPLICE instruction depending on the size of the lanes.  The usage would be
> > 
> > So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
> > integer element types).  Those could be used for this but they have
> > a vector result (and element zero would be the first active).
> > 
> 
> This is a good point.  COMPACT is actually the same: it returns a vector with
> all the active elements shuffled down, which I think is the same as vcompress.
> 
> I could instead make it return a vector and do an explicit vec_select. 

Yes.

> > But don't you possibly want the last inactive as well, dependent on
> > whether this is a peeled/not peeled exit?  We could shift the mask
> > by one for either case.
> 
> Yeah, but I think the peeled case can be handled with all_including_first.
> So for {0,0,1,1} it returns {1,1,1,0} and then we can use the existing
> EXTRACT_LAST.
> 
> This would be BRKA in SVE.  So this should already be covered in the proposal.
> 
> I do realize the names I used before aren't very clear, so let's rename: the
> two optabs/IFNs we need are all_before_first and all_including_first.

What do these do?  Count lanes in a mask?

> > 
> > vcompresspd is not available on AVX2 or SSE4.1; using a vector-vector
> > permute to get the element 'i' % nunits to lane zero would be another
> > possibility, and for non-float or double sized elements we need something
> > like this anyway.
> > 
> > I do wonder whether we want to have the compress/expand as actual
> > optabs when we use them.  Having an extract_first (without _active,
> > following extract_last_optab) is probably OK to abstract this to
> > some extent.  extract_last doesn't specify what happens if no
> > bit is set in the mask, fold_extract_last seems to be the same
> > but with an else value - I wonder whether we should canonicalize
> > those and thus have an else value for extract_first.
> 
> I assume you then mean cond_extract_first?  Since we'd definitely need the
> mask though to quantify what "first" is.

extract_last already has a mask argument.  "cond" doesn't really
fit here as we only have a single scalar result?  So I'd follow
the naming scheme of extract_last (though fold_ is odd here).

> > 
> > > extract_first_active (vect_a, mask)
> > >
> > > Note that for this IFN it is undefined what happens if mask is all false.
> > > This is required to match the SVE semantics of the operations, but it's
> > > guaranteed that the vectorizer only calls the IFN when at least one lane
> > > is active.
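> > >
> > > For illustration (made-up example), the scalar source this covers would be
> > > something like
> > >
> > > int f (int *a, int *p, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     if (p[i])
> > >       return a[i];
> > >   return 0;
> > > }
> > >
> > > where mask holds the compare result for the exiting vector iteration and
> > > extract_first_active (vect_a, mask) yields a[i] for the first lane that
> > > triggered the break.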
> > >
> > > I do not believe I need a LEN version here either?  But if I'm wrong it
> > > would be useful to have a small example.
> > 
> > I think you need a len variant unless the mask producer had len
> > applied with an else value of 0 (IIRC RVV always prefers 'undefined'
> > as else value).  OTOH the "first" element - if one is set and we never
> > require 'else' - should work with or without loop masking (with len or
> > mask).
> 
> Yeah, it would have, since the mask would always come from a compare
> which has len applied to it.  But I'm wondering what the RVV equivalent
> here is.
> 
> > 
> > That said, I do wonder why we have both extract_last and
> > fold_extract_last.
> 
> I.. also wondered this.  I remember asking this before :')

It somehow suggests the ISA has two variants?
 
> Tamar
> 
> > 
> > Richard.
> > 
> > >
> > > Thanks,
> > > Tamar
> > >
> > 
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
