On Fri, 12 Sep 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Friday, September 12, 2025 1:40 PM
> > To: Tamar Christina <[email protected]>
> > Cc: Robin Dapp <[email protected]>; GCC Patches <gcc-
> > [email protected]>; [email protected]
> > Subject: RE: New optabs and IFN required for early break [bikeshed]
> > 
> > On Thu, 11 Sep 2025, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Thursday, September 11, 2025 12:56 PM
> > > > To: Tamar Christina <[email protected]>
> > > > Cc: Robin Dapp <[email protected]>; GCC Patches <gcc-
> > > > [email protected]>; [email protected]
> > > > Subject: Re: New optabs and IFN required for early break [bikeshed]
> > > >
> > > > On Thu, 11 Sep 2025, Tamar Christina wrote:
> > > >
> > > > > This email is to discuss the optabs and IFNs that I will need to 
> > > > > finish off the
> > > > > early break optimizations in GCC 16 and to minimize respins on the 
> > > > > final
> > patches.
> > > > >
> > > > > Optabs only, These will be handled through expand rewriting them if 
> > > > > the
> > targets
> > > > > supports them:
> > > > >
> > > > > -- cbranch
> > > > >
> > > > > At the moment we model vector compare and branch using cbranch on
> > vector
> > > > > arguments. e.g.
> > > > >
> > > > > a = b `cmp` c
> > > > > if (a <cmp2> 0)
> > > > >   ...
> > > > >
> > > > > where <cmp2> is != or == reflecting whether we want ANY or ALL.
> > > > >
> > > > > The plan is to deprecate this in favor of an explicit vector compare, 
> > > > > which
> > > > > allows us to elide the != 0 and == 0.  This allows us to be able to 
> > > > > generate
> > > > > more efficient code in the backends.  For instance for Adv. SIMD we 
> > > > > can now
> > use
> > > > > an SVE compare + branch for Adv. SIMD.
> > > > >
> > > > > For this feature 6 new optabs are needed, and cbranch will be made
> > > > unsupported
> > > > > for vector arguments and I will update all backends.
> > > > >
> > > > > -- vec_cbranch_any, vec_cbranch_all
> > > > >
> > > > > These are unpredicated version of the compares.  This means this 
> > > > > version also
> > > > > does not have an ELSE argument as all lanes are relevant.
> > > > >
> > > > > cond_vec_cbranch_any, cond_vec_cbranch_all
> > > > > cond_len_vec_cbranch_any, cond_len_vec_cbranch_all
> > > > >
> > > > > These are the predicated equivalent of the above.  These take the
> > predicate/len
> > > > > of the operation and also takes an ELSE argument for what happens 
> > > > > when the
> > > > > predicate is inactive.  I think the default is zero.  But I'm unsure 
> > > > > how to
> > > > > handle when a target wants something other than zero since this is an 
> > > > > expand
> > > > time
> > > > > rewriting and so I can't reject the expansion.
> > > >
> > > > You are rewriting from a GIMPLE_COND, right?  If so the vectorizer
> > > > should have verified an appropriate expansion exists (I suggest to
> > > > separate a function that selects the optab you can call from both
> > > > expansion and vectorization for this)
> > >
> > > But how do I communicate this preference? I guess we'd need some kind of
> > > target hook? But cond_* variants are an optimization on top of 
> > > vec_cbranch I
> > didn't
> > > think we'd need to check for the cond_ variant in the vectorizer. As we 
> > > have a
> > > default fallback.
> > 
> > Look at how we communicate the else value elsehwere.  As it's not
> > going to be explicit on the GIMPLE_COND I'd hope it will re-analyze
> > the same (so use a common function).
> 
> Thanks for the hint, will read up.
> 
> > 
> > > >
> > > > > -- vec_cmp
> > > > >
> > > > > At the moment vector compares are always unpredicated. Masking or LENs
> > are
> > > > > applied to the result of the unpredicated compare.
> > > > >
> > > > > E.g.
> > > > >
> > > > > a = b `cmp` c
> > > > > d = a & loop_mask
> > > > >
> > > > > and we rely on combine to push the mask into the compare later on.
> > however
> > > > with
> > > > > the introduction of cond_vec_cbranch etc we rely on CSE being able to 
> > > > > CSE
> > > > duplicate
> > > > > compares.  CSE however runs before combine and so we need a mechanism
> > to
> > > > expand
> > > > > the compares as predicated from the start.
> > > > >
> > > > > This also allows us to simplify some backend patterns.  The proposed 
> > > > > optabs
> > are
> > > > >
> > > > > -- cond_vec_cmp, cond_len_vec_cmp
> > > > >
> > > > > which are similar to vec_cmp$a$b such that it takes a condition code 
> > > > > as
> > > > argument
> > > > > and a predicate or len.  This is an optimization and so it's an 
> > > > > optional optab
> > > > > for targets.  A target would need to implement both vec_cmp and
> > cond_vec_cmp
> > > > > to get the optimization.
> > > >
> > > > So like the branch case this needs an else value, just in case this
> > > > wasn't obvious.
> > >
> > > Yep.
> > >
> > > >
> > > > > --- Predicate operations
> > > > >
> > > > > The following ones will get new optabs and IFN and will be generated 
> > > > > by the
> > > > > vectorizer.  At the moment I don't know how the RVV code deals with 
> > > > > similar
> > > > > operations.  I will assume that for their case they work on 
> > > > > masks/predicates
> > as
> > > > > well as that is the result of the vector compares.  Do tell me if I 
> > > > > need a LEN
> > > > > version of these, but I would appreciate links to instructions so I 
> > > > > know how to
> > > > > expect them to be used.
> > > > >
> > > > > These operations are used to allow us to not need the scalar epilogue.
> > > > >
> > > > > -- Predicate partitioning
> > > > >
> > > > > When an early break is vectorized
> > > > >
> > > > > if (a `cmp` b)
> > > > >
> > > > > Then if we want to avoid the scalar the scalar epilogue we would need 
> > > > > to
> > handle
> > > > > such cases as
> > > > >
> > > > > return i
> > > > > return a[i]
> > > > >
> > > > > etc.
> > > > >
> > > > > The first one is return i.  My patches change it such that we now use 
> > > > > a scalar
> > > > > IV for the forced live IVs.  Which means return i becomes
> > > >
> > > > So that's possibly an additional scalar IV, right?
> > >
> > > Yes that's right.
> > >
> > > >
> > > > > mask1 = all_before_first (mask)
> > > > > scalar_iv + count_active (mask1);
> > > >
> > > > So 'mask' is the mask for the if (a cmp b), aka it indicates whether
> > > > we exited for a lane.  So this is ffs (mask) - 1?
> > >
> > > Yeah correct.
> > >
> > > >
> > > > > all_before_first and all_after_first map to the SVE instructions BRKB 
> > > > > and BRKA
> > > > > essentially all_before_first makes all lanes as active in the 
> > > > > predicate until
> > > > > but not including the first active element.
> > > > >
> > > > > So a mask of {0,0,0,1} becomes {1,1,1,0} such that the scalar_iv + 3 
> > > > > which
> > gives
> > > > > us the final value of i.  In SVE the count_active becomes CNTP and x 
> > > > > + CNTP
> > > > > gets optimized to INCP, so we don't need to model INCP as optabs.
> > > > >
> > > > > The mask1 will also be used if the loop had a store, so we can only 
> > > > > do the
> > > > > partials stores outstanding when we exit the loop.  i.e. store 
> > > > > movement
> > would
> > > > > now also duplicate the stores in the exit block and mask them by 
> > > > > mask1.
> > > >
> > > > So the original scalar IV had an evolution { init, +, step }, the
> > > > "vector" IV will have { init, +, VF * step } then, right?
> > >
> > > Yeah, and IVopts is actually pretty good in mixing this new scalar IV 
> > > with the IV
> > > the vectorizer generates as new loop control.  So we actually in most 
> > > cases only
> > > end up having one scalar IV in the end, just with different addressing.
> > >
> > > >  I wonder
> > > > whether in case we have an actual vector IV for the IV (because of
> > > > other uses), the epilog could use i == vector-i[0], so extract
> > > > element zero from a vector IV to compute the scalar i?
> > >
> > > We'd still have a normal scalar IV there wouldn't we?  I assume you're
> > > talking about the case where e.g. you have a[i] = i ?
> > 
> > No, I meant for the return i; case, if we have a vector_i[] then
> > the scalar IV i on exit with the evolution above is vector_i[0], no?
> > If you need the 'i' for the lane that exited you then add ffs (mask) - 1
> > to that.
> 
> Yes, so return i is quite easy, it's
> 
> mask1 = all_before_first (mask)
> return scalar_iv + COUNT_ACTIVE(mask1)
> 
> lets say the result of the compare is
> 
> mask = {0,0,1,1}
> mask1 = {1,1,0,0}
> and so the return value is scalar_iv + 2;
> 
> where scalar_iv is counting the IV at the start of the iteration.
> 
> For SVE, this ends up being scalar_iv + CNTP (mask1)
> Which the backend will then transform into INCP (scalar_iv, mask1)
> 
> > 
> > And yes, we have a scalar counting IV, but currently that counts
> > vector iterations up from zero to N.  I'm not sure IVOPTs can easily
> > relate the two?
> > 
> > > My code atm generates both the vector and scalar IV and the vector IV 
> > > becomes
> > > unused.   I've done this for the reason you outlined above. i.e. that if 
> > > you actually
> > > need the vector IV that the reduction would be potentially better.
> > >
> > > This was also practical because the vectorizer didn't like me generating 
> > > scalar code
> > > only from a node in an SLP tree.  I tried to not do this, but it resulted 
> > > in quite
> > invasive
> > > changes.  So now I just generate the scalar IV as an additional output, 
> > > which was
> > cleaner
> > > and worked nicely, but also gave me the option I described above.
> > 
> > I see.  It's already a bit awkward how we handle fold-left reduction
> > scalar reduction IVs, so I guess expected.  I would expect putting a
> > scalar def as "vector def" works fine, but of course not if you need
> > both, the scalar and the vector one.
> > 
> 
> I haven't tried as a def. Maybe a def could work? But I think for now the 
> flexibility to
> choose might come in handy.  I'm hoping to have it to experiment with codegen 
> and
> if not needed to change it later.   It also makes the patch much simpler to 
> reason about.
> 
> But can investigate this if you prefer?
> 
> > > >
> > > > > return a[i]
> > > > >
> > > > > is handled by using extract_first_active, in SVE these map to the 
> > > > > COMPACT or
> > > > > SPLICE instruction depending on the size of the lanes. The usage 
> > > > > would be
> > > >
> > > > So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
> > > > integer element types).  Those could be used for this but they have
> > > > a vector result (and element zero would be the first active).
> > > >
> > >
> > > This is a good point.  COMPACT is actually the same, it returns a vector 
> > > with
> > > all the active elements shuffled down.  Which I think is the same as 
> > > vcompress.
> > >
> > > I could instead make it return a vector and do an explicit vec_select.
> > 
> > Yes.
> > 
> > > > But don't you possibly want the last inactive as well, dependent on
> > > > whether this is a peeled/not peeled exit?  We could shift the mask
> > > > by one for either case.
> > >
> > > Yeah, but I think peeled can be handled with all_including_first
> > > So for {0,0,1,1} it returns {1,1,1,0} and then we can use the existing
> > > EXTRACT_LAST.
> > >
> > > This would be BRKB in SVE.  So this should already be covered in the 
> > > proposal.
> > >
> > > I do realize the name I used before isn't very clear. So lets rename, the 
> > > two
> > > optabs/ifn we need is all_before_first, all_including_first.
> > 
> > What do these do?  Count lanes in a maks?
> 
> No, they're essentially mask partitioning operation.
> Given a mask, {0,0,1,0}
> 
> all_before_first returns {1,1,0,0}
> and all_including_first {1,1,1,0}
> 
> They're used to create a mask for when you're dealing with control flow
> situations where you have to partition the mask into parts you have to still
> execute from one block and parts where you don't.
> 
> so lets say
> 
> a[i] = ..
> if (a[i] > 3)
>   return 0;
> 
> on exit, we still have to perform the operations on a[i] from the mask
> created by all_before_first to maintain the correct side-effects.
> 
> I think from your pervious email, conceptually all_including_first is a
> signed >> 1 on all_before_first, but we can't model it like that due to
> how the requirements of active lanes are in SVE. The top bits could be 0.
> 
> > 
> > > >
> > > > vcompresspd is not available on AVX2 or SSE4.1, using a vector-vector
> > > > permute to get the element 'i' % nunits to lane zero would be another
> > > > possibility, also for non-float or double sized elements we need sth
> > > > like this.
> > > >
> > > > I do wonder whether we want to have the compress/expand as actual
> > > > optabs when we use them.  Having an extract_first (without _active,
> > > > following extract_last_optab) is probably OK to abstract this to
> > > > some extent.  extract_last doesn't specify what happens if no
> > > > bit is set in the mask, fold_extract_last seems to be the same
> > > > but with an else value - I wonder whether we should canonicalize
> > > > those and thus have an else value for extract_first.
> > >
> > > I assume you then mean cond_extract_first? Since we'd definitely
> > > need the mask though to quantify what first is.
> > 
> > extract_last already has a mask argument.  "cond" doesn't really
> > fit here as we only have a single scalar result?  So I'd follow
> > the namign scheme of extract_last (though fold_ is odd here).
> 
> No problem.
> 
> > 
> > > >
> > > > > extract_first_active (vect_a, mask)
> > > > >
> > > > > Note that for this IFN it is undefined what would happen if mask is 
> > > > > all false.
> > > > > This is required to match the SVE semantics of the operations but it's
> > > > > guaranteed the vectorizer does only calls the IFN when at least one 
> > > > > lane is
> > > > > active.
> > > > >
> > > > > I do not believe I need a LEN version here either? But If If I'm 
> > > > > wrong It would
> > > > > be useful to have a small example.
> > > >
> > > > I think you need a len variant unless the mask producer had len
> > > > applied with an else value of 0 (IIRC RVV always preferse 'undefined'
> > > > as else value).  OTOH the "first" element - if one is set and we never
> > > > require 'else' - should work with or without loop masking (with len or
> > > > mask).
> > >
> > > Yeah, it would have, since the mask would always come from a compare
> > > which have len applied to it.  But I'm wondering what the RVV equivalent
> > > here is.
> > >
> > > >
> > > > That said, I do wonder why we have both extract_last and
> > > > fold_extract_last.
> > >
> > > I.. also wondered this.  I remember asking this before :')
> > 
> > It somehow suggests the ISA has two variants?
> 
> I'll investigate.
> 
> Both other than this, are you broadly OK with this proposal?

Yes.  We'll have to see how the scalar IV thing turns out.

> I'm hoping to get something concrete so I can discuss at cauldron and
> not waste your time too much :)

;)

Richard.

> Thanks,
> Tamar
> 
> > 
> > > Tamar
> > >
> > > >
> > > > Richard.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> > >
> > 
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to