On Fri, 12 Sep 2025, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <[email protected]> > > Sent: Friday, September 12, 2025 1:40 PM > > To: Tamar Christina <[email protected]> > > Cc: Robin Dapp <[email protected]>; GCC Patches <gcc- > > [email protected]>; [email protected] > > Subject: RE: New optabs and IFN required for early break [bikeshed] > > > > On Thu, 11 Sep 2025, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <[email protected]> > > > > Sent: Thursday, September 11, 2025 12:56 PM > > > > To: Tamar Christina <[email protected]> > > > > Cc: Robin Dapp <[email protected]>; GCC Patches <gcc- > > > > [email protected]>; [email protected] > > > > Subject: Re: New optabs and IFN required for early break [bikeshed] > > > > > > > > On Thu, 11 Sep 2025, Tamar Christina wrote: > > > > > > > > > This email is to discuss the optabs and IFNs that I will need to > > > > > finish off the > > > > > early break optimizations in GCC 16 and to minimize respins on the > > > > > final > > patches. > > > > > > > > > > Optabs only, These will be handled through expand rewriting them if > > > > > the > > targets > > > > > supports them: > > > > > > > > > > -- cbranch > > > > > > > > > > At the moment we model vector compare and branch using cbranch on > > vector > > > > > arguments. e.g. > > > > > > > > > > a = b `cmp` c > > > > > if (a <cmp2> 0) > > > > > ... > > > > > > > > > > where <cmp2> is != or == reflecting whether we want ANY or ALL. > > > > > > > > > > The plan is to deprecate this in favor of an explicit vector compare, > > > > > which > > > > > allows us to elide the != 0 and == 0. This allows us to be able to > > > > > generate > > > > > more efficient code in the backends. For instance for Adv. SIMD we > > > > > can now > > use > > > > > an SVE compare + branch for Adv. SIMD. > > > > > > > > > > For this feature 6 new optabs are needed, and cbranch will be made > > > > unsupported > > > > > for vector arguments and I will update all backends. > > > > > > > > > > -- vec_cbranch_any, vec_cbranch_all > > > > > > > > > > These are unpredicated version of the compares. This means this > > > > > version also > > > > > does not have an ELSE argument as all lanes are relevant. > > > > > > > > > > cond_vec_cbranch_any, cond_vec_cbranch_all > > > > > cond_len_vec_cbranch_any, cond_len_vec_cbranch_all > > > > > > > > > > These are the predicated equivalent of the above. These take the > > predicate/len > > > > > of the operation and also takes an ELSE argument for what happens > > > > > when the > > > > > predicate is inactive. I think the default is zero. But I'm unsure > > > > > how to > > > > > handle when a target wants something other than zero since this is an > > > > > expand > > > > time > > > > > rewriting and so I can't reject the expansion. > > > > > > > > You are rewriting from a GIMPLE_COND, right? If so the vectorizer > > > > should have verified an appropriate expansion exists (I suggest to > > > > separate a function that selects the optab you can call from both > > > > expansion and vectorization for this) > > > > > > But how do I communicate this preference? I guess we'd need some kind of > > > target hook? But cond_* variants are an optimization on top of > > > vec_cbranch I > > didn't > > > think we'd need to check for the cond_ variant in the vectorizer. As we > > > have a > > > default fallback. > > > > Look at how we communicate the else value elsehwere. As it's not > > going to be explicit on the GIMPLE_COND I'd hope it will re-analyze > > the same (so use a common function). > > Thanks for the hint, will read up. > > > > > > > > > > > > -- vec_cmp > > > > > > > > > > At the moment vector compares are always unpredicated. Masking or LENs > > are > > > > > applied to the result of the unpredicated compare. > > > > > > > > > > E.g. > > > > > > > > > > a = b `cmp` c > > > > > d = a & loop_mask > > > > > > > > > > and we rely on combine to push the mask into the compare later on. > > however > > > > with > > > > > the introduction of cond_vec_cbranch etc we rely on CSE being able to > > > > > CSE > > > > duplicate > > > > > compares. CSE however runs before combine and so we need a mechanism > > to > > > > expand > > > > > the compares as predicated from the start. > > > > > > > > > > This also allows us to simplify some backend patterns. The proposed > > > > > optabs > > are > > > > > > > > > > -- cond_vec_cmp, cond_len_vec_cmp > > > > > > > > > > which are similar to vec_cmp$a$b such that it takes a condition code > > > > > as > > > > argument > > > > > and a predicate or len. This is an optimization and so it's an > > > > > optional optab > > > > > for targets. A target would need to implement both vec_cmp and > > cond_vec_cmp > > > > > to get the optimization. > > > > > > > > So like the branch case this needs an else value, just in case this > > > > wasn't obvious. > > > > > > Yep. > > > > > > > > > > > > --- Predicate operations > > > > > > > > > > The following ones will get new optabs and IFN and will be generated > > > > > by the > > > > > vectorizer. At the moment I don't know how the RVV code deals with > > > > > similar > > > > > operations. I will assume that for their case they work on > > > > > masks/predicates > > as > > > > > well as that is the result of the vector compares. Do tell me if I > > > > > need a LEN > > > > > version of these, but I would appreciate links to instructions so I > > > > > know how to > > > > > expect them to be used. > > > > > > > > > > These operations are used to allow us to not need the scalar epilogue. > > > > > > > > > > -- Predicate partitioning > > > > > > > > > > When an early break is vectorized > > > > > > > > > > if (a `cmp` b) > > > > > > > > > > Then if we want to avoid the scalar the scalar epilogue we would need > > > > > to > > handle > > > > > such cases as > > > > > > > > > > return i > > > > > return a[i] > > > > > > > > > > etc. > > > > > > > > > > The first one is return i. My patches change it such that we now use > > > > > a scalar > > > > > IV for the forced live IVs. Which means return i becomes > > > > > > > > So that's possibly an additional scalar IV, right? > > > > > > Yes that's right. > > > > > > > > > > > > mask1 = all_before_first (mask) > > > > > scalar_iv + count_active (mask1); > > > > > > > > So 'mask' is the mask for the if (a cmp b), aka it indicates whether > > > > we exited for a lane. So this is ffs (mask) - 1? > > > > > > Yeah correct. > > > > > > > > > > > > all_before_first and all_after_first map to the SVE instructions BRKB > > > > > and BRKA > > > > > essentially all_before_first makes all lanes as active in the > > > > > predicate until > > > > > but not including the first active element. > > > > > > > > > > So a mask of {0,0,0,1} becomes {1,1,1,0} such that the scalar_iv + 3 > > > > > which > > gives > > > > > us the final value of i. In SVE the count_active becomes CNTP and x > > > > > + CNTP > > > > > gets optimized to INCP, so we don't need to model INCP as optabs. > > > > > > > > > > The mask1 will also be used if the loop had a store, so we can only > > > > > do the > > > > > partials stores outstanding when we exit the loop. i.e. store > > > > > movement > > would > > > > > now also duplicate the stores in the exit block and mask them by > > > > > mask1. > > > > > > > > So the original scalar IV had an evolution { init, +, step }, the > > > > "vector" IV will have { init, +, VF * step } then, right? > > > > > > Yeah, and IVopts is actually pretty good in mixing this new scalar IV > > > with the IV > > > the vectorizer generates as new loop control. So we actually in most > > > cases only > > > end up having one scalar IV in the end, just with different addressing. > > > > > > > I wonder > > > > whether in case we have an actual vector IV for the IV (because of > > > > other uses), the epilog could use i == vector-i[0], so extract > > > > element zero from a vector IV to compute the scalar i? > > > > > > We'd still have a normal scalar IV there wouldn't we? I assume you're > > > talking about the case where e.g. you have a[i] = i ? > > > > No, I meant for the return i; case, if we have a vector_i[] then > > the scalar IV i on exit with the evolution above is vector_i[0], no? > > If you need the 'i' for the lane that exited you then add ffs (mask) - 1 > > to that. > > Yes, so return i is quite easy, it's > > mask1 = all_before_first (mask) > return scalar_iv + COUNT_ACTIVE(mask1) > > lets say the result of the compare is > > mask = {0,0,1,1} > mask1 = {1,1,0,0} > and so the return value is scalar_iv + 2; > > where scalar_iv is counting the IV at the start of the iteration. > > For SVE, this ends up being scalar_iv + CNTP (mask1) > Which the backend will then transform into INCP (scalar_iv, mask1) > > > > > And yes, we have a scalar counting IV, but currently that counts > > vector iterations up from zero to N. I'm not sure IVOPTs can easily > > relate the two? > > > > > My code atm generates both the vector and scalar IV and the vector IV > > > becomes > > > unused. I've done this for the reason you outlined above. i.e. that if > > > you actually > > > need the vector IV that the reduction would be potentially better. > > > > > > This was also practical because the vectorizer didn't like me generating > > > scalar code > > > only from a node in an SLP tree. I tried to not do this, but it resulted > > > in quite > > invasive > > > changes. So now I just generate the scalar IV as an additional output, > > > which was > > cleaner > > > and worked nicely, but also gave me the option I described above. > > > > I see. It's already a bit awkward how we handle fold-left reduction > > scalar reduction IVs, so I guess expected. I would expect putting a > > scalar def as "vector def" works fine, but of course not if you need > > both, the scalar and the vector one. > > > > I haven't tried as a def. Maybe a def could work? But I think for now the > flexibility to > choose might come in handy. I'm hoping to have it to experiment with codegen > and > if not needed to change it later. It also makes the patch much simpler to > reason about. > > But can investigate this if you prefer? > > > > > > > > > > return a[i] > > > > > > > > > > is handled by using extract_first_active, in SVE these map to the > > > > > COMPACT or > > > > > SPLICE instruction depending on the size of the lanes. The usage > > > > > would be > > > > > > > > So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller > > > > integer element types). Those could be used for this but they have > > > > a vector result (and element zero would be the first active). > > > > > > > > > > This is a good point. COMPACT is actually the same, it returns a vector > > > with > > > all the active elements shuffled down. Which I think is the same as > > > vcompress. > > > > > > I could instead make it return a vector and do an explicit vec_select. > > > > Yes. > > > > > > But don't you possibly want the last inactive as well, dependent on > > > > whether this is a peeled/not peeled exit? We could shift the mask > > > > by one for either case. > > > > > > Yeah, but I think peeled can be handled with all_including_first > > > So for {0,0,1,1} it returns {1,1,1,0} and then we can use the existing > > > EXTRACT_LAST. > > > > > > This would be BRKB in SVE. So this should already be covered in the > > > proposal. > > > > > > I do realize the name I used before isn't very clear. So lets rename, the > > > two > > > optabs/ifn we need is all_before_first, all_including_first. > > > > What do these do? Count lanes in a maks? > > No, they're essentially mask partitioning operation. > Given a mask, {0,0,1,0} > > all_before_first returns {1,1,0,0} > and all_including_first {1,1,1,0} > > They're used to create a mask for when you're dealing with control flow > situations where you have to partition the mask into parts you have to still > execute from one block and parts where you don't. > > so lets say > > a[i] = .. > if (a[i] > 3) > return 0; > > on exit, we still have to perform the operations on a[i] from the mask > created by all_before_first to maintain the correct side-effects. > > I think from your pervious email, conceptually all_including_first is a > signed >> 1 on all_before_first, but we can't model it like that due to > how the requirements of active lanes are in SVE. The top bits could be 0. > > > > > > > > > > > vcompresspd is not available on AVX2 or SSE4.1, using a vector-vector > > > > permute to get the element 'i' % nunits to lane zero would be another > > > > possibility, also for non-float or double sized elements we need sth > > > > like this. > > > > > > > > I do wonder whether we want to have the compress/expand as actual > > > > optabs when we use them. Having an extract_first (without _active, > > > > following extract_last_optab) is probably OK to abstract this to > > > > some extent. extract_last doesn't specify what happens if no > > > > bit is set in the mask, fold_extract_last seems to be the same > > > > but with an else value - I wonder whether we should canonicalize > > > > those and thus have an else value for extract_first. > > > > > > I assume you then mean cond_extract_first? Since we'd definitely > > > need the mask though to quantify what first is. > > > > extract_last already has a mask argument. "cond" doesn't really > > fit here as we only have a single scalar result? So I'd follow > > the namign scheme of extract_last (though fold_ is odd here). > > No problem. > > > > > > > > > > > > extract_first_active (vect_a, mask) > > > > > > > > > > Note that for this IFN it is undefined what would happen if mask is > > > > > all false. > > > > > This is required to match the SVE semantics of the operations but it's > > > > > guaranteed the vectorizer does only calls the IFN when at least one > > > > > lane is > > > > > active. > > > > > > > > > > I do not believe I need a LEN version here either? But If If I'm > > > > > wrong It would > > > > > be useful to have a small example. > > > > > > > > I think you need a len variant unless the mask producer had len > > > > applied with an else value of 0 (IIRC RVV always preferse 'undefined' > > > > as else value). OTOH the "first" element - if one is set and we never > > > > require 'else' - should work with or without loop masking (with len or > > > > mask). > > > > > > Yeah, it would have, since the mask would always come from a compare > > > which have len applied to it. But I'm wondering what the RVV equivalent > > > here is. > > > > > > > > > > > That said, I do wonder why we have both extract_last and > > > > fold_extract_last. > > > > > > I.. also wondered this. I remember asking this before :') > > > > It somehow suggests the ISA has two variants? > > I'll investigate. > > Both other than this, are you broadly OK with this proposal?
Yes. We'll have to see how the scalar IV thing turns out. > I'm hoping to get something concrete so I can discuss at cauldron and > not waste your time too much :) ;) Richard. > Thanks, > Tamar > > > > > > Tamar > > > > > > > > > > > Richard. > > > > > > > > > > > > > > Thanks, > > > > > Tamar > > > > > > > > > > > > > -- > > > > Richard Biener <[email protected]> > > > > SUSE Software Solutions Germany GmbH, > > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > Nuernberg) > > > > > > > -- > > Richard Biener <[email protected]> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > -- Richard Biener <[email protected]> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
