On Thu, 11 Sep 2025, Tamar Christina wrote:
> This email is to discuss the optabs and IFNs that I will need to finish off
> the
> early break optimizations in GCC 16 and to minimize respins on the final
> patches.
>
> Optabs only, These will be handled through expand rewriting them if the
> targets
> supports them:
>
> -- cbranch
>
> At the moment we model vector compare and branch using cbranch on vector
> arguments. e.g.
>
> a = b `cmp` c
> if (a <cmp2> 0)
> ...
>
> where <cmp2> is != or == reflecting whether we want ANY or ALL.
>
> The plan is to deprecate this in favor of an explicit vector compare, which
> allows us to elide the != 0 and == 0. This allows us to be able to generate
> more efficient code in the backends. For instance for Adv. SIMD we can now
> use
> an SVE compare + branch for Adv. SIMD.
>
> For this feature 6 new optabs are needed, and cbranch will be made unsupported
> for vector arguments and I will update all backends.
>
> -- vec_cbranch_any, vec_cbranch_all
>
> These are unpredicated version of the compares. This means this version also
> does not have an ELSE argument as all lanes are relevant.
>
> cond_vec_cbranch_any, cond_vec_cbranch_all
> cond_len_vec_cbranch_any, cond_len_vec_cbranch_all
>
> These are the predicated equivalent of the above. These take the
> predicate/len
> of the operation and also takes an ELSE argument for what happens when the
> predicate is inactive. I think the default is zero. But I'm unsure how to
> handle when a target wants something other than zero since this is an expand
> time
> rewriting and so I can't reject the expansion.
You are rewriting from a GIMPLE_COND, right? If so the vectorizer
should have verified an appropriate expansion exists (I suggest to
separate a function that selects the optab you can call from both
expansion and vectorization for this)
> -- vec_cmp
>
> At the moment vector compares are always unpredicated. Masking or LENs are
> applied to the result of the unpredicated compare.
>
> E.g.
>
> a = b `cmp` c
> d = a & loop_mask
>
> and we rely on combine to push the mask into the compare later on. however
> with
> the introduction of cond_vec_cbranch etc we rely on CSE being able to CSE
> duplicate
> compares. CSE however runs before combine and so we need a mechanism to
> expand
> the compares as predicated from the start.
>
> This also allows us to simplify some backend patterns. The proposed optabs
> are
>
> -- cond_vec_cmp, cond_len_vec_cmp
>
> which are similar to vec_cmp$a$b such that it takes a condition code as
> argument
> and a predicate or len. This is an optimization and so it's an optional optab
> for targets. A target would need to implement both vec_cmp and cond_vec_cmp
> to get the optimization.
So like the branch case this needs an else value, just in case this
wasn't obvious.
> --- Predicate operations
>
> The following ones will get new optabs and IFN and will be generated by the
> vectorizer. At the moment I don't know how the RVV code deals with similar
> operations. I will assume that for their case they work on masks/predicates
> as
> well as that is the result of the vector compares. Do tell me if I need a LEN
> version of these, but I would appreciate links to instructions so I know how
> to
> expect them to be used.
>
> These operations are used to allow us to not need the scalar epilogue.
>
> -- Predicate partitioning
>
> When an early break is vectorized
>
> if (a `cmp` b)
>
> Then if we want to avoid the scalar the scalar epilogue we would need to
> handle
> such cases as
>
> return i
> return a[i]
>
> etc.
>
> The first one is return i. My patches change it such that we now use a scalar
> IV for the forced live IVs. Which means return i becomes
So that's possibly an additional scalar IV, right?
> mask1 = all_before_first (mask)
> scalar_iv + count_active (mask1);
So 'mask' is the mask for the if (a cmp b), aka it indicates whether
we exited for a lane. So this is ffs (mask) - 1?
> all_before_first and all_after_first map to the SVE instructions BRKB and BRKA
> essentially all_before_first makes all lanes as active in the predicate until
> but not including the first active element.
>
> So a mask of {0,0,0,1} becomes {1,1,1,0} such that the scalar_iv + 3 which
> gives
> us the final value of i. In SVE the count_active becomes CNTP and x + CNTP
> gets optimized to INCP, so we don't need to model INCP as optabs.
>
> The mask1 will also be used if the loop had a store, so we can only do the
> partials stores outstanding when we exit the loop. i.e. store movement would
> now also duplicate the stores in the exit block and mask them by mask1.
So the original scalar IV had an evolution { init, +, step }, the
"vector" IV will have { init, +, VF * step } then, right? I wonder
whether in case we have an actual vector IV for the IV (because of
other uses), the epilog could use i == vector-i[0], so extract
element zero from a vector IV to compute the scalar i?
> return a[i]
>
> is handled by using extract_first_active, in SVE these map to the COMPACT or
> SPLICE instruction depending on the size of the lanes. The usage would be
So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
integer element types). Those could be used for this but they have
a vector result (and element zero would be the first active).
But don't you possibly want the last inactive as well, dependent on
whether this is a peeled/not peeled exit? We could shift the mask
by one for either case.
vcompresspd is not available on AVX2 or SSE4.1, using a vector-vector
permute to get the element 'i' % nunits to lane zero would be another
possibility, also for non-float or double sized elements we need sth
like this.
I do wonder whether we want to have the compress/expand as actual
optabs when we use them. Having an extract_first (without _active,
following extract_last_optab) is probably OK to abstract this to
some extent. extract_last doesn't specify what happens if no
bit is set in the mask, fold_extract_last seems to be the same
but with an else value - I wonder whether we should canonicalize
those and thus have an else value for extract_first.
> extract_first_active (vect_a, mask)
>
> Note that for this IFN it is undefined what would happen if mask is all false.
> This is required to match the SVE semantics of the operations but it's
> guaranteed the vectorizer does only calls the IFN when at least one lane is
> active.
>
> I do not believe I need a LEN version here either? But If If I'm wrong It
> would
> be useful to have a small example.
I think you need a len variant unless the mask producer had len
applied with an else value of 0 (IIRC RVV always preferse 'undefined'
as else value). OTOH the "first" element - if one is set and we never
require 'else' - should work with or without loop masking (with len or
mask).
That said, I do wonder why we have both extract_last and
fold_extract_last.
Richard.
>
> Thanks,
> Tamar
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)