On Wed, Apr 12, 2023 at 1:18 PM Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>
> Richard Biener <rguent...@suse.de> writes:
> > On Wed, 12 Apr 2023, juzhe.zh...@rivai.ai wrote:
> >
> >>
> >> >> Thanks for the detailed explanation.  Just to clarify - with RVV
> >> >> there's only a single mask register, v0.t, or did you want to
> >> >> say an instruction can only specify a single mask register?
> >>
> >> RVV has 32 (v0~v31) vector registers in total.
> >> We can store a vector data value or a mask value in any of them.
> >> We also have mask-logic instructions, for example mask-and between any
> >> vector registers.
> >>
> >> However, any vector operation, for example vadd.vv, can only be
> >> predicated by v0 (in asm it is v0.t), which is the first vector register.
> >> We can't predicate vadd.vv with v1 - v31.
> >>
> >> So, you can imagine that every time we want to use a mask to predicate a
> >> vector operation, we should always first store the mask value
> >> into v0.
> >>
> >> So, we can write an intrinsic sequence like this:
> >>
> >> vmseq v0,v8,v9 (store mask value to v0)
> >> vmslt v1,v10,v11 (store mask value to v1)
> >> vmand v0,v0,v1
> >> vadd.vv ...v0.t (predicate mask should always be v0).
> >
> > Ah, I see - that explains it well.
> >
> >> >> ARM SVE would have a loop control mask and a separate mask
> >> >> for the if (cond[i]) which would be combined with a mask-and
> >> >> instruction to a third mask which is then used on the
> >> >> predicated instructions.
> >>
> >> Yeah, I know it. The ARM SVE way is more elegant than what RVV does.
> >> However, for RVV, we can't follow this flow.
> >> We don't have a "whilelo" instruction to generate a loop control mask.
> >
> > Yep.  Similar for AVX512 where I have to use a vector compare.  I'm
> > currently using
> >
> >  { 0, 1, 2 ... } < { remaining_len, remaining_len, ... }
> >
> > and careful updating of remaining_len (we know it will either
> > be adjusted by the full constant vector length or updated to zero).
> >
> >> We can only do loop control with the length generated by vsetvl.
> >> And we can only use "v0" as the mask to predicate vadd.vv, and the mask
> >> value can only be generated by comparison or mask logical instructions.
> >>
> >> >> PowerPC and s390x might be able to use WHILE_LEN as well (though
> >> >> they only have LEN variants of loads and stores) - of course
> >> >> only "simulating it".  For the fixed-vector-length ISAs the
> >> >> predicated vector loop IMHO makes most sense for the epilogue to
> >> >> handle low-trip loops better.
> >>
> >> Yeah, I wonder how they do the flow control (if (cond[i])).
> >> For RVV, you can imagine I will need to add a pattern
> >> LEN_MASK_LOAD/LEN_MASK_STORE (length generated by WHILE_LEN and mask
> >> generated by comparison)
> >>
> >> I think we can CC IBM folks to see whether we can make WHILE_LEN work
> >> for both IBM and RVV ?
> >
> > I've CCed them.  Adding WHILE_LEN support to rs6000/s390x would be
> > mainly the "easy" way to get len-masked (epilog) loop support.
>
> I think that already works for them (could be misremembering).
> However, IIUC, they have no special instruction to calculate the
> length (unlike for RVV), and so it's open-coded using vect_get_len.
>
> I suppose my two questions are:
>
> (1) How easy would it be to express WHILE_LEN in normal gimple?
>     I haven't thought about this at all, so the answer might be
>     "very hard".  But it reminds me a little of UQDEC on AArch64,
>     which we open-code using MAX_EXPR and MINUS_EXPR (see
>     vect_set_loop_controls_directly).
>
>     I'm not saying WHILE_LEN is the same operation, just that it seems
>     like it might be open-codeable in a similar way.

I think WHILE_LEN is saturate-to-zero subtraction.  So when the IV
can be expressed as a signed value it is

  remain = MAX (0, remain - vf);

The details are more complicated when you need an unsigned IV.
It might be that WHILE_LEN for RVV computes remain % VL, so there
may be another MIN around it (not sure).

For the AVX512 work I also have a scalar 'remain' like above, but
currently I'm adding a branch

  do {
    if (remain < vf)
      mask = ... vector compare ..;
    else
      mask = all-ones;
  } while (mask-not-all-zeros);

so I'm using the mask as the control "IV".  But that's because I
open-code WHILE_ULT at RTL expansion time and this is how the
vectorizer works for SVE.  When manually creating a loop mask in the
vectorizer, tracking 'remain' is easier.

Note the extra control flow complicates the fully masked variant;
for the epilog we know remain < vf and that we'll immediately
exit the loop.

>     Even if we can open-code it, we'd still need some way for the
>     target to select the "RVV way" from the "s390/PowerPC way".
>
> (2) What effect does using a variable IV step (the result of
>     the WHILE_LEN) have on ivopts?  I remember experimenting with
>     something similar once (can't remember the context) and not
>     having a constant step prevented ivopts from making good
>     addressing-mode choices.

Any kind of variable-length stuff (WHILE_ULT or WHILE_LEN) will probably
make niter analysis fail.  All IV uses that are not SCEV analyzable
will just remain as-is, as IVOPTs cannot deal with them either - but
usually that should be only the control IV.

Richard.

>
> Thanks,
> Richard
>
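
For illustration only, the saturating 'remain' update described above can be sketched in plain C.  This is not GCC code: the function names (step_signed, step_unsigned, add_one_len_loop) are hypothetical, the inner scalar loop merely stands in for a length-controlled vector operation, and the per-iteration length is assumed to be MIN (remain, vf), which may not match what WHILE_LEN computes on RVV.

```c
#include <assert.h>
#include <stddef.h>

/* Signed IV: remain = MAX (0, remain - vf).  */
long step_signed(long remain, long vf)
{
    long next = remain - vf;
    return next > 0 ? next : 0;
}

/* Unsigned IV: remain - vf would wrap around, so compare before
   subtracting instead of clamping afterwards.  */
size_t step_unsigned(size_t remain, size_t vf)
{
    return remain > vf ? remain - vf : 0;
}

/* Hypothetical length-controlled loop: each iteration handles
   len = MIN (remain, vf) elements (assumed semantics) and then
   updates 'remain' with the saturating step.  Adds 1 to every
   element and returns the number of iterations executed.  */
size_t add_one_len_loop(int *data, size_t n, size_t vf)
{
    assert(vf > 0);
    size_t remain = n;
    size_t iters = 0;
    int *p = data;

    while (remain > 0) {
        size_t len = remain < vf ? remain : vf;
        /* Stand-in for one length-predicated vector operation.  */
        for (size_t i = 0; i < len; i++)
            p[i] += 1;
        p += len;
        remain = step_unsigned(remain, vf);
        iters++;
    }
    return iters;
}
```

With n = 10 and vf = 4 the sketch runs three iterations, with lengths 4, 4 and 2.  The step is constant (vf) until the last iteration, which is the property the "adjusted by the full vector length or updated to zero" remark relies on.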