Richard Biener <rguent...@suse.de> writes:
> On Wed, 12 Apr 2023, juzhe.zh...@rivai.ai wrote:
>
>> >> Thanks for the detailed explanation.  Just to clarify - with RVV
>> >> there's only a single mask register, v0.t, or did you want to
>> >> say an instruction can only specify a single mask register?
>>
>> RVV has 32 (v0~v31) vector registers in total.
>> We can store vector data values or mask values in any of them.
>> We also have mask-logic instructions, for example mask-and between any
>> vector registers.
>>
>> However, any vector operation, for example vadd.vv, can only be predicated
>> by v0 (in asm, v0.t), which is the first vector register.
>> We can't predicate vadd.vv with v1 - v31.
>>
>> So, you can imagine that every time we want to use a mask to predicate a
>> vector operation, we should always first store the mask value
>> into v0.
>>
>> So, we can write an intrinsic sequence like this:
>>
>> vmseq v0,v8,v9 (store mask value to v0)
>> vmslt v1,v10,v11 (store mask value to v1)
>> vmand v0,v0,v1
>> vadd.vv ...v0.t (predicate mask should always be v0).
>
> Ah, I see - that explains it well.
>
>> >> ARM SVE would have a loop control mask and a separate mask
>> >> for the if (cond[i]) which would be combined with a mask-and
>> >> instruction to a third mask which is then used on the
>> >> predicated instructions.
>>
>> Yeah, I know it.  The ARM SVE way is more elegant than what RVV does.
>> However, for RVV, we can't follow this flow.
>> We don't have a "whilelo" instruction to generate a loop control mask.
>
> Yep.  Similar for AVX512 where I have to use a vector compare.  I'm
> currently using
>
>  { 0, 1, 2 ... } < { remaining_len, remaining_len, ... }
>
> and careful updating of remaining_len (we know it will either
> be adjusted by the full constant vector length or updated to zero).
>
>> We can only do loop control with the length generated by vsetvl.
>> And we can only use "v0" to mask-predicate vadd.vv, and the mask value can
>> only be generated by comparison or mask-logical instructions.
>>
>> >> PowerPC and s390x might be able to use WHILE_LEN as well (though
>> >> they only have LEN variants of loads and stores) - of course
>> >> only "simulating it".  For the fixed-vector-length ISAs the
>> >> predicated vector loop IMHO makes most sense for the epilogue to
>> >> handle low-trip loops better.
>>
>> Yeah, I wonder how they do the flow control (if (cond[i])).
>> For RVV, you can imagine I will need to add a pattern
>> LEN_MASK_LOAD/LEN_MASK_STORE (length generated by WHILE_LEN and mask
>> generated by comparison)
>>
>> I think we can CC the IBM folks to see whether we can make WHILE_LEN work
>> for both IBM and RVV?
>
> I've CCed them.  Adding WHILE_LEN support to rs6000/s390x would be
> mainly the "easy" way to get len-masked (epilog) loop support.
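
For readers following along, the "length plus mask" predication discussed in the quoted text can be modelled in scalar C.  This is only a sketch under stated assumptions, not what any patch in this thread implements: VF, model_len and cond_add are hypothetical names, and model_len assumes the length is simply min (remaining, VF), which is a simplification of what vsetvl may return.

```c
#include <stddef.h>
#include <stdint.h>

enum { VF = 4 };  /* hypothetical number of lanes per vector */

/* Assumed model of the loop-control length: min (remaining, VF).
   Real hardware may choose the length differently.  */
static size_t
model_len (size_t remaining)
{
  return remaining < VF ? remaining : VF;
}

/* Scalar model of: if (cond[i]) dst[i] = a[i] + b[i];
   The outer loop is controlled by a length, the inner "lanes" are
   additionally predicated by a mask computed from a comparison
   (the value that would have to live in v0 on RVV).  */
void
cond_add (int32_t *dst, const int32_t *a, const int32_t *b,
          const int32_t *cond, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      size_t len = model_len (n - i);      /* length control */
      for (size_t lane = 0; lane < len; lane++)
        {
          int mask = cond[i + lane] != 0;  /* comparison-generated mask */
          if (mask)                        /* masked add and store */
            dst[i + lane] = a[i + lane] + b[i + lane];
        }
      i += len;                            /* step by the length used */
    }
}
```

Lanes at or beyond len are never touched, and lanes below len are touched only where the mask is set, which is the combination that a LEN_MASK_LOAD/LEN_MASK_STORE style operation would need to express.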

I think that already works for them (could be misremembering).
However, IIUC, they have no special instruction to calculate the
length (unlike for RVV), and so it's open-coded using vect_get_len.

I suppose my two questions are:

(1) How easy would it be to express WHILE_LEN in normal gimple?
    I haven't thought about this at all, so the answer might be
    "very hard".  But it reminds me a little of UQDEC on AArch64,
    which we open-code using MAX_EXPR and MINUS_EXPR (see
    vect_set_loop_controls_directly).

    I'm not saying WHILE_LEN is the same operation, just that it seems
    like it might be open-codeable in a similar way.

    Even if we can open-code it, we'd still need some way for the
    target to select the "RVV way" from the "s390/PowerPC way".

(2) What effect does using a variable IV step (the result of
    the WHILE_LEN) have on ivopts?  I remember experimenting with
    something similar once (can't remember the context) and not
    having a constant step prevented ivopts from making good
    addressing-mode choices.

Thanks,
Richard
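
To make question (1) a bit more concrete, here is a rough C model of what an open-coded version might look like.  It is a sketch under assumptions rather than a statement of what WHILE_LEN actually computes: while_len_model assumes the result is MIN (remaining, vf), sat_sub is one way to write a saturating decrement with only a MAX and a MINUS, and all three function names are hypothetical.

```c
#include <stddef.h>

/* Assumed model of the length: MIN (remaining, vf).  */
static inline size_t
while_len_model (size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;
}

/* Saturating decrement written as MAX (remaining, step) - step,
   so the result never wraps below zero.  */
static inline size_t
sat_sub (size_t remaining, size_t step)
{
  return (remaining > step ? remaining : step) - step;
}

/* Walk a loop of N elements with the modelled length as the IV step.
   Returns the number of vector iterations and stores the step used by
   the final iteration in *LAST_STEP.  Assumes vf > 0.  */
size_t
count_iterations (size_t n, size_t vf, size_t *last_step)
{
  size_t remaining = n;
  size_t iters = 0;
  *last_step = 0;
  while (remaining > 0)
    {
      size_t len = while_len_model (remaining, vf);
      *last_step = len;
      remaining = sat_sub (remaining, len);
      iters++;
    }
  return iters;
}
```

Under this model the step equals vf on every iteration except possibly the last, but nothing in the code itself says so, which is the situation question (2) is asking about.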