"Kewen.Lin" <li...@linux.ibm.com> writes: > on 2020/5/26 下午5:44, Richard Biener wrote: >> On Tue, 26 May 2020, Kewen.Lin wrote: >> >>> Hi Richi, >>> >>> on 2020/5/26 下午3:12, Richard Biener wrote: >>>> On Tue, 26 May 2020, Kewen.Lin wrote: >>>> >>>>> Hi all, >>>>> >>>>> This patch set adds support for vector load/store with length, Power >>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with >>>>> length, it's good to be exploited for those cases we don't have enough >>>>> stuffs to fill in the whole vector like epilogues. >>>>> >>>>> This support mainly refers to the handlings for fully-predicated loop >>>>> but it also covers the epilogue usage. Now it supports two modes >>>>> controlled by parameter vect-with-length-scope, it can support any >>>>> loops fully with length or just for those cases with small iteration >>>>> counts less than VF like epilogue, for now I don't have ready env to >>>>> benchmark it, but based on the current inefficient length generation, >>>>> I don't think it's a good idea to adopt vector with length for any loops. >>>>> For the main loop which used to be vectorized, it increases register >>>>> pressure and introduces extra computation for length, the pro for icache >>>>> seems not comparable. But I think it might be a good idea to keep this >>>>> parameter there for functionality testing, further benchmarking and other >>>>> ports' potential future supports. >>>> >>>> Can you explain in more detail what "vector load/store with length" does? >>>> Is that a "simplified" masked load/store which instead of masking >>>> arbitrary elements (and need a mask computed in the first place), masks >>>> elements > N (the length operand)? Thus assuming a lane IV decrementing >>>> to zero that IV would be the natural argument for the length operand? >>>> If that's correct, what data are the remaining lanes filled with? 
>>>>
>>>
>>> The vector load/store has one GPR holding a length in bytes (called n
>>> here) which controls how many bytes we want to load/store.  If n >
>>> vector_size (on Power it's 16), n is taken as 16; if n is zero, the
>>> storage access doesn't happen and the result for a load is the zero
>>> vector; if n > 0 but < vector_size, the remaining lanes are filled
>>> with zero.  On Power, the instruction checks bits 0:7 of the length
>>> GPR, so the length has to be adjusted accordingly.
>>>
>>> Your understanding is correct!  It's a "simplified" masked load/store
>>> that needs the length in bytes computed and only supports contiguous
>>> accesses.  For the lane IV, one should multiply it by the lane size
>>> in bytes and adjust it if needed.
>>>
>>>> From a look at the series description below you seem to add a new way
>>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC,
>>>> whether they have similar support and whether your approach can be
>>>> supported there?  ISTR SVE must have some similar support - what's
>>>> the reason you do not piggy-back on that?
>>>
>>> I may have missed something, but I didn't find any existing support in
>>> GCC that matches this.  Good suggestion on ISAs; I didn't review other
>>> ISAs.  For the current implementation, I heavily referred to the
>>> existing SVE fully-predicated loop support; it's stronger than "with
>>> length" in that it can deal with arbitrary elements, performs the
>>> whole loop fully predicated, etc.
>>>
>>>> I think a load like I described above might be represented as
>>>>
>>>>   _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>>
>>>> not sure if that actually works out though.  But given it seems it
>>>> is a contiguous load we shouldn't need an internal function here?
>>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I
>>>> guess on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>>
>>> IIUC, what you suggested is to avoid using an IFN here.
>>> May I know the reason?  Is it due to the maintenance effort on various
>>> IFNs?  I thought using an IFN was gentle.  And agreed, I share the
>>> concern about the size mismatch here, since the length can be large
>>> (extremely large probably, depending on the target's saturation
>>> limit) while the converted vector size can be small.  Besides, the
>>> length can be a variable.
>>>
>>>> That said, I'm not very happy seeing yet another way of doing loads
>>>> [for fully predicated loops].  I'd rather like to see a single
>>>> representation on GIMPLE at least.
>>>
>>> OK.  Does that mean IFNs aren't counted into this scope? :)
>>
>> Sure, going with a new IFN is a possibility.  I'd just avoid that if
>> possible ;)  How does SVE represent loads/stores for WHILE_ULT loops?
>> Would it not be possible to share that representation somehow?
>>
>> Richard.
>>
>
> Got it. :)  Currently SVE uses the IFNs .MASK_LOAD and .MASK_STORE for
> WHILE_ULT loops:
>
>   vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
>   ...
>   .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
>   ...
>   next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
>   if (next_mask_26 != { 0, ... })
>
> I thought about sharing it once, but didn't feel it's a good idea,
> since we would be taking something as masks that isn't actually masks,
> which could easily confuse people.  OTOH, Power or other potential
> targets probably support these kinds of masking instructions too, which
> could confuse people further.
>
> But I definitely agree that we can refactor some code to share with
> each other where possible.
>
> Btw, here the code with the proposed IFNs looks like:
>
>   vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
>   ...
>   .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
>   ivtmp_23 = ivtmp_22 + 16;
>   _25 = MIN_EXPR <ivtmp_23, 508>;
>   _26 = 508 - _25;
>   _27 = MIN_EXPR <_26, 16>;
>   if (ivtmp_23 <= 507)
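The MIN_EXPR sequence in the .LEN_LOAD/.LEN_STORE example above computes the per-iteration byte length.  A scalar C rendering may make it easier to follow; this is an illustrative sketch only (not GCC output), assuming 508 bytes of data total and a 16-byte vector, with `next_len` a made-up helper name:

```c
#include <stdint.h>

/* Sketch of the GIMPLE length computation above: given the byte offset
   reached after this iteration (ivtmp_23), compute the length for the
   next .LEN_LOAD/.LEN_STORE as min (16, bytes remaining of 508).  */
static uint64_t
next_len (uint64_t ivtmp_next)
{
  uint64_t capped = ivtmp_next < 508 ? ivtmp_next : 508; /* _25 = MIN_EXPR <ivtmp_23, 508> */
  uint64_t remain = 508 - capped;                        /* _26 = 508 - _25 */
  return remain < 16 ? remain : 16;                      /* _27 = MIN_EXPR <_26, 16> */
}
```

A full iteration thus gets length 16, while the final partial iteration (offsets 496..507, since 508 = 31 * 16 + 12) gets length 12, which is exactly what the lxvl/stxvl length operand needs.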
FWIW, I agree that adding .LEN_LOAD and .LEN_STORE seems like a good
approach.  I think it'll be more maintainable in the long run than trying
to have .MASK_LOADs and .MASK_STOREs that take a special mask operand.
(That would be too similar to VEC_COND_EXPR :-))

I'm not sure yet what the exact semantics wrt out-of-range values should
be for the IFN/optab though.  Maybe we should instead have some kind of
abstract, target-specific cookie created by a separate intrinsic.
Haven't thought much about it yet...

Thanks,
Richard
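For reference, the out-of-range behaviour Kewen described earlier in the thread (a length above the vector size saturating at 16 bytes, zero meaning no memory access at all, and a short length zero-filling the remaining lanes) can be modelled in plain C.  This is an illustrative sketch of those semantics, not GCC or Power ISA code; `vec128` and `len_load_model` are made-up names:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec128;

/* Model of a byte-length-controlled vector load: n > 16 behaves as 16,
   n == 0 performs no access and yields the zero vector, and for
   0 < n < 16 the remaining bytes are filled with zero.  */
static vec128
len_load_model (const uint8_t *p, uint64_t n)
{
  vec128 v;
  memset (v.b, 0, sizeof v.b);  /* lanes beyond the length read as zero */
  if (n > 16)
    n = 16;                     /* out-of-range lengths saturate */
  if (n > 0)
    memcpy (v.b, p, n);         /* contiguous access of exactly n bytes */
  return v;
}
```

Whether the IFN/optab should promise this saturating behaviour, or leave out-of-range lengths undefined, is precisely the open question above.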