"Kewen.Lin" <li...@linux.ibm.com> writes:
> on 2020/5/26 5:44 PM, Richard Biener wrote:
>> On Tue, 26 May 2020, Kewen.Lin wrote:
>> 
>>> Hi Richi,
>>>
>>> on 2020/5/26 3:12 PM, Richard Biener wrote:
>>>> On Tue, 26 May 2020, Kewen.Lin wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch set adds support for vector load/store with length.  Power
>>>>> ISA 3.0 brings the instructions lxvl/stxvl to perform vector load/store
>>>>> with length; they are worth exploiting in cases where we don't have
>>>>> enough elements to fill the whole vector, such as epilogues.
>>>>>
>>>>> This support mainly follows the handling for fully-predicated loops,
>>>>> but it also covers the epilogue usage.  It supports two modes
>>>>> controlled by the parameter vect-with-length-scope: vectorizing any
>>>>> loop fully with length, or only those cases with small iteration
>>>>> counts less than VF, like epilogues.  For now I don't have an
>>>>> environment ready to benchmark it, but given the currently inefficient
>>>>> length generation, I don't think it's a good idea to adopt vector with
>>>>> length for all loops.  For a main loop that would be vectorized anyway,
>>>>> it increases register pressure and introduces extra computation for
>>>>> the length; the icache benefit doesn't seem comparable.  But I think
>>>>> it's worth keeping the parameter for functionality testing, further
>>>>> benchmarking, and other ports' potential future support.
>>>>
>>>> Can you explain in more detail what "vector load/store with length" does?
>>>> Is that a "simplified" masked load/store which instead of masking 
>>>> arbitrary elements (and need a mask computed in the first place), masks
>>>> elements > N (the length operand)?  Thus assuming a lane IV decrementing
>>>> to zero that IV would be the natural argument for the length operand?
>>>> If that's correct, what data are the remaining lanes filled with?
>>>>
>>>
>>> The vector load/store takes a GPR holding a length in bytes (call it n)
>>> to control how many bytes we want to load/store.  If n > vector_size
>>> (16 on Power), n is taken as 16; if n is zero, the storage access
>>> doesn't happen and the load result is a zero vector; if n > 0 but
>>> < vector_size, the remaining lanes are filled with zero.  On Power,
>>> the instruction reads bits 0:7 of the length GPR, so the length has
>>> to be adjusted accordingly.
>>>
>>> Your understanding is correct!  It's a "simplified" masked load/store
>>> that needs the length computed in bytes and only supports contiguous
>>> access.  For the lane IV, one would multiply it by the lane size in
>>> bytes and adjust it if needed.
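As a rough scalar model of the semantics described above (illustrative only, not the actual lxvl/stxvl implementation; the function name is made up):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a load-with-length such as Power's lxvl:
   load min(n, 16) bytes from p into a 16-byte vector register image,
   zero-filling the remaining lanes.  */
static void len_load_model(uint8_t vec[16], const uint8_t *p, size_t n)
{
  size_t len = n > 16 ? 16 : n;   /* n beyond the vector size is capped.  */
  memset(vec, 0, 16);             /* n == 0 yields an all-zero vector.    */
  memcpy(vec, p, len);            /* lanes past len stay zero.            */
}
```

The "bits 0:7" remark means the byte count has to be placed in the top byte of the length GPR before the instruction can use it, which is the adjustment referred to above.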
>>>
>>>> From a look at the series description below you seem to add a new way
>>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>>>> they have similar support and whether your approach can be supported
>>>> there?  ISTR SVE must have some similar support - what's the reason
>>>> you do not piggy-back on that?
>>>
>>> I may be missing something, but I didn't find any support in GCC that
>>> matches this.  Good suggestion on the ISAs; I didn't review the others.
>>> For the current implementation, I referred heavily to the existing SVE
>>> fully-predicated loop support, which is stronger than "with length":
>>> it can deal with arbitrary elements, handle the loop fully, etc.
>>>
>>>>
>>>> I think a load like I described above might be represented as
>>>>
>>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>>
>>>> not sure if that actually works out though.  But given it seems it
>>>> is a contiguous load we shouldn't need an internal function here?
>>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
>>>> on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>>
>>> IIUC, what you suggested is to avoid using an IFN here.  May I know the
>>> reason?  Is it due to the maintenance effort on various IFNs?  I thought
>>> using an IFN was gentle.  And agreed, I shared the concern about the
>>> size mismatch here, since the length can be large (extremely large,
>>> perhaps; it depends on the target's saturation limit) while the
>>> converted vector size can be small.  Besides, the length can be a
>>> variable.
>>>
>>>>
>>>> That said, I'm not very happy seeing yet another way of doing loads
>>>> [for fully predicated loops].  I'd rather like to see a single
>>>> representation on GIMPLE at least.
>>>
>>> OK.  Does it mean IFN isn't counted into this scope?  :)
>> 
>> Sure, going with a new IFN is a possibility.  I'd just avoid that if
>> possible ;)  How does SVE represent loads/stores for whilelt loops?
>> Would it not be possible to share that representation somehow?
>> 
>> Richard.
>> 
>
> Got it.  :) Currently SVE uses the IFNs .MASK_LOAD and .MASK_STORE for
> WHILE_ULT loops:
>
>   vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
>   ...
>   .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
>   ...
>   next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
>   if (next_mask_26 != { 0, ... })
>
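For reference, the .WHILE_ULT in the dump above can be modeled in scalar code roughly as follows (a sketch with a made-up lane count, not GCC's or SVE's actual implementation):

```c
#include <stdbool.h>
#include <stddef.h>

/* Scalar model of a WHILE_ULT-style mask: lane i of the result is
   active iff base + i < bound.  NLANES is illustrative; SVE vectors
   are variable-length.  */
#define NLANES 4
static void while_ult_model(bool mask[NLANES], size_t base, size_t bound)
{
  for (size_t i = 0; i < NLANES; i++)
    mask[i] = base + i < bound;
}
```

The loop continues while any lane of the resulting mask is active, which matches the `next_mask_26 != { 0, ... }` test above.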
> I considered sharing it, but didn't feel it was a good idea, since we
> would be treating something as a mask that isn't actually a mask, which
> could easily confuse people.  OTOH, Power or other potential targets may
> also support real masking instructions, which would confuse people even
> further.
>
> But I definitely agree that we can refactor some codes to share with
> each other when possible.
>
> Btw, here the code with proposed IFNs looks like:
>
>   vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
>   ...
>   .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
>   ivtmp_23 = ivtmp_22 + 16;
>   _25 = MIN_EXPR <ivtmp_23, 508>;
>   _26 = 508 - _25;
>   _27 = MIN_EXPR <_26, 16>;
>   if (ivtmp_23 <= 507)
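The length computation in the sequence above boils down to min(vector_size, total - consumed).  A scalar sketch, assuming the 508-byte total and 16-byte vectors from the dump (the function name is made up):

```c
#include <stddef.h>

/* Scalar sketch of the GIMPLE length sequence: given the IV value for
   the next iteration (ivtmp_23) and the total byte count (508 above),
   compute the length for the next LEN_LOAD/LEN_STORE.  */
static size_t next_len(size_t iv_next, size_t total)
{
  size_t capped = iv_next < total ? iv_next : total; /* MIN_EXPR <ivtmp_23, 508> */
  size_t remaining = total - capped;                 /* 508 - _25              */
  return remaining < 16 ? remaining : 16;            /* MIN_EXPR <_26, 16>     */
}
```

So every iteration but the last gets a full 16-byte length, and the final iteration gets the leftover (508 mod 16 = 12 bytes here).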

FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
approach.  I think it'll be more maintainable in the long run than
trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
operand.  (That would be too similar to VEC_COND_EXPR :-))

I'm not sure yet what the exact semantics wrt out-of-range values should be
for the IFN/optab, though.  Maybe we should instead have some kind of
abstract, target-specific cookie created by a separate intrinsic.
Haven't thought much about it yet...

Thanks,
Richard
