Hi!

on 2023/4/12 19:37, juzhe.zh...@rivai.ai wrote:
> Thank you. Richard.
> 
> 
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
> 

Yeah, Richard is right, we don't have any special hardware instruction
for efficient length calculation.

> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current flow of
> length-loop-control into the WHILE_LEN pattern that this patch can improve
> their performance.

Because of some drawbacks of the vector load/store with length in bytes on
Power (for example, we need one extra GPR to hold the length, and the length
has to sit in the most significant 8 bits, which requires an extra shift),
we use normal vector load/store in the main loop and only use vector
load/store with length for the epilogue.  For the epilogue, the remaining
length is known to be less than the whole vector length, so the related
sequence can be optimized.  I just checked the s390 code, which also
enables it only for the epilogue.  From this perspective, this WHILE_LEN
proposal may not give us more.  But for the case of
vect-partial-vector-usage=2 (fully adopting vector with length in the main
loop), I think the proposed sequence looks better.
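
To make the two shapes concrete, here is a rough scalar C model of them.
This is only a sketch for discussion, not what the vectorizer emits: VF,
vec_add_full and vec_add_len are hypothetical stand-ins for the vector
length in bytes, a normal vector operation and a load/store-with-length
operation.

```c
#include <assert.h>
#include <stddef.h>

#define VF 16 /* hypothetical vector length in bytes */

/* Stand-in for a normal (full-width) vector load/add/store. */
static void vec_add_full(unsigned char *dst, const unsigned char *a,
                         const unsigned char *b)
{
    for (size_t i = 0; i < VF; i++)
        dst[i] = (unsigned char)(a[i] + b[i]);
}

/* Stand-in for vector load/store with length: only len bytes are touched. */
static void vec_add_len(unsigned char *dst, const unsigned char *a,
                        const unsigned char *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (unsigned char)(a[i] + b[i]);
}

/* Shape 1: full vectors in the main loop, length only in the epilogue. */
void add_epilogue_len(unsigned char *dst, const unsigned char *a,
                      const unsigned char *b, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        vec_add_full(dst + i, a + i, b + i);

    size_t rem = n - i;
    assert(rem < VF); /* remaining length is known to be below VF here */
    if (rem)
        vec_add_len(dst + i, a + i, b + i, rem);
}

/* Shape 2: length-controlled main loop (the partial-vector-usage=2 case). */
void add_main_loop_len(unsigned char *dst, const unsigned char *a,
                       const unsigned char *b, size_t n)
{
    size_t i = 0;
    size_t remaining = n;
    while (remaining > 0) {
        /* The length is recomputed on every iteration. */
        size_t len = remaining < VF ? remaining : VF;
        vec_add_len(dst + i, a + i, b + i, len);
        i += len;
        remaining -= len;
    }
}
```

In shape 1 the length is computed once, outside the hot loop, which is
why a cheaper per-iteration length calculation matters little there.  In
shape 2 it is computed on every iteration, so that is where a better
sequence could pay off.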

BR,
Kewen
