Thanks Kewen. 

It seems that this WHILE_LEN proposal can help s390 when compiling with
--param vect-partial-vector-usage=2.

Would you mind applying this patch, adding WHILE_LEN support to the s390
backend, and testing it to see the overall benefit for s390 as well as the
correctness of the generated sequence?
If it causes any correctness issue for s390 or rs6000 (I saw
len_load/len_store in rs6000 too), I can fix the patch for you.

I hope both RVV and the IBM targets can benefit from this patch.

Thanks.


juzhe.zh...@rivai.ai
 
From: Kewen.Lin
Date: 2023-04-12 20:56
To: juzhe.zh...@rivai.ai; richard.sandiford; rguenther
CC: gcc-patches; jeffreyalaw; rdapp
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for 
auto-vectorization
Hi!
 
on 2023/4/12 19:37, juzhe.zh...@rivai.ai wrote:
> Thank you. Richard.
> 
> 
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
> 
 
Yeah, Richard is right, we don't have some special hardware instruction
for efficient length calculation.
 
> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current
> length-based loop control to the WHILE_LEN pattern in this patch could
> improve their performance.
 
Because of some drawbacks of the vector load/store with length in bytes on
Power (for example, we need one extra GPR to hold the length, and the length
has to sit in the most significant 8 bits, which requires an extra shift),
we use normal vector load/store in the main loop and only use vector
load/store with length for the epilogue.  For the epilogue, the remaining
length is known to be less than the whole vector length, so the related
sequence can be optimized.  I just checked the s390 code, which also enables
it only for the epilogue.  From this perspective, the WHILE_LEN proposal may
not give us more.  But for the case of vect-partial-vector-usage=2 (fully
adopting vector with length in the main loop), I think the proposed sequence
looks better to me.
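
To make the comparison concrete, here is a rough scalar C model of the two
loop-control styles for the vect-partial-vector-usage=2 case.  It is only a
sketch and not what GCC actually emits: VF, min_sz, while_len_model,
add_inc_iv and add_dec_iv are hypothetical names, and WHILE_LEN is assumed
here to behave like a plain min (RVV hardware may be allowed to return
something else).

```c
#include <stddef.h>

#define VF 4 /* hypothetical number of elements per vector */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Assumed scalar stand-in for WHILE_LEN: elements to process this
   iteration.  Modelled as a plain min; this is an assumption.  */
static size_t while_len_model(size_t remaining, size_t vf)
{
  return min_sz(remaining, vf);
}

/* Incrementing IV: the length is recomputed from the index on every
   iteration (roughly the open-coded style).  */
void add_inc_iv(int *dst, const int *a, const int *b, size_t n)
{
  for (size_t i = 0; i < n; i += VF) {
    size_t len = min_sz(n - i, VF);
    /* Inner loop models len_load / add / len_store on len elements.  */
    for (size_t j = 0; j < len; j++)
      dst[i + j] = a[i + j] + b[i + j];
  }
}

/* Decrementing IV: the remaining count is the IV, and the modelled
   WHILE_LEN result drives both the pointer bumps and the decrement.  */
void add_dec_iv(int *dst, const int *a, const int *b, size_t n)
{
  size_t remaining = n;
  while (remaining > 0) {
    size_t len = while_len_model(remaining, VF);
    for (size_t j = 0; j < len; j++)
      dst[j] = a[j] + b[j];
    dst += len;
    a += len;
    b += len;
    remaining -= len;
  }
}
```

Both functions compute the same result; the difference is only in how the
per-iteration length and the loop exit condition are derived.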
 
BR,
Kewen
 
