Thanks, Kewen. It seems that this WHILE_LEN proposal can help s390 when compiling with the --param vect-partial-vector-usage=2 option.
Would you mind applying this patch, supporting WHILE_LEN in the s390 backend, and testing it to check both the overall benefit for s390 and the correctness of this sequence? If it causes any correctness issue for s390 or rs6000 (I saw len_load/len_store in rs6000 too), I can fix this patch for you. I hope both RVV and IBM targets can benefit from this patch.

Thanks.

juzhe.zh...@rivai.ai

From: Kewen.Lin
Date: 2023-04-12 20:56
To: juzhe.zh...@rivai.ai; richard.sandiford; rguenther
CC: gcc-patches; jeffreyalaw; rdapp
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

Hi!

on 2023/4/12 19:37, juzhe.zh...@rivai.ai wrote:
> Thank you. Richard.
>
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
>

Yeah, Richard is right, we don't have any special hardware instruction for efficient length calculation.

> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current
> length-based loop control to the WHILE_LEN pattern in this patch can
> improve their performance.

Because of some drawbacks of vector load/store with length in bytes on Power (e.g. we need one extra GPR to hold the length, and the length must sit in the most significant 8 bits, which requires an extra shift), we use normal vector load/store in the main loop and only use vector load/store with length for the epilogue. For the epilogue, the remaining length is known to be less than the full vector length, so the related sequence can be optimized.
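To make the epilogue-only scheme concrete, here is a minimal scalar sketch in C. It assumes a 16-byte vector and that Power's load/store with length (assumed here to be lxvl/stxvl) reads the byte count from the top 8 bits of a GPR. The helper names power_len_operand, vadd_with_len and add_bytes are hypothetical and only model the control flow; this is not GCC's generated code.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16  /* assumption: 16-byte vector register */

/* Hypothetical helper: the byte count must occupy the most significant
   8 bits of the length GPR, hence the extra shift mentioned above. */
static uint64_t power_len_operand(size_t len_bytes)
{
    return (uint64_t)len_bytes << 56;
}

/* Hypothetical helper: scalar stand-in for one vector add whose active
   length is decoded from a Power-style length operand. */
static void vadd_with_len(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                          uint64_t len_op)
{
    size_t len = (size_t)(len_op >> 56);
    for (size_t i = 0; i < len; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Main loop: plain full-width vector ops (modeled as 16-byte chunks).
   Epilogue: the remainder is known to be < VEC_BYTES, so no MIN is
   needed before building the length operand. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + VEC_BYTES <= n; i += VEC_BYTES)
        for (size_t j = 0; j < VEC_BYTES; j++)   /* one full vector op */
            dst[i + j] = (uint8_t)(a[i + j] + b[i + j]);

    size_t rem = n - i;                           /* 0 <= rem < VEC_BYTES */
    if (rem != 0)
        vadd_with_len(dst + i, a + i, b + i, power_len_operand(rem));
}
```

With n = 37, the main loop handles two full 16-byte chunks and a single length-controlled operation covers the last 5 bytes.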
I just checked the s390 code, which also enables it only for the epilogue. From this perspective, this WHILE_LEN proposal may not bring us any extra benefit. But for the case of vect-partial-vector-usage=2 (fully adopting vector with length in the main loop), the proposed sequence looks better to me.

BR,
Kewen
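For contrast, the vect-partial-vector-usage=2 case can be sketched as a decrement-IV loop in which every iteration, including the last partial one, is length-controlled. This is a minimal scalar model that assumes WHILE_LEN yields MIN (remaining, VF); the patch defines the real semantics, and a target such as RVV may compute the length differently. The names while_len, scale_all and the VF value are hypothetical.

```c
#include <stddef.h>

#define VF 8  /* assumption: elements per vector */

/* Hypothetical scalar stand-in for WHILE_LEN, assuming it returns
   MIN (remaining, vf). */
static size_t while_len(size_t remaining, size_t vf)
{
    return remaining < vf ? remaining : vf;
}

/* Length-controlled loop with a decrementing IV: each iteration asks
   while_len how many elements to process, then counts the IV down by
   that amount. Returns the number of vector iterations executed. */
size_t scale_all(int *x, size_t n, int k)
{
    size_t iters = 0;
    for (size_t remaining = n; remaining > 0; ) {
        size_t len = while_len(remaining, VF);
        for (size_t j = 0; j < len; j++)   /* one length-controlled vector op */
            x[j] *= k;
        x += len;
        remaining -= len;                  /* IV counts down to zero */
        iters++;
    }
    return iters;
}
```

With n = 19 and VF = 8, the loop runs three times with lengths 8, 8 and 3, and no separate epilogue is needed.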