Hi!

on 2023/4/12 19:37, juzhe.zh...@rivai.ai wrote:
> Thank you. Richard.
>
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
>
Yeah, Richard is right, we don't have any special hardware instruction
for efficient length calculation.

> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current flow of
> length-loop-control into the WHILE_LEN pattern that this patch can improve
> their performance.

Based on some cons of the vector load/store with length in bytes on
Power (e.g. we need one extra GPR to hold the length, and the length
needs to be in the most significant 8 bits, which requires an extra
shift), we use normal vector load/store in the main loop and only use
vector load/store with length for the epilogue. For the epilogue, the
remaining length is known to be less than the whole vector length, so
the related sequence can be optimized. I just had a check on the s390
code, which also enables it only for the epilogue. From this
perspective, this WHILE_LEN proposal may not give us more. But for the
case of vect-partial-vector-usage=2 (fully adopting vector with length
in the main loop), I think the proposed sequence looks better to me.

BR,
Kewen
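
For readers following the thread, the two loop shapes discussed above can be sketched in plain C. This is only an illustration under stated assumptions, not the code GCC generates: VF, add_with_len, add_len_controlled and add_main_plus_epilogue are hypothetical names, and the scalar helper merely stands in for a vector operation that takes a length.

```c
#include <stddef.h>

#define VF 4 /* hypothetical number of elements per vector */

/* Stand-in for a vector add with length: only the first len lanes
   are processed.  A real target would use a single instruction. */
static void add_with_len(int *dst, const int *a, const int *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = a[i] + b[i];
}

/* Shape 1: every iteration is length-controlled (roughly what
   vect-partial-vector-usage=2 asks for).  The length is recomputed
   per iteration with a min and a sub.  Returns the iteration count. */
size_t add_len_controlled(int *dst, const int *a, const int *b, size_t n)
{
    size_t iters = 0;
    size_t remaining = n;
    for (size_t i = 0; remaining > 0; i += VF) {
        size_t len = remaining < VF ? remaining : VF; /* min */
        add_with_len(dst + i, a + i, b + i, len);
        remaining -= len;                             /* sub */
        iters++;
    }
    return iters;
}

/* Shape 2: full vectors in the main loop, one length-controlled
   operation for the epilogue.  The epilogue length is always less
   than VF.  Returns that epilogue length. */
size_t add_main_plus_epilogue(int *dst, const int *a, const int *b, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        add_with_len(dst + i, a + i, b + i, VF); /* stands for a normal vector op */
    size_t tail = n - i;
    if (tail > 0)
        add_with_len(dst + i, a + i, b + i, tail);
    return tail;
}
```

In the second shape the length is computed once, outside the main loop, which is why a cheaper per-iteration length calculation matters mostly for the first shape.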