Hi, Richard.

>> Yes, I think VEC_EXTRACT is suitable for this.
Thanks for the suggestion. Actually, I was trying to use VEC_EXTRACT yesterday,
but it ICEs since GCC fails to create an LCSSA PHI for VEC_EXTRACT.

For example, with ARM SVE: https://godbolt.org/z/rfbb4rfKv

vect dump IR:

;; Created LCSSA PHI: loop_mask_36 = PHI <loop_mask_22(3)>

<bb 8> [local count: 105119324]:
# vect_last_12.8_24 = PHI <vect_last_12.8_23(3)>
# loop_mask_36 = PHI <loop_mask_22(3)>
_25 = .EXTRACT_LAST (loop_mask_36, vect_last_12.8_24);

Then it can work.

I was trying to do the similar thing but with VEC_EXTRACT as follows for RVV:

...
loop_len_36 = SELECT_VL
....
# vect_last_12.8_24 = PHI <vect_last_12.8_23(3)>
// Missed an LCSSA PHI.
_25 = .VEC_EXTRACT (loop_len_36, vect_last_12.8_24);

This GIMPLE IR causes an ICE. When EXTRACT_LAST is used instead, the LCSSA PHI
is created, and with that PHI in place VEC_EXTRACT works:

...
loop_len_36 = SELECT_VL
....
# vect_last_12.8_24 = PHI <vect_last_12.8_23(3)>
# loop_len_22 = PHI <loop_len_36 (3)>
_25 = .VEC_EXTRACT (loop_len_22, vect_last_12.8_24);

I haven't figured out where to make GCC generate the LCSSA PHI for
VEC_EXTRACT. Could you give me some help with this? Thanks.

juzhe.zh...@rivai.ai

From: Richard Biener
Date: 2023-08-09 22:17
To: 钟居哲
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH] VECT: Support loop len control on EXTRACT_LAST vectorization

On Wed, 9 Aug 2023, 钟居哲 wrote:

> Hi, Richard.
>
> >> I'm a bit behind of email, but why isn't BIT_FIELD_REF enough for
> >> the case that the patch is handling?  It seems that:
>
> >> .EXTRACT_LAST (len, vec)
>
> >> is equivalent to:
>
> >> vec[len - 1]
>
> >> I think eventually there'll be the temptation to lower/fold it like that.
>
> Current BIT_FIELD_REF doesn't make use of LOOP_LEN.

Yes, BIT_FIELD_REF doesn't support variable offset.
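For reference, the variable-offset extraction under discussion can be modeled in scalar C; the function name below is illustrative, not GCC code:

```c
#include <stdint.h>

/* Scalar model of .EXTRACT_LAST (len, vec): with a length-controlled
   loop the last active element is vec[len - 1].  Because `len` is a
   runtime value (e.g. produced by .SELECT_VL), a BIT_FIELD_REF --
   whose bit position must be a compile-time constant -- cannot
   express this extraction.  Illustrative only.  */
static inline uint8_t
extract_last_model (const uint8_t *vec, int len)
{
  return vec[len - 1];
}
```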
> Consider the following case:
>
> #include <stdint.h>
> #define EXTRACT_LAST(TYPE) \
>   TYPE __attribute__ ((noinline, noclone)) \
>   test_##TYPE (TYPE *x, int n, TYPE value) \
>   { \
>     TYPE last; \
>     for (int j = 0; j < n; ++j) \
>       { \
>         last = x[j]; \
>         x[j] = last * value; \
>       } \
>     return last; \
>   }
>
> #define TEST_ALL(T) \
>   T (uint8_t)
>
> TEST_ALL (EXTRACT_LAST)
>
> The assembly:
> https://godbolt.org/z/z1PPT948b
>
> test_uint8_t:
>         mv a3,a0
>         ble a1,zero,.L10
>         addiw a5,a1,-1
>         li a4,14
>         sext.w a0,a1
>         bleu a5,a4,.L11
>         srliw a4,a0,4
>         slli a4,a4,4
>         mv a5,a3
>         add a4,a4,a3
>         vsetivli zero,16,e8,m1,ta,ma
>         vmv.v.x v3,a2
> .L4:
>         vl1re8.v v1,0(a5)
>         vmul.vv v2,v1,v3
>         vs1r.v v2,0(a5)
>         addi a5,a5,16
>         bne a4,a5,.L4
>         andi a4,a1,-16
>         mv a5,a4
>         vsetivli zero,8,e8,mf2,ta,ma
>         beq a0,a4,.L16
> .L3:
>         subw a0,a0,a4
>         addiw a7,a0,-1
>         li a6,6
>         bleu a7,a6,.L7
>         slli a4,a4,32
>         srli a4,a4,32
>         add a4,a3,a4
>         andi a6,a0,-8
>         vle8.v v2,0(a4)
>         vmv.v.x v1,a2
>         andi a0,a0,7
>         vmul.vv v1,v1,v2
>         vse8.v v1,0(a4)
>         addw a5,a6,a5
>         beq a0,zero,.L8
> .L7:
>         add a6,a3,a5
>         lbu a0,0(a6)
>         addiw a4,a5,1
>         mulw a7,a0,a2
>         sb a7,0(a6)
>         bge a4,a1,.L14
>         add a4,a3,a4
>         lbu a0,0(a4)
>         addiw a6,a5,2
>         mulw a7,a2,a0
>         sb a7,0(a4)
>         ble a1,a6,.L14
>         add a6,a3,a6
>         lbu a0,0(a6)
>         addiw a4,a5,3
>         mulw a7,a2,a0
>         sb a7,0(a6)
>         ble a1,a4,.L14
>         add a4,a3,a4
>         lbu a0,0(a4)
>         addiw a6,a5,4
>         mulw a7,a2,a0
>         sb a7,0(a4)
>         ble a1,a6,.L14
>         add a6,a3,a6
>         lbu a0,0(a6)
>         addiw a4,a5,5
>         mulw a7,a2,a0
>         sb a7,0(a6)
>         ble a1,a4,.L14
>         add a4,a3,a4
>         lbu a0,0(a4)
>         addiw a5,a5,6
>         mulw a6,a2,a0
>         sb a6,0(a4)
>         ble a1,a5,.L14
>         add a3,a3,a5
>         lbu a0,0(a3)
>         mulw a2,a2,a0
>         sb a2,0(a3)
>         ret
> .L10:
>         li a0,0
> .L14:
>         ret
> .L8:
>         vslidedown.vi v2,v2,7
>         vmv.x.s a0,v2
>         andi a0,a0,0xff
>         ret
> .L11:
>         li a4,0
>         li a5,0
>         vsetivli zero,8,e8,mf2,ta,ma
>         j .L3
> .L16:
>         vsetivli zero,16,e8,m1,ta,ma
>         vslidedown.vi v1,v1,15
>         vmv.x.s a0,v1
>         andi a0,a0,0xff
>         ret
>
> This patch is trying to
> optimize the codegen for RVV for length control; after this patch:
>
> Gimple IR:
>
> <bb 4> [local count: 955630224]:
> # vectp_x.6_22 = PHI <vectp_x.6_23(4), x_11(D)(3)>
> # vectp_x.10_30 = PHI <vectp_x.10_31(4), x_11(D)(3)>
> # ivtmp_34 = PHI <ivtmp_35(4), _33(3)>
> _36 = .SELECT_VL (ivtmp_34, 16);
> vect_last_12.8_24 = .MASK_LEN_LOAD (vectp_x.6_22, 8B, { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }, _36, 0);
> vect__4.9_28 = vect_last_12.8_24 * vect_cst__27;
> .MASK_LEN_STORE (vectp_x.10_30, 8B, { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }, _36, 0, vect__4.9_28);
> vectp_x.6_23 = vectp_x.6_22 + _36;
> vectp_x.10_31 = vectp_x.10_30 + _36;
> ivtmp_35 = ivtmp_34 - _36;
> if (ivtmp_35 != 0)
>   goto <bb 4>; [89.00%]
> else
>   goto <bb 5>; [11.00%]
>
> <bb 5> [local count: 105119324]:
> _26 = .EXTRACT_LAST (_36, vect_last_12.8_24); [tail call]
>
> ASM:
>
> test_uint8_t:
>         ble a1,zero,.L4
>         mv a4,a0
>         vsetivli zero,16,e8,m1,ta,ma
>         vmv.v.x v3,a2
> .L3:
>         vsetvli a5,a1,e8,m1,ta,ma
>         vle8.v v1,0(a0)
>         vsetivli zero,16,e8,m1,ta,ma
>         sub a1,a1,a5
>         vmul.vv v2,v1,v3
>         vsetvli zero,a5,e8,m1,ta,ma
>         vse8.v v2,0(a4)
>         add a0,a0,a5
>         add a4,a4,a5
>         bne a1,zero,.L3
>         addi a5,a5,-1
>         vsetivli zero,16,e8,m1,ta,ma
>         vslidedown.vx v1,v1,a5
>         vmv.x.s a0,v1
>         andi a0,a0,0xff
>         ret
> .L4:
>         li a0,0
>         ret
>
> I think this codegen is much better with this patch.
>
> >> FWIW, I agree an IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN would be OK,
> >> with a mask, vector, length and bias.  But even then, I think there'll
> >> be a temptation to lower calls with all-1 masks to val[len - 1 - bias].
> >> So I think the function only makes sense if we have a use case where
> >> the mask might not be all-1s.
>
> I am wondering: instead of adding
> IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN, does it make sense to use
> VEC_EXTRACT as follows?
>
> Loop:
>   loop_len_22 = SELECT_VL
>   ....
>
> Epilogue:
>   new_loop_len_22 = PHI <loop_len_22>
>   scalar_res = VEC_EXTRACT (v, new_loop_len_22 - 1 - BIAS)

Yes, I think VEC_EXTRACT is suitable for this.

Richard.

> Feel free to correct me if I am wrong.
>
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-08-09 21:30
> To: juzhe.zhong\@rivai.ai
> CC: rguenther; gcc-patches
> Subject: Re: [PATCH] VECT: Support loop len control on EXTRACT_LAST vectorization
>
> "juzhe.zh...@rivai.ai" <juzhe.zh...@rivai.ai> writes:
> > Hi, Richi.
> >
> >>> that should be
> >>>
> >>>   || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> >>>       && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> >>>
> >>> I think.  It seems to imply that SLP isn't supported with
> >>> masking/lengthing.
> >
> > Oh, yes.  At first glance, the original code is quite suspicious and your
> > comments make sense to me.
> >
> >>> Hum, how does CFN_EXTRACT_LAST handle both mask and length transparently?
> >>> Don't you need some CFN_LEN_EXTRACT_LAST instead?
> >
> > I think CFN_EXTRACT_LAST always has either loop mask or loop len.
> >
> > When both mask and length are not needed, IMHO, I think the current
> > BIT_FIELD_REF flow is good enough:
> > https://godbolt.org/z/Yr5M9hcc6
>
> I'm a bit behind of email, but why isn't BIT_FIELD_REF enough for
> the case that the patch is handling?  It seems that:
>
>   .EXTRACT_LAST (len, vec)
>
> is equivalent to:
>
>   vec[len - 1]
>
> I think eventually there'll be the temptation to lower/fold it like that.
>
> FWIW, I agree an IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN would be OK,
> with a mask, vector, length and bias.  But even then, I think there'll
> be a temptation to lower calls with all-1 masks to val[len - 1 - bias].
> So I think the function only makes sense if we have a use case where
> the mask might not be all-1s.
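For reference, the proposed VEC_EXTRACT epilogue above can be modeled in scalar C; the name and the bias handling below are illustrative, not GCC internals:

```c
#include <stdint.h>

/* Scalar model of the proposed epilogue:
     scalar_res = VEC_EXTRACT (v, new_loop_len_22 - 1 - BIAS)
   new_loop_len is the final .SELECT_VL result, made available outside
   the loop by an LCSSA PHI; bias is the target's len_load bias, which
   is either 0 or -1.  Illustrative only.  */
static inline uint8_t
vec_extract_model (const uint8_t *v, int new_loop_len, int bias)
{
  return v[new_loop_len - 1 - bias];
}
```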
>
> Thanks,
> Richard
>

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)