Re: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision adjustment

Richard Biener via Gcc-patches Thu, 02 Mar 2023 00:25:26 -0800

On Thu, 2 Mar 2023, juzhe.zh...@rivai.ai wrote:

> >> Does the eventual value set by ADJUST_BYTESIZE equal the real number of
> >> bytes loaded by vlm.v and stored by vsm.v (after the appropriate vsetvl)?
> >> Or is the GCC size larger in some cases than the number of bytes
> >> loaded and stored?
> For VNx1BI,VNx2BI,VNx4BI,VNx8BI, we allocate the larger size of memory or 
> stack for register spilling
> according to ADJUST_BYTESIZE. 
> After appropriate vsetvl, VNx1BI is loaded/stored 1/8 of ADJUST_BYTESIZE 
> (vsetvl e8mf8).
> After appropriate vsetvl, VNx2BI is loaded/stored 2/8 of ADJUST_BYTESIZE 
> (vsetvl e8mf4).
> After appropriate vsetvl, VNx4BI is loaded/stored 4/8 of ADJUST_BYTESIZE 
> (vsetvl e8mf2).
> After appropriate vsetvl, VNx8BI is loaded/stored 8/8 of ADJUST_BYTESIZE 
> (vsetvl e8m1).
> 
> Note: except for these 4 machine modes, for all other RVV machine modes 
> ADJUST_BYTESIZE
> is equal to the real number of bytes of the load/store instruction that the 
> RVV ISA defines.
> 
> Well, as I said, it's fine that we allocate larger memory for 
> VNx1BI,VNx2BI,VNx4BI; 
> we can emit the appropriate vsetvl to guarantee correctness in the RISC-V 
> backend according 
> to the machine_mode, as long as GCC doesn't do the incorrect elimination 
> in the middle-end.
> 
> Besides, poly (1,1) is 1/8 of the machine vector-length, which is already a 
> really small number,
> and it is the real number of bytes loaded/stored for VNx8BI.
> You can say VNx1BI, VNx2BI, VNx4BI are consuming more memory than we 
> actually load/store with the appropriate vsetvl,
> since they have the same ADJUST_BYTESIZE as VNx8BI. However, I think it's 
> totally fine so far, as long as we can
> guarantee correctness, and I think optimizing such memory storage consumption 
> is trivial.
> 
> >> And does it equal the size of the corresponding LLVM machine type?
> 
> Well, for some reason, in the case of register spilling, LLVM consumes much 
> more memory than GCC.
> They always do a whole register load/store (a single vector register 
> vector-length) for register spilling.
> That's another story (I am not going to talk too much about this since it's 
> quite an ugly implementation). 
> They don't model the types accurately according to the RVV ISA for register 
> spilling.
> 
> In the case of a normal load/store like:
> vbool8_t v2 = *(vbool8_t*)in;  *(vbool8_t*)(out + 100) = v2;
> the load/store instructions they generate are 
> accurate.
> Even though their instructions are accurate for load/store access 
> behavior, I am not sure whether the size 
> of their machine type is accurate.
> 
> For example, in the IR representation: VNx1BI of GCC is represented as 
> vscale x 1 x i1
>           VNx2BI of GCC is represented as vscale x 2 x i1
> in LLVM IR.
> I am not sure about the bytesize of vscale x 1 x i1 and vscale x 2 x i1.
> I didn't take a deep look at it.
> 
> I think this question is not that important, no matter whether VNx1BI and 
> VNx2BI are modeled accurately in case of ADJUST_BYTESIZE
> in GCC or  vscale x 1 x i1 and vscale x 2 x i1 are modeled accurately in case 
> of  their bytesize,
> I think as long as we can emit appropriate vsetvl + vlm/vsm, it's totally 
> fine for RVV  even though in some case, their memory allocation
> is not accurate in compiler.
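To make the fractions in the quoted reply concrete, here is a minimal sketch
in C. It assumes VLEN is a multiple of 64 bits (so poly (1,1) is VLEN/64
bytes) and relies on the RVV rule that vlm.v/vsm.v transfer ceil(vl/8) bytes.
The function names are hypothetical and are not GCC code.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes reserved for any of VNx1BI..VNx8BI: poly (1,1), i.e. one byte
   per 64-bit chunk of the vector register.
   Assumption: VLEN is a multiple of 64 bits.  */
static size_t allocated_bytes(unsigned vlen_bits)
{
    assert(vlen_bits >= 64 && vlen_bits % 64 == 0);
    return vlen_bits / 64;
}

/* Bytes transferred by vlm.v/vsm.v for VNx<n>BI after the matching
   vsetvl: vl = n mask bits per 64-bit chunk, and the instruction
   moves ceil(vl / 8) bytes.  */
static size_t accessed_bytes(unsigned vlen_bits, unsigned n)
{
    assert(vlen_bits >= 64 && vlen_bits % 64 == 0);
    assert(n == 1 || n == 2 || n == 4 || n == 8);
    size_t vl = (size_t)n * (vlen_bits / 64);
    return (vl + 7) / 8;
}
```

With VLEN = 512 the allocation is 8 bytes and the four modes touch 1, 2, 4
and 8 bytes, matching the 1/8 .. 8/8 figures above. For smaller VLEN the
round-up to whole bytes makes the touched fraction larger (for example, 1 of
2 bytes for VNx1BI at VLEN = 128).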


I'm not sure how it works for variable-length types but isn't
sizeof (vbool8_t) part of the ABI and thus its TYPE_SIZE / GET_MODE_SIZE
are relevant there?  It might of course be that you can never have
these types as part of aggregates, arrays or objects of them address-taken
in which case the issue is moot?

Richard.

> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Sandiford
> Date: 2023-03-02 00:14
> To: Li, Pan2
> CC: juzhe.zhong@rivai.ai; rguenther; gcc-patches; Pan Li; kito.cheng
> Subject: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision adjustment
> "Li, Pan2" <pan2...@intel.com> writes:
> > Thanks all for so much valuable and helpful materials.
> >
> > As I understand (Please help to correct me if any mistake.), for the VNx*BI 
> > (aka, 1, 2, 4, 8, 16, 32, 64),
> > the precision and mode size need to be adjusted as below.
> >
> > Precision size [1, 2, 4, 8, 16, 32, 64]
> > Mode size [1, 1, 1, 1, 2, 4, 8]
> >
> > Given that, if we ignore the self-test failure, only the adjust_precision 
> > part is able to fix the bug I mentioned.
> > genmodes will first get the precision, and then leverage mode_size = 
> > exact_div (precision, BITS_PER_UNIT) to generate the mode size.
> > Meanwhile, it also provides adjust_mode_size after the mode_size 
> > generation.
> >
> > The riscv part already has the mode_size adjustment, and the value of 
> > mode_size will be overridden by the adjustments.
>  
> Ah, OK!  In that case, would the following help:
>  
> Turn:
>  
>   mode_size[E_%smode] = exact_div (mode_precision[E_%smode], BITS_PER_UNIT);
>  
> into:
>  
>   if (!multiple_p (mode_precision[E_%smode], BITS_PER_UNIT,
>    &mode_size[E_%smode]))
>     mode_size[E_%smode] = -1;
>  
> where -1 is an "obviously wrong" value.
>  
> Ports that might hit the -1 are then responsible for setting the size
> later, via ADJUST_BYTESIZE.
>  
> After all the adjustments are complete, genmodes asserts that no size is
> known_eq to -1.
>  
> That way, target-independent code doesn't need to guess what the
> correct behaviour is.
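A minimal sketch of the scheme described above, using plain ints instead of
poly_ints. The helper names and the "one byte for sub-byte modes" adjustment
are assumptions made for illustration, not the actual genmodes or RISC-V port
code.

```c
#include <assert.h>

#define BITS_PER_UNIT 8
#define SIZE_UNSET (-1)  /* the "obviously wrong" placeholder */

/* Initial size: precision / 8 when that is exact, else the placeholder.  */
static int initial_mode_size(int precision_bits)
{
    if (precision_bits % BITS_PER_UNIT != 0)
        return SIZE_UNSET;
    return precision_bits / BITS_PER_UNIT;
}

/* Hypothetical stand-in for a port's ADJUST_BYTESIZE: modes whose
   precision is not a whole number of bytes get one byte.  */
static int adjust_bytesize(int size)
{
    return size == SIZE_UNSET ? 1 : size;
}

/* After all adjustments, no mode may still carry the placeholder.  */
static int final_mode_size(int precision_bits)
{
    int size = adjust_bytesize(initial_mode_size(precision_bits));
    assert(size != SIZE_UNSET);
    return size;
}
```

Feeding in the precisions 1, 2, 4, 8, 16, 32, 64 yields the sizes
1, 1, 1, 1, 2, 4, 8 listed in Pan's table, without target-independent code
having to guess a size for the sub-byte cases.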
>  
> Does the eventual value set by ADJUST_BYTESIZE equal the real number of
> bytes loaded by vlm.v and stored by vsm.v (after the appropriate vsetvl)?
> And does it equal the size of the corresponding LLVM machine type?
> Or is the GCC size larger in some cases than the number of bytes
> loaded and stored?
>  
> (You and Juzhe have probably answered that question before, sorry,
> but I'm still not 100% sure of the answer.  Personally, I think I would
> find the ISA behaviour easier to understand if the explanation doesn't
> involve poly_ints.  It would be good to understand things "as the
> architecture sees them" rather than in terms of GCC concepts.)
>  
> Thanks,
> Richard
>  
> > Unfortunately, the early stage mode_size generation leveraged exact_div, 
> > which doesn't honor precision size < 8
> > with the adjustment and fails on exact_div assertions.
> >
> > Besides the precision adjustment, I am not sure if we can narrow down the 
> > problem to:
> >
> >
> >   1.  Define the real size of both the precision and mode size to align 
> > with the riscv ISA.
> >   2.  Besides, make the general mode_size = precision_size / 8 able to 
> > take care of both the exact_div and the dividend less than the divisor 
> > (like 1/8 or 2/8) cases.
> >
> > Could you please share your professional suggestions about this? Thank you 
> > all again and have a nice day!
> >
> > Pan
> >
> > From: juzhe.zh...@rivai.ai <juzhe.zh...@rivai.ai>
> > Sent: Wednesday, March 1, 2023 10:19 PM
> > To: rguenther <rguent...@suse.de>
> > Cc: richard.sandiford <richard.sandif...@arm.com>; gcc-patches 
> > <gcc-patches@gcc.gnu.org>; Pan Li <incarnation.p....@outlook.com>; Li, Pan2 
> > <pan2...@intel.com>; kito.cheng <kito.ch...@sifive.com>
> > Subject: Re: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision 
> > adjustment
> >
> >>> So given the above I think that modeling the size as being the same
> >>> but with accurate precision would work.  It's then only the size of the
> >>> padding in bytes we cannot represent with poly-int which should be fine.
> >
> >>> Correct?
> > Yes.
> >
> >>> Btw, is storing a VNx1BI and then loading a VNx2BI from the same
> >>> memory address well-defined?  That is, how is the padding handled
> >>> by the machine load/store instructions?
> >
> > Storing VNx1BI stores the data from addr 0 ~ 1/8 poly (1,1) and keeps 
> > the addr 1/8 poly (1,1) ~ 2/8 poly (1,1) memory data unchanged.
> > Loading VNx2BI will load 0 ~ 2/8 poly (1,1); note that 0 ~ 1/8 poly (1,1) is 
> > the data that we stored above, and 1/8 poly (1,1) ~ 2/8 poly (1,1) is the 
> > original memory data.
> > You can see here for this case (LLVM):
> > https://godbolt.org/z/P9e1adrd3
> > foo:                                    # @foo
> >         vsetvli a2, zero, e8, mf8, ta, ma
> >         vsm.v   v0, (a0)
> >         vsetvli a2, zero, e8, mf4, ta, ma
> >         vlm.v   v8, (a0)
> >         vsm.v   v8, (a1)
> >         ret
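The store-narrow / load-wide behaviour in the listing above can be modelled
with a toy sketch in C. It assumes VLEN = 512, so the VNx1BI store covers 1
byte and the VNx2BI load covers 2 bytes. toy_vsm and toy_vlm are hypothetical
helpers that only copy bytes; they are not real intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of vsm.v: write the first nbytes of a mask register to
   memory and leave every later byte of memory untouched.  */
static void toy_vsm(unsigned char *mem, const unsigned char *reg,
                    size_t nbytes)
{
    memcpy(mem, reg, nbytes);
}

/* Toy model of vlm.v: read the first nbytes of memory into a mask
   register.  */
static void toy_vlm(unsigned char *reg, const unsigned char *mem,
                    size_t nbytes)
{
    memcpy(reg, mem, nbytes);
}

/* Store a VNx1BI value, then reload the same address as VNx2BI.
   Assumption: VLEN = 512, so the accesses cover 1 and 2 bytes.  */
static void store_narrow_load_wide(unsigned char mem[2],
                                   unsigned char narrow_value,
                                   unsigned char wide_out[2])
{
    assert(mem != NULL && wide_out != NULL);
    toy_vsm(mem, &narrow_value, 1);  /* vsetvli e8, mf8 + vsm.v */
    toy_vlm(wide_out, mem, 2);       /* vsetvli e8, mf4 + vlm.v */
}
```

The first loaded byte is the value just stored; the second is whatever the
memory held before the store, which is the "original memory data" in the
description above.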
> >
> > We can also do this in GCC as long as we can differentiate VNx1BI 
> > and VNx2BI, and GCC does not eliminate statements, according to precision, 
> > even though
> > they have the same bytesize.
> >
> > First we emit vsetvl e8mf8 + vsm for VNx1BI
> > Then we emit vsetvl e8mf4 + vlm for VNx2BI
> >
> > Thanks.
> > ________________________________
> > juzhe.zh...@rivai.ai<mailto:juzhe.zh...@rivai.ai>
> >
> > From: Richard Biener<mailto:rguent...@suse.de>
> > Date: 2023-03-01 22:03
> > To: juzhe.zhong<mailto:juzhe.zh...@rivai.ai>
> > CC: richard.sandiford<mailto:richard.sandif...@arm.com>; 
> > gcc-patches<mailto:gcc-patches@gcc.gnu.org>; Pan 
> > Li<mailto:incarnation.p....@outlook.com>; 
> > pan2.li<mailto:pan2...@intel.com>; kito.cheng<mailto:kito.ch...@sifive.com>
> > Subject: Re: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision 
> > adjustment
> > On Wed, 1 Mar 2023, Richard Biener wrote:
> >
> >> On Wed, 1 Mar 2023, juzhe.zh...@rivai.ai<mailto:juzhe.zh...@rivai.ai> 
> >> wrote:
> >>
> >> > Let me first introduce RVV load/store basics and stack allocation.
> >> > For scalable vector memory allocation, we allocate memory according to 
> >> > machine vector-length.
> >> > To get this CPU vector-length value (runtime invariant but compile time 
> >> > unknown), we have an instruction called csrr vlenb.
> >> > For example, csrr a5,vlenb stores the vector-length of a single CPU 
> >> > register (described as bytesize) in the a5 register.
> >> > A single register size in bytes (GET_MODE_SIZE) is poly value (8,8) 
> >> > bytes. That means after csrr a5,vlenb, a5 has the value of size poly (8,8) 
> >> > bytes.
> >> >
> >> > Now, our problem is that VNx1BI, VNx2BI, VNx4BI, VNx8BI have the same 
> >> > bytesize poly (1,1). So their storage consumes the same size.
> >> > Meaning when we want to allocate memory storage or stack for register 
> >> > spilling, we should first csrr a5, vlenb, then srli a5,a5,3 (means a5 = 
> >> > a5/8)
> >> > Then, a5 has the bytesize value of poly (1,1). All VNx1BI, VNx2BI, 
> >> > VNx4BI, VNx8BI are doing the same process as I described above. They all 
> >> > consume
> >> > the same memory storage size since we can't model them accurately 
> >> > according to precision or bitsize.
> >> >
> >> > They consume the same storage (I agree it's better to model them more 
> >> > accurately in terms of memory storage consumption).
> >> >
> >> > Well, even though they consume the same size of memory storage, I can 
> >> > make their memory access behavior (load/store) accurate by
> >> > emitting the accurate RVV instructions for them according to the RVV ISA.
> >> >
> >> > VNx1BI,VNx2BI, VNx4BI, VNx8BI are consuming same memory storage with 
> >> > size  poly (1,1)
> >> > The instructions for these modes are as follows:
> >> > VNx1BI: vsetvl e8mf8 + vlm,  loading 1/8 of poly (1,1) storage.
> >> > VNx2BI: vsetvl e8mf4 + vlm,  loading 1/4 of poly (1,1) storage.
> >> > VNx4BI: vsetvl e8mf2 + vlm,  loading 1/2 of poly (1,1) storage.
> >> > VNx8BI: vsetvl e8m1 + vlm,  loading 1 of poly (1,1) storage.
> >> >
> >> > So based on these, it's fine that we don't model VNx1BI,VNx2BI, VNx4BI, 
> >> > VNx8BI accurately according to precision or bitsize.
> >> > This implementation is fine even though their memory storage is not 
> >> > accurate.
> >> >
> >> > However, the problem is that since they have the same bytesize, GCC will 
> >> > think they are the same and do some incorrect statement elimination:
> >> >
> >> > (Note: Load same memory base)
> >> > load v0 VNx1BI from base0
> >> > load v1 VNx2BI from base0
> >> > load v2 VNx4BI from base0
> >> > load v3 VNx8BI from base0
> >> >
> >> > store v0 base1
> >> > store v1 base2
> >> > store v2 base3
> >> > store v3 base4
> >> >
> >> > For this program sequence, GCC will eliminate the last 3 load 
> >> > instructions.
> >> >
> >> > Then it will become:
> >> >
> >> > load v0 VNx1BI from base0 ===> vsetvl e8mf8 + vlm (only load 1/8 of poly 
> >> > size (1,1) memory data)
> >> >
> >> > store v0 base1
> >> > store v0 base2
> >> > store v0 base3
> >> > store v0 base4
> >> >
> >> > This is what we want to fix. I think it is enough to have a way to 
> >> > differentiate VNx1BI,VNx2BI, VNx4BI, VNx8BI
> >> > so that GCC will not do the incorrect elimination for RVV.
> >> >
> >> > I think it can work fine  even though these 4 modes consume inaccurate 
> >> > memory storage size
> >> > but accurate data memory access load store behavior.
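Why the precision matters for the elimination described above can be shown
with a toy model in C. This is not GCC's actual redundancy logic; the struct
and both predicates are hypothetical and only contrast a size-only comparison
with one that also looks at precision. Sizes and precisions are given per
64-bit chunk of the vector register.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy description of a mask mode, per 64-bit chunk of the register.  */
struct toy_mode {
    const char *name;
    int bytes;      /* storage size: the same for all four modes */
    int precision;  /* number of meaningful mask bits */
};

static const struct toy_mode toy_modes[4] = {
    { "VNx1BI", 1, 1 },
    { "VNx2BI", 1, 2 },
    { "VNx4BI", 1, 4 },
    { "VNx8BI", 1, 8 },
};

/* Size-only comparison: cannot tell the four modes apart, so loads
   from the same address look redundant.  */
static bool same_by_size(const struct toy_mode *a, const struct toy_mode *b)
{
    assert(a != 0 && b != 0);
    return a->bytes == b->bytes;
}

/* Comparison that also checks precision: keeps the modes distinct.  */
static bool same_by_size_and_precision(const struct toy_mode *a,
                                       const struct toy_mode *b)
{
    assert(a != 0 && b != 0);
    return a->bytes == b->bytes && a->precision == b->precision;
}
```

Under the size-only test a VNx1BI load and a VNx2BI load from base0 compare
equal, which mirrors the incorrect elimination; once precision is part of the
comparison they no longer do.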
> >>
> >> So given the above I think that modeling the size as being the same
> >> but with accurate precision would work.  It's then only the size of the
> >> padding in bytes we cannot represent with poly-int which should be fine.
> >>
> >> Correct?
> >
> > Btw, is storing a VNx1BI and then loading a VNx2BI from the same
> > memory address well-defined?  That is, how is the padding handled
> > by the machine load/store instructions?
> >
> > Richard.
>  
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
