On Wed, 12 Apr 2023, ??? wrote:

> >> It's not so much that we need to do that.  But normally it's only worth
> >> adding internal functions if they do something that is too complicated
> >> to express in simple gimple arithmetic.  The UQDEC case I mentioned:
> 
> >>    z = MAX (x, y) - y
> 
> >> fell into the "simple arithmetic" category for me.  We could have added
> >> an ifn for unsigned saturating decrement, but it didn't seem complicated
> >> enough to merit its own ifn.
> 
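For readers following along, the UQDEC open-coding quoted above can be written out in plain C. This is only a minimal sketch: the helper names max_u32 and sat_dec_u32 are made up for illustration and are not GCC functions.

```c
#include <stdint.h>

/* MAX written as plain arithmetic, mirroring what MAX_EXPR computes. */
static inline uint32_t max_u32 (uint32_t x, uint32_t y)
{
  return x > y ? x : y;
}

/* Unsigned saturating decrement, z = MAX (x, y) - y:
   yields x - y when x > y, and 0 otherwise, so it never wraps around.  */
static inline uint32_t sat_dec_u32 (uint32_t x, uint32_t y)
{
  return max_u32 (x, y) - y;
}
```

When x <= y the MAX picks y and the subtraction gives 0, which is why no dedicated ifn was needed for this case.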
> Ah, I understand your concern. I should admit that WHILE_LEN is a simple
> arithmetic operation
> which just takes the result of
> 
> min (remain,vf).
> 
> A possible solution is to just use MIN_EXPR (remain,vf),
> then add special handling in the umin_optab pattern to recognize "vf" in the
> backend,
> and finally generate vsetvl in the RISC-V backend.
> 
> The "vf" should be recognizable because that operand of umin would be a
> const_int/const_poly_int operand.
> Otherwise, just generate a scalar umin instruction.
> 
> However, there is a case where I can't tell whether umin should generate
> vsetvl or umin. It is the following case:
> int32_t foo (int32_t a)
> {
>   return min (a, 4);
> }
> 
> In this case I should generate:
> li a1,4
> umin a1,a0,a1
> 
> instead of generating vsetvl
> 
> However, in this case:
> 
> void foo (int32_t *a...)
> for (int i = 0; i < n; i++)
>   a[i] = b[i] + c[i];
> 
> with -mriscv-vector-bits=128 (which means each vector can handle 4 INT32),
> the VF will be 4 too. If we also use MIN_EXPR instead of WHILE_LEN:
> 
> ...
> len = MIN_EXPR (n,4)
> v = len_load (len)
> ....
> ...
> 
> In this case, MIN_EXPR should emit vsetvl.
> 
> It's hard for me to tell the difference between these 2 cases...

But the issue is the same in the reverse with WHILE_LEN, no?
WHILE_LEN just computes a scalar value - you seem to suggest
there's a hidden side-effect of "coalescing" the result with
a hardware vector length register?  I don't think that's good design.

IMHO tying the scalar result with the uses has to be done where
you emit the other vsetvl instructions.
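To make the "just a scalar value" point concrete, here is a rough scalar C model of the length-controlled loop from the quoted example, with len computed as a plain MIN of the remaining count and the VF. It is a sketch under assumptions: add_with_len and min_size are hypothetical names, the inner loop merely stands in for len_load/add/len_store, and nothing here models vsetvl itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline size_t min_size (size_t a, size_t b)
{
  return a < b ? a : b;
}

/* Scalar model of a[i] = b[i] + c[i] processed in chunks of at most VF
   elements.  Returns the number of chunk iterations executed.  */
static size_t add_with_len (int32_t *a, const int32_t *b, const int32_t *c,
                            size_t n, size_t vf)
{
  assert (vf > 0);
  size_t iters = 0;
  size_t remain = n;          /* decrementing IV: elements still to do */

  while (remain > 0)
    {
      /* The only thing computed here is a scalar: MIN (remain, vf).  */
      size_t len = min_size (remain, vf);

      /* Stand-in for one length-controlled vector load/add/store.  */
      for (size_t i = 0; i < len; i++)
        a[i] = b[i] + c[i];

      a += len;
      b += len;
      c += len;
      remain -= len;          /* variable step: the result of the MIN */
      iters++;
    }
  return iters;
}
```

With n = 10 and vf = 4 the model runs three chunks of 4, 4 and 2 elements. Whether a given MIN ends up feeding a vector length is visible only from its uses, not from the MIN itself.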

One convenient thing we have with WHILE_LEN is that it is a key
for the vectorizer to query target capabilities (and preferences).
But of course collecting whether stmts can be vectorized
with length and/or with mask would be better.

Richard.

> CC RISC-V port backend maintainer: Kito.
> 
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Sandiford
> Date: 2023-04-12 20:24
> To: juzhe.zhong@rivai.ai
> CC: rguenther; gcc-patches; jeffreyalaw; rdapp; linkw
> Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for 
> auto-vectorization
> "juzhe.zh...@rivai.ai" <juzhe.zh...@rivai.ai> writes:
> >>> I think that already works for them (could be misremembering).
> >>> However, IIUC, they have no special instruction to calculate the
> >>> length (unlike for RVV), and so it's open-coded using vect_get_len.
> >
> > Yeah, the current flow using min, sub, and then min in vect_get_len
> > is working for IBM. But I wonder whether switching the current
> > length-loop-control flow to the WHILE_LEN pattern in this patch can improve
> > their performance.
> >
> >>> (1) How easy would it be to express WHILE_LEN in normal gimple?
> >>>     I haven't thought about this at all, so the answer might be
> >>>     "very hard".  But it reminds me a little of UQDEC on AArch64,
> >>>     which we open-code using MAX_EXPR and MINUS_EXPR (see
> >>>     vect_set_loop_controls_directly).
> >
> >>>     I'm not saying WHILE_LEN is the same operation, just that it seems
> >>>     like it might be open-codeable in a similar way.
> >
> >>>     Even if we can open-code it, we'd still need some way for the
> >>>     target to select the "RVV way" from the "s390/PowerPC way".
> >
> > WHILE_LEN, as I define it in the doc, is
> >   operand0 = MIN (operand1, operand2)
> > operand1 is the residual number of scalar elements that need to be updated.
> > operand2 is the vectorization factor (vf) for a single rgroup;
> > for multiple rgroups, operand2 = vf * nitems_per_ctrl.
> > Do you mean that such a pattern is not well expressed, so we need to
> > replace it with normal tree code (MIN or MAX) and let the RISC-V backend
> > optimize it into vsetvl? Sorry, maybe I am not on the same page.
>  
> It's not so much that we need to do that.  But normally it's only worth
> adding internal functions if they do something that is too complicated
> to express in simple gimple arithmetic.  The UQDEC case I mentioned:
>  
>    z = MAX (x, y) - y
>  
> fell into the "simple arithmetic" category for me.  We could have added
> an ifn for unsigned saturating decrement, but it didn't seem complicated
> enough to merit its own ifn.
>  
> >>> (2) What effect does using a variable IV step (the result of
> >>> the WHILE_LEN) have on ivopts?  I remember experimenting with
> >>> something similar once (can't remember the context) and not
> >>> having a constant step prevented ivopts from making good
> >>> addressing-mode choices.
> >
> > Thank you so much for pointing this out. Currently, a variable IV step and
> > decreasing n down to 0
> > work fine for RISC-V downstream GCC and we didn't find issues related to
> > addressing-mode selection.
>  
> OK, that's good.  Sounds like it isn't a problem then.
>  
> > I think I must have missed something; would you mind giving me some hints
> > so that I can study ivopts
> > to find out which cases may generate inferior codegen for a variable IV step?
>  
> I think AArch64 was sensitive to this because (a) the vectoriser creates
> separate IVs for each base address and (b) for SVE, we instead want
> invariant base addresses that are indexed by the loop control IV.
> Like Richard says, if the loop control IV isn't a SCEV, ivopts isn't
> able to use it and so (b) fails.
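As a source-level illustration of the two IV shapes described in (a) and (b), the following C sketch contrasts them. It only shows the shapes; it is not what the vectoriser or ivopts actually emit, and the function names add_ptr_ivs and add_indexed are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Shape (a): a separate IV for each base address, all bumped per
   iteration.  */
static void add_ptr_ivs (int32_t *a, const int32_t *b, const int32_t *c,
                         size_t n)
{
  for (const int32_t *end = b + n; b != end; a++, b++, c++)
    *a = *b + *c;
}

/* Shape (b): the base addresses stay invariant and every access is
   indexed by the single loop control IV.  */
static void add_indexed (int32_t *a, const int32_t *b, const int32_t *c,
                         size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}
```

Both functions compute the same result; the difference is only in how many induction variables are live and whether the accesses can use a base-plus-index addressing mode.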
>  
> Thanks,
> Richard
>  
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
